Zero-Shot Cross-Domain Code Search without Fine-Tuning

K.M. Liang; Zhongxin Liu; Chao Liu; Zhiyuan Wan; David Lo; Xiaohu Yang

doi:10.1145/3729357

JOURNAL ARTICLE

Year: 2025
Journal: Proceedings of the ACM on Software Engineering
Volume: 2 (FSE)
Pages: 1937-1959
Publisher: Association for Computing Machinery

Abstract

Code search is a crucial task in software engineering, aiming to retrieve code snippets that are semantically relevant to a natural language query. Recently, Pre-trained Language Models (PLMs) have shown remarkable success and are widely adopted for code search tasks. However, PLM-based methods often struggle in cross-domain scenarios. When applied to a new domain, they typically require extensive fine-tuning with substantial data. Even worse, the data scarcity problem in new domains often forces these methods to operate in a zero-shot setting, resulting in a significant decline in performance.

RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain a specialized model for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search.

The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. We first conduct an empirical study to investigate the effectiveness of these two matching schemas in zero-shot cross-domain code search. Our findings highlight the strong complementarity among the three matching schemas, i.e., query-code, query-comment, and code-code matching.

Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge first employs zero-shot prompting to guide Large Language Models (LLMs) to generate a comment for each code snippet in the codebase and a code snippet for each query. Subsequently, it encodes queries, code snippets, comments, and the generated code using PLMs and assesses similarities through three matching schemas: query-code, query-comment, and generated code-code. Lastly, CodeBridge leverages a sampling-based fusion approach that combines these three similarity scores to rank the final search outcomes.

Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning.
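The three matching schemas and the MRR metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes that embeddings for the query, the LLM-generated code, each candidate snippet, and each generated comment are already available as plain lists of floats. The abstract does not specify how the sampling-based fusion works, so a fixed weighted sum stands in for it here; the function names (`fused_scores`, `rank`, `mean_reciprocal_rank`) and the equal weights are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_scores(query_vec, gen_code_vec, code_vecs, comment_vecs,
                 weights=(1 / 3, 1 / 3, 1 / 3)):
    """Score every candidate snippet with three matching schemas.

    ASSUMPTION: a fixed weighted sum replaces the paper's sampling-based
    fusion, whose details are not given in the abstract.
    """
    w_query_code, w_query_comment, w_code_code = weights
    scores = []
    for code_vec, comment_vec in zip(code_vecs, comment_vecs):
        s_query_code = cosine(query_vec, code_vec)        # query-code
        s_query_comment = cosine(query_vec, comment_vec)  # query-comment
        s_code_code = cosine(gen_code_vec, code_vec)      # generated code-code
        scores.append(w_query_code * s_query_code
                      + w_query_comment * s_query_comment
                      + w_code_code * s_code_code)
    return scores


def rank(scores):
    """Return candidate indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(rankings, gold_indices):
    """MRR over several queries; a missing gold item contributes 0."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_indices):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(rankings)


# Toy 2-dimensional embeddings (made up for illustration only).
query_vec = [1.0, 0.0]
gen_code_vec = [0.0, 1.0]
code_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
comment_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranking = rank(fused_scores(query_vec, gen_code_vec, code_vecs, comment_vecs))
print(ranking)  # [2, 0, 1]
```

In this toy case the third candidate ranks first because it is moderately similar under all three schemas, while the first candidate matches the query strongly but not the generated code. That is the kind of complementarity the abstract attributes to combining the schemas.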

Keywords: Computer science; Code (set theory); Matching (statistics); Domain (mathematical analysis); Source code; Data mining; Information retrieval; Theoretical computer science; Programming language; Mathematics

Metrics

FWCI (Field Weighted Citation Impact): 4.82
Citation Normalized Percentile: 0.93

Topics

Natural Language Processing Techniques (Physical Sciences → Computer Science → Artificial Intelligence)
Topic Modeling (Physical Sciences → Computer Science → Artificial Intelligence)
Software Engineering Research (Physical Sciences → Computer Science → Information Systems)
