Retrieval-Augmented Generation (RAG) was originally developed for document question-answering, but its application to code is arguably more powerful. A code RAG system dynamically retrieves the most relevant code chunks for any given query, injecting precise, task-specific context while keeping token costs minimal. Where static context files provide persistent background knowledge, RAG provides dynamic, query-responsive foreground knowledge.
This topic covers code-specific RAG architectures, embedding strategies optimized for code, chunking approaches that respect code structure, retrieval ranking and reranking, and complete implementation patterns using LlamaIndex, LangChain, and custom pipelines.
## Why Code RAG Is Different from Document RAG
General RAG systems treat documents as bags of chunks, embed them semantically, and retrieve by cosine similarity. This works well for natural language documents but fails to capture the key relationships in code:
- **Structural dependencies:** A function call is meaningless without the function definition. Document RAG might retrieve the call site but miss the definition if they are in different files and not textually similar.
- **Import chains:** Retrieved code that references imported symbols is incomplete without those symbol definitions. Pure semantic retrieval does not follow import edges.
- **Type definitions:** Strongly typed code (TypeScript, Java, C#) cannot be correctly understood without the types it references. Embedding-based retrieval may not score type definition files highly enough.
- **Naming conventions over semantics:** Two functions named `processOrder` and `handleOrder` might be more related than two functions with similar natural-language descriptions doing completely different things. Code RAG must weight identifiers more heavily than general document RAG.
- **Cross-file coherence:** Unlike paragraphs in a document, code chunks are often meaningless in isolation. A method body without its class declaration, a module without its imports — these confuse models rather than helping them.
Effective code RAG addresses all of these concerns through specialized chunking, hybrid retrieval, and post-retrieval enrichment.
Tip: Before implementing a full code RAG system, test whether simple keyword (BM25) search outperforms pure embedding search for your specific use cases. For code, which has consistent, precise naming, keyword search often rivals or beats embedding search at a fraction of the infrastructure cost. Always establish this baseline.
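To make that baseline concrete, here is a minimal, dependency-free BM25 sketch (no stemming or stop-word handling; the sample functions below are invented) that you can point at your own chunks before standing up any vector infrastructure:

```python
import math
import re
from collections import Counter

def tokenize(code: str) -> list[str]:
    """Split code into lowercase tokens, also indexing camelCase/snake_case parts."""
    tokens = []
    for word in re.findall(r'[A-Za-z_]\w*', code):
        tokens.append(word.lower())
        # index identifier components so getUserByEmail matches get_user_by_email
        parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', word)
        tokens.extend(p.lower() for p in parts if p.lower() != word.lower())
    return tokens

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with standard BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in doc_tokens) / len(doc_tokens)
    n = len(docs)
    df = Counter()  # document frequency per term
    for toks in doc_tokens:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "def get_user_by_email(email):\n    return db.find(email=email)",
    "def delete_order(order_id):\n    db.remove(order_id)",
]
scores = bm25_scores("getUserByEmail", docs)
```

A camelCase query like `getUserByEmail` scores the exact-named function far above unrelated code, which is precisely the precision that pure embedding retrieval can miss.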
## Code Chunking Strategies That Preserve Meaning
The first decision in any RAG pipeline is how to chunk your corpus. For code, chunk boundaries should respect semantic units — the same units a developer would think in.
**Function-level chunking (the baseline):**
Each chunk is one complete function or method, including its signature, docstring, and body. This is the most natural unit for code and produces the most interpretable retrievals.
```python
import ast
from pathlib import Path
from dataclasses import dataclass


@dataclass
class CodeChunk:
    content: str           # the actual code
    file_path: str
    symbol_name: str       # function/class/method name
    symbol_type: str       # 'function', 'class', 'method'
    start_line: int
    end_line: int
    docstring: str | None
    imports: list[str]     # imports visible at module level


def chunk_python_file(filepath: str) -> list[CodeChunk]:
    """Split a Python file into function/class-level chunks."""
    source = Path(filepath).read_text()
    tree = ast.parse(source)
    lines = source.split('\n')
    chunks = []

    # Extract module-level imports
    module_imports = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if node.col_offset == 0:  # only top-level imports
                module_imports.append(ast.get_source_segment(source, node))

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk_source = '\n'.join(lines[node.lineno - 1:node.end_lineno])
            docstring = ast.get_docstring(node) or None
            chunks.append(CodeChunk(
                content=chunk_source,
                file_path=filepath,
                symbol_name=node.name,
                symbol_type='class' if isinstance(node, ast.ClassDef) else 'function',
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=docstring,
                imports=module_imports,
            ))
    return chunks
```
```python
import re


def chunk_typescript_file(filepath: str) -> list[CodeChunk]:
    """
    Split a TypeScript file into semantic chunks.
    Uses regex-based extraction (production should use tree-sitter).
    """
    source = Path(filepath).read_text()
    chunks = []

    # Patterns for function declarations, classes, and interfaces
    patterns = [
        # Exported function declarations
        r'(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*\([^)]*\)[^{]*\{',
        # Class declarations
        r'(?:export\s+)?(?:abstract\s+)?class\s+(\w+)',
        # Interface declarations
        r'(?:export\s+)?interface\s+(\w+)',
    ]

    # For a production implementation, use tree-sitter-typescript;
    # this simplified version demonstrates the structure.
    lines = source.split('\n')
    for i, line in enumerate(lines):
        for pattern in patterns:
            match = re.match(pattern, line.strip())
            if match:
                symbol_name = match.group(1)
                # Find the end of this block by brace counting
                end_line = find_block_end(lines, i)
                chunk_content = '\n'.join(lines[i:end_line + 1])
                chunks.append(CodeChunk(
                    content=chunk_content,
                    file_path=filepath,
                    symbol_name=symbol_name,
                    symbol_type='class' if 'class' in line else 'function',
                    start_line=i + 1,
                    end_line=end_line + 1,
                    docstring=extract_jsdoc_above(lines, i),  # helper not shown
                    imports=extract_imports(source),          # helper not shown
                ))
                break
    return chunks


def find_block_end(lines: list[str], start: int) -> int:
    """Find the closing brace of a code block starting at line `start`."""
    depth = 0
    for i in range(start, len(lines)):
        depth += lines[i].count('{') - lines[i].count('}')
        if depth <= 0 and i > start:
            return i
    return len(lines) - 1
```
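To see the boundaries that AST-based chunking works with, here is a self-contained sketch (the sample module is invented) showing how `ast` exposes `lineno` and `end_lineno` for each top-level definition:

```python
import ast

# A tiny invented module to chunk
SAMPLE = '''import os

def load_config(path):
    "Read a config file."
    return open(path).read()

class Cache:
    def get(self, key):
        return None
'''

tree = ast.parse(SAMPLE)
for node in tree.body:
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        kind = 'class' if isinstance(node, ast.ClassDef) else 'function'
        # Each top-level definition yields one chunk spanning lineno..end_lineno
        print(node.name, kind, node.lineno, node.end_lineno)
```

`load_config` spans lines 3–5 and `Cache` spans lines 7–9, so each chunk is a complete, syntactically valid unit rather than an arbitrary character window.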
**Class-level chunking with method summaries:**
For large classes, include the full class header and method signatures but only the body of the method relevant to the query. This is a hybrid approach that provides class-level context without including every method's full implementation.
```python
def chunk_class_with_signatures(class_node: ast.ClassDef, source: str) -> list[CodeChunk]:
    """
    Create two types of chunks from a class:
    1. Class-level overview (header + method signatures only)
    2. Individual method chunks (full implementation)
    """
    chunks = []

    # Chunk 1: Class overview
    method_sigs = []
    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sig_lines = ast.get_source_segment(source, node).split('\n')[:3]
            method_sigs.append('\n    '.join(sig_lines))
    overview_content = f"class {class_node.name}:\n    " + "\n    ...\n    ".join(method_sigs)
    chunks.append(CodeChunk(
        content=overview_content,
        file_path='',
        symbol_name=class_node.name,
        symbol_type='class_overview',
        start_line=class_node.lineno,
        end_line=class_node.end_lineno,
        docstring=ast.get_docstring(class_node),
        imports=[],
    ))

    # Chunk 2+: Individual methods
    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            method_source = ast.get_source_segment(source, node)
            chunks.append(CodeChunk(
                content=method_source,
                file_path='',
                symbol_name=f"{class_node.name}.{node.name}",
                symbol_type='method',
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=ast.get_docstring(node),
                imports=[],
            ))
    return chunks
```
Tip: Always include the file path and function signature in the chunk metadata, and prepend them to the chunk content before embedding. Embedding `# file: src/services/auth.py\ndef authenticate(user_id: str, password: str) -> Token:` produces a much better embedding vector than embedding the function body alone, because the name and file path encode rich semantic information about what the function does.
## Embedding Models Optimized for Code
Not all embedding models perform equally on code. The choice of embedding model significantly affects retrieval quality.
**Specialized code embedding models:**
| Model | Strengths | Best For |
|---|---|---|
| `code-search-babbage-code-001` (OpenAI, deprecated) | Optimized for code search | Mixed natural language / code queries |
| `text-embedding-3-large` (OpenAI) | Strong general performance, understands code | Cross-language retrieval |
| `voyage-code-2` (Voyage AI) | Specifically trained on code | Highest quality for pure code retrieval |
| `nomic-embed-text` (open source) | Good balance, runs locally | Privacy-sensitive codebases |
| `mxbai-embed-large` (open source) | Strong on structured text including code | Self-hosted setups |
**Setting up embeddings with LlamaIndex:**
```python
from llama_index.core import VectorStoreIndex, Document, Settings, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Option A: OpenAI general-purpose embeddings
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    dimensions=1024  # reduced from 3072 for cost/speed, minimal quality loss
)

# Option B: Voyage AI's code-specific model (use one or the other)
# from llama_index.embeddings.voyageai import VoyageEmbedding
# Settings.embed_model = VoyageEmbedding(
#     model_name="voyage-code-2",
#     voyage_api_key="your-key"
# )


def build_code_index(chunks: list[CodeChunk]) -> VectorStoreIndex:
    """Build a LlamaIndex vector index from code chunks."""
    client = chromadb.PersistentClient(path="./code-index")
    collection = client.get_or_create_collection("codebase")
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    documents = []
    for chunk in chunks:
        # Construct content with rich metadata prepended for better embeddings
        content = f"""File: {chunk.file_path}
Symbol: {chunk.symbol_name} ({chunk.symbol_type})
{f'Description: {chunk.docstring}' if chunk.docstring else ''}
{chunk.content}"""
        doc = Document(
            text=content,
            metadata={
                "file_path": chunk.file_path,
                "symbol_name": chunk.symbol_name,
                "symbol_type": chunk.symbol_type,
                "start_line": chunk.start_line,
                "end_line": chunk.end_line,
            },
        )
        documents.append(doc)

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True,
    )
    return index
```
Tip: Use `dimensions=1024` (or even `dimensions=512`) with OpenAI's `text-embedding-3` models instead of the default 3072. OpenAI's v3 models support Matryoshka representation learning, meaning smaller-dimension vectors retain most of the semantic quality at dramatically lower storage and computation cost. Run a quality benchmark on your own codebase before committing to a dimension.
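If you have already stored full-width vectors, the Matryoshka property also lets you shorten them client-side. A sketch, assuming the vector came from a Matryoshka-trained model: keep the leading components and L2-renormalize.

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and renormalize to unit length.

    For Matryoshka-trained models the leading components carry most of
    the information, so this approximates requesting the smaller
    dimension from the API directly.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Cosine similarity between truncated vectors then works exactly as it does on full-width ones, since both remain unit length.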
## Hybrid Retrieval: Combining Semantic and Keyword Search
Pure embedding (semantic) search struggles with exact identifier matching — if you query for `getUserByEmail`, an embedding model may retrieve semantically similar functions like `findUserByUsername` rather than the exact function you want. Pure keyword (BM25) search misses semantic similarity.
Hybrid retrieval combines both approaches:
```python
from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever


def build_hybrid_retriever(index: VectorStoreIndex, chunks: list[CodeChunk]):
    """
    Build a hybrid retriever combining semantic (vector) and keyword (BM25) search.
    """
    # Semantic retriever
    vector_retriever = index.as_retriever(similarity_top_k=10)

    # Keyword retriever
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore,
        similarity_top_k=10
    )

    # Fusion retriever — combines and deduplicates results
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        retriever_weights=[0.6, 0.4],  # weight semantic slightly higher
        similarity_top_k=5,            # final number of chunks to return
        num_queries=1,                 # do not generate query variations
        mode="reciprocal_rerank",      # RRF fusion algorithm
    )
    return hybrid_retriever


retriever = build_hybrid_retriever(index, chunks)
query = "how does user authentication work with JWT tokens"
results = retriever.retrieve(query)

for node in results:
    print(f"Score: {node.score:.3f} | {node.metadata['symbol_name']} in {node.metadata['file_path']}")
    print(node.text[:300])
    print("---")
```
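The `mode="reciprocal_rerank"` setting applies reciprocal rank fusion (RRF). The core of the algorithm is small enough to sketch directly; the chunk IDs below are invented:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one.

    Each item scores sum(1 / (k + rank)) over the lists that contain it;
    k=60 is the conventional constant from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["authenticate", "verify_token", "login"]
keyword = ["authenticate", "refresh_token", "verify_token"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

Because RRF uses only ranks, not raw scores, it fuses retrievers whose score scales are incomparable (cosine similarity vs. BM25) without any calibration.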
**Reranking for higher precision:**
After initial retrieval, reranking uses a more powerful cross-encoder model to reorder the candidates:
```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.postprocessor import SimilarityPostprocessor


def build_retrieval_pipeline(index: VectorStoreIndex):
    """
    Full retrieval pipeline: hybrid retrieval → reranking → relevance filtering.
    """
    hybrid_retriever = build_hybrid_retriever(index, [])

    # Reranker: uses a cross-encoder to score query-document relevance
    reranker = CohereRerank(
        api_key="your-cohere-key",
        top_n=5,  # keep top 5 after reranking
        model="rerank-english-v3.0"
    )

    # Relevance filter: drop chunks below a similarity threshold so weak
    # matches do not waste context-window tokens
    similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

    query_engine = RetrieverQueryEngine(
        retriever=hybrid_retriever,
        node_postprocessors=[reranker, similarity_filter]
    )
    return query_engine
```
Tip: Implement a "retrieval quality dashboard" that logs query, retrieved chunks, and whether the AI's response correctly used those chunks. Review 10 samples per week. This feedback loop identifies where your retrieval is failing — which is almost always more valuable than tuning embedding parameters blindly.
## Graph-Enhanced Retrieval: Following Dependency Edges
Pure embedding retrieval is context-blind — it retrieves based on text similarity and ignores the dependency graph. Graph-enhanced retrieval combines vector retrieval with graph traversal, ensuring that retrieved functions are enriched with their dependencies.
```python
import networkx as nx
from llama_index.core import VectorStoreIndex


class GraphEnhancedRetriever:
    """
    Retriever that augments vector search results with dependency graph traversal.
    Ensures retrieved code includes its direct dependencies.
    """

    def __init__(self, vector_index: VectorStoreIndex, dependency_graph: dict):
        self.vector_index = vector_index
        self.graph = nx.DiGraph()
        # Build the graph from dependency data
        for source, targets in dependency_graph.items():
            for target in targets:
                self.graph.add_edge(source, target)
        self.vector_retriever = vector_index.as_retriever(similarity_top_k=5)

    def retrieve(self, query: str, enrich_depth: int = 1, max_tokens: int = 6000) -> list[dict]:
        """
        Retrieve relevant chunks and enrich with dependency graph context.
        """
        # Step 1: Vector retrieval
        initial_results = self.vector_retriever.retrieve(query)
        retrieved_symbols = {r.metadata['symbol_name'] for r in initial_results}

        # Step 2: Graph enrichment — add direct dependencies of retrieved symbols
        enriched_symbols = set(retrieved_symbols)
        for symbol in retrieved_symbols:
            if symbol in self.graph:
                for dep in list(self.graph.successors(symbol))[:3]:  # limit to 3 deps
                    enriched_symbols.add(dep)

        # Step 3: Fetch chunks for enriched symbol set
        enriched_chunks = self._fetch_chunks_by_symbols(enriched_symbols)

        # Step 4: Enforce token budget — rank by relevance and cut
        ranked = self._rank_by_relevance(query, enriched_chunks, initial_results)
        return self._apply_token_budget(ranked, max_tokens)

    def _rank_by_relevance(self, query, chunks, vector_results):
        """Score chunks: initial vector results rank higher than graph-added chunks."""
        vector_symbols = {r.metadata['symbol_name']: r.score for r in vector_results}
        scored = []
        for chunk in chunks:
            vector_score = vector_symbols.get(chunk['symbol_name'], 0.0)
            # Graph-added chunks get a base score of 0.5 (below direct retrievals)
            score = vector_score if vector_score > 0 else 0.5
            scored.append((score, chunk))
        # Sort on the score alone — bare tuple sorting would try to compare
        # the chunk dicts on score ties and raise TypeError
        return [chunk for _, chunk in sorted(scored, key=lambda pair: pair[0], reverse=True)]

    def _apply_token_budget(self, chunks, max_tokens):
        """Keep chunks in order until the token budget is exhausted."""
        result = []
        token_count = 0
        for chunk in chunks:
            # Rough estimate: ~1.3 tokens per whitespace-separated word
            chunk_tokens = len(chunk['content'].split()) * 1.3
            if token_count + chunk_tokens <= max_tokens:
                result.append(chunk)
                token_count += chunk_tokens
        return result

    def _fetch_chunks_by_symbols(self, symbols: set[str]) -> list[dict]:
        """Fetch stored chunks by symbol name from the index."""
        # Implementation depends on your vector store
        # This is a conceptual placeholder
        return []
```
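The `dependency_graph` dict the constructor expects is not built above. One way to derive it for Python, sketched under the assumption that symbols are plain top-level function names, is a two-pass `ast` walk over your sources:

```python
import ast

def build_dependency_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each function to the other known functions it calls.

    `sources` maps file paths to file contents; the result can feed the
    dependency_graph parameter sketched for GraphEnhancedRetriever above.
    """
    # Pass 1: collect every defined function name
    defined: set[str] = set()
    trees = []
    for src in sources.values():
        tree = ast.parse(src)
        trees.append(tree)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)

    # Pass 2: record call edges between defined functions
    graph: dict[str, set[str]] = {}
    for tree in trees:
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                calls = set()
                for sub in ast.walk(node):
                    if (isinstance(sub, ast.Call)
                            and isinstance(sub.func, ast.Name)
                            and sub.func.id in defined
                            and sub.func.id != node.name):
                        calls.add(sub.func.id)
                graph[node.name] = calls
    return graph
```

This ignores methods, aliased imports, and dynamic dispatch; a production version would resolve qualified names, but even this 1-hop approximation catches the "calls an undefined helper" failure mode.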
Tip: Graph-enhanced retrieval dramatically reduces "missing context" errors where the AI generates code that calls undefined functions. Before implementing the full graph retrieval system, do a quick manual test: take 5 recent AI coding errors, check whether the missing context would have been caught by 1-hop dependency enrichment. If the answer is yes for 3+ out of 5, graph enrichment is worth building.
## Building a Code RAG Pipeline with LangChain
For teams already using LangChain, here is a complete code RAG implementation:
```python
from pathlib import Path

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate


class CodebaseRAG:
    """
    Complete RAG pipeline for codebase-aware AI assistance.
    Supports Python and TypeScript out of the box.
    """

    LANGUAGE_MAP = {
        '.py': Language.PYTHON,
        '.ts': Language.TS,
        '.tsx': Language.TS,
        '.js': Language.JS,
        '.jsx': Language.JS,
        '.java': Language.JAVA,
        '.cpp': Language.CPP,
        '.go': Language.GO,
    }

    def __init__(self, codebase_path: str, persist_dir: str = "./rag-index"):
        self.codebase_path = Path(codebase_path)
        self.persist_dir = persist_dir
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.vectorstore = None

    def index_codebase(self, force_rebuild: bool = False):
        """Build or load the vector index for the codebase."""
        if not force_rebuild:
            try:
                self.vectorstore = Chroma(
                    persist_directory=self.persist_dir,
                    embedding_function=self.embeddings
                )
                print(f"Loaded existing index with {self.vectorstore._collection.count()} chunks")
                return
            except Exception:
                pass

        all_chunks = []
        for filepath in self.codebase_path.rglob('*'):
            if filepath.suffix not in self.LANGUAGE_MAP:
                continue
            if any(skip in str(filepath) for skip in ['node_modules', '.git', 'dist', '__pycache__']):
                continue
            try:
                content = filepath.read_text(encoding='utf-8', errors='ignore')
                language = self.LANGUAGE_MAP[filepath.suffix]
                splitter = RecursiveCharacterTextSplitter.from_language(
                    language=language,
                    chunk_size=1500,    # ~375 tokens, good for function-level chunks
                    chunk_overlap=100,  # small overlap to avoid cutting function signatures
                )
                chunks = splitter.create_documents(
                    texts=[content],
                    metadatas=[{"source": str(filepath.relative_to(self.codebase_path))}]
                )
                all_chunks.extend(chunks)
            except Exception as e:
                print(f"Skipping {filepath}: {e}")

        print(f"Indexing {len(all_chunks)} chunks from codebase...")
        self.vectorstore = Chroma.from_documents(
            documents=all_chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )
        print(f"Index built: {len(all_chunks)} chunks indexed")

    def query(self, question: str, k: int = 5) -> str:
        """Query the codebase with a natural language question."""
        PROMPT = PromptTemplate(
            template="""You are an expert software engineer with deep knowledge of this codebase.
Use the following code context to answer the question. Be specific and reference actual
function names, file paths, and patterns from the code.

Context:
{context}

Question: {question}

Answer (include file paths and code snippets where relevant):""",
            input_variables=["context", "question"]
        )

        retriever = self.vectorstore.as_retriever(
            search_type="mmr",  # Maximum Marginal Relevance — reduces redundancy
            search_kwargs={"k": k, "fetch_k": 20}
        )

        chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True
        )

        result = chain.invoke({"query": question})

        # Print source attribution
        print("\nSources used:")
        for doc in result["source_documents"]:
            print(f"  - {doc.metadata['source']}")
        return result["result"]


rag = CodebaseRAG('./src')
rag.index_codebase()

answer = rag.query("How does the payment service handle Stripe webhook validation?")
answer = rag.query("What validations are applied to the checkout request DTO?")
answer = rag.query("Which modules are involved when a customer places an order?")
```
**Incremental index updates:**
```python
def update_index_for_changed_files(rag: CodebaseRAG, changed_files: list[str]):
    """Update only the changed files in the index, avoiding a full rebuild."""
    for filepath in changed_files:
        # Delete existing chunks for this file
        rag.vectorstore._collection.delete(
            where={"source": filepath}
        )

        # Re-index the file
        content = Path(filepath).read_text()
        suffix = Path(filepath).suffix
        language = CodebaseRAG.LANGUAGE_MAP.get(suffix)
        if language:
            splitter = RecursiveCharacterTextSplitter.from_language(
                language=language, chunk_size=1500, chunk_overlap=100
            )
            chunks = splitter.create_documents(
                texts=[content],
                metadatas=[{"source": filepath}]
            )
            rag.vectorstore.add_documents(chunks)
            print(f"Re-indexed {filepath}: {len(chunks)} chunks")
```
Tip: Use MMR (Maximum Marginal Relevance) retrieval instead of simple similarity search. MMR balances relevance with diversity — it avoids returning 5 nearly identical chunks from the same file when a query matches one function strongly. In practice, MMR-retrieved contexts produce better AI responses because the model gets varied perspectives rather than redundant repetition of the same code.
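For intuition, the MMR selection rule itself fits in a few lines. This sketch works on precomputed similarity scores; the numbers in the example are invented:

```python
def mmr_select(query_sim: list[float], doc_sims: list[list[float]],
               k: int, lambda_: float = 0.7) -> list[int]:
    """Select k document indices balancing relevance and diversity.

    query_sim[i]   — similarity of document i to the query
    doc_sims[i][j] — similarity between documents i and j
    """
    selected: list[int] = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Penalize candidates similar to anything already selected
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sim[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates (similarity 0.95); document 2 is distinct
query_sim = [0.9, 0.88, 0.6]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
picked = mmr_select(query_sim, doc_sims, k=2)
```

Plain similarity ranking would return the two near-duplicates; MMR keeps the top hit and swaps the redundant second copy for the distinct document.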