
Retrieval-Augmented Generation (RAG) was originally developed for document question-answering, but its application to code is arguably more powerful. A code RAG system dynamically retrieves the most relevant code chunks for any given query, injecting precise, task-specific context while keeping token costs minimal. Where static context files provide persistent background knowledge, RAG provides dynamic, query-responsive foreground knowledge.

This topic covers code-specific RAG architectures, embedding strategies optimized for code, chunking approaches that respect code structure, retrieval ranking and reranking, and complete implementation patterns using LlamaIndex, LangChain, and custom pipelines.


Why Code RAG Is Different from Document RAG

General RAG systems treat documents as bags of chunks, embed them semantically, and retrieve by cosine similarity. This works well for natural language documents but fails to capture the key relationships in code:

Structural dependencies: A function call is meaningless without the function definition. Document RAG might retrieve the call site but miss the definition if they are in different files and not textually similar.

Import chains: Retrieved code that references imported symbols is incomplete without those symbol definitions. Pure semantic retrieval does not follow import edges.

Type definitions: Strongly-typed code (TypeScript, Java, C#) cannot be correctly understood without the types it references. Embedding-based retrieval may not score type definition files highly enough.

Naming conventions over semantics: Two functions named processOrder and handleOrder might be more related than two functions with similar natural-language descriptions doing completely different things. Code RAG must weight identifiers more heavily than general document RAG.

Cross-file coherence: Unlike paragraphs in a document, code chunks are often meaningless in isolation. A method body without its class declaration, a module without its imports — these confuse models rather than helping them.

Effective code RAG addresses all of these concerns through specialized chunking, hybrid retrieval, and post-retrieval enrichment.

Tip: Before implementing a full code RAG system, test whether simple keyword (BM25) search outperforms pure embedding search for your specific use cases. For code, which has consistent, precise naming, keyword search often rivals or beats embedding search at a fraction of the infrastructure cost. Always establish this baseline.
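
A quick way to establish that baseline is a few lines over the open-source rank_bm25 package (the package choice and the identifier-splitting tokenizer below are illustrative assumptions, not part of any framework used later in this topic):

from rank_bm25 import BM25Okapi
import re

def code_tokens(text: str) -> list[str]:
    """Lowercase and split camelCase / snake_case identifiers into word tokens."""
    text = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', text)
    return re.findall(r'[a-z0-9]+', text.lower())

def bm25_baseline(chunks: list[str], query: str, top_n: int = 5) -> list[str]:
    """Rank raw code chunks against a query with plain BM25 keyword matching."""
    bm25 = BM25Okapi([code_tokens(c) for c in chunks])
    return bm25.get_top_n(code_tokens(query), chunks, n=top_n)

If this simple ranking already answers most of your test queries well, measure any embedding pipeline against it before investing further.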


Code Chunking Strategies That Preserve Meaning

The first decision in any RAG pipeline is how to chunk your corpus. For code, chunk boundaries should respect semantic units — the same units a developer would think in.

Function-level chunking (the baseline):

Each chunk is one complete function or method, including its signature, docstring, and body. This is the most natural unit for code and produces the most interpretable retrievals.

import ast
from pathlib import Path
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content: str           # the actual code
    file_path: str
    symbol_name: str       # function/class/method name
    symbol_type: str       # e.g. 'function', 'class', 'method', 'interface'
    start_line: int
    end_line: int
    docstring: str | None
    imports: list[str]     # imports visible at module level

def chunk_python_file(filepath: str) -> list[CodeChunk]:
    """Split a Python file into function/class-level chunks."""
    source = Path(filepath).read_text()
    tree = ast.parse(source)
    lines = source.split('\n')
    chunks = []

    # Extract module-level imports
    module_imports = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if node.col_offset == 0:  # only top-level imports
                module_imports.append(ast.get_source_segment(source, node))

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk_source = '\n'.join(lines[node.lineno - 1:node.end_lineno])
            docstring = ast.get_docstring(node) or None

            chunks.append(CodeChunk(
                content=chunk_source,
                file_path=filepath,
                symbol_name=node.name,
                symbol_type='class' if isinstance(node, ast.ClassDef) else 'function',
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=docstring,
                imports=module_imports
            ))

    return chunks

def chunk_typescript_file(filepath: str) -> list[CodeChunk]:
    """
    Split a TypeScript file into semantic chunks.
    Uses regex-based extraction (production should use tree-sitter).
    """
    import re
    source = Path(filepath).read_text()
    chunks = []

    # Pattern for function declarations, arrow functions, class methods
    patterns = [
        # Exported function declarations
        r'(?:export\s+)?(?:async\s+)?function\s+(\w+)\s*\([^)]*\)[^{]*\{',
        # Class declarations
        r'(?:export\s+)?(?:abstract\s+)?class\s+(\w+)',
        # Interface declarations
        r'(?:export\s+)?interface\s+(\w+)',
    ]

    # For a production implementation, use tree-sitter-typescript
    # This simplified version demonstrates the structure
    lines = source.split('\n')
    for i, line in enumerate(lines):
        for pattern in patterns:
            match = re.match(pattern.strip(), line.strip())
            if match:
                symbol_name = match.group(1)
                # Find the end of this block by brace counting
                end_line = find_block_end(lines, i)
                chunk_content = '\n'.join(lines[i:end_line + 1])

                chunks.append(CodeChunk(
                    content=chunk_content,
                    file_path=filepath,
                    symbol_name=symbol_name,
                    symbol_type='class' if 'class' in line else ('interface' if 'interface' in line else 'function'),
                    start_line=i + 1,
                    end_line=end_line + 1,
                    docstring=extract_jsdoc_above(lines, i),
                    imports=extract_imports(source)
                ))
                break

    return chunks

def find_block_end(lines: list[str], start: int) -> int:
    """Find the closing brace of a code block that starts at line index `start`."""
    depth = 0
    opened = False
    for i in range(start, len(lines)):
        if '{' in lines[i]:
            opened = True
        depth += lines[i].count('{') - lines[i].count('}')
        # Only stop once we have actually entered a brace-delimited block
        if opened and depth <= 0:
            return i
    return len(lines) - 1
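
The TypeScript chunker above calls two helpers that are not shown. Minimal sketches of both follow, with the same regex-based caveats as the chunker itself (a production version would derive these from a tree-sitter parse):

def extract_jsdoc_above(lines: list[str], start: int) -> str | None:
    """Return the /** ... */ JSDoc block immediately above the declaration, if any."""
    end = start - 1
    # Skip blank lines between the doc comment and the declaration
    while end >= 0 and not lines[end].strip():
        end -= 1
    if end < 0 or not lines[end].strip().endswith('*/'):
        return None
    doc = []
    for i in range(end, -1, -1):
        doc.append(lines[i])
        if lines[i].strip().startswith('/**'):
            return '\n'.join(reversed(doc))
    return None

def extract_imports(source: str) -> list[str]:
    """Collect top-level import statements from a TypeScript source file."""
    import re
    return re.findall(r'^import\s.*', source, flags=re.MULTILINE)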

Class-level chunking with method summaries:

For large classes, index a class-level overview chunk (the class header plus method signatures) alongside separate chunks for each method body. Retrieval can then surface class-level context for broad queries and the full implementation of only the method relevant to a specific query, without pulling in every method's body.

def chunk_class_with_signatures(class_node: ast.ClassDef, source: str) -> list[CodeChunk]:
    """
    Creates two types of chunks from a class:
    1. Class-level overview (header + method signatures only)
    2. Individual method chunks (full implementation)
    """
    chunks = []

    # Chunk 1: Class overview
    method_sigs = []
    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sig_lines = ast.get_source_segment(source, node).split('\n')[:3]
            method_sigs.append('\n    '.join(sig_lines))

    overview_content = f"class {class_node.name}:\n    " + "\n    ...\n    ".join(method_sigs)
    chunks.append(CodeChunk(
        content=overview_content,
        file_path='',
        symbol_name=class_node.name,
        symbol_type='class_overview',
        start_line=class_node.lineno,
        end_line=class_node.end_lineno,
        docstring=ast.get_docstring(class_node),
        imports=[]
    ))

    # Chunk 2+: Individual methods
    for node in class_node.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            method_source = ast.get_source_segment(source, node)
            chunks.append(CodeChunk(
                content=method_source,
                file_path='',
                symbol_name=f"{class_node.name}.{node.name}",
                symbol_type='method',
                start_line=node.lineno,
                end_line=node.end_lineno,
                docstring=ast.get_docstring(node),
                imports=[]
            ))

    return chunks

Tip: Always include the file path and function signature in the chunk metadata, and prepend them to the chunk content before embedding. Embedding # file: src/services/auth.py\ndef authenticate(user_id: str, password: str) -> Token: produces a much better embedding vector than embedding the function body alone, because the name and file path encode rich semantic information about what the function does.
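
Applied to the CodeChunk dataclass above, the tip amounts to a small helper like this (the exact header format is a stylistic choice, not a requirement):

def embedding_text(chunk: CodeChunk) -> str:
    """Prepend location and symbol metadata so identifiers and file path shape the embedding."""
    header = f"# file: {chunk.file_path}\n# symbol: {chunk.symbol_name} ({chunk.symbol_type})"
    if chunk.docstring:
        header += f"\n# doc: {chunk.docstring.splitlines()[0]}"
    return f"{header}\n\n{chunk.content}"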


Embedding Models Optimized for Code

Not all embedding models perform equally on code. The choice of embedding model significantly affects retrieval quality.

Specialized code embedding models:

Model | Strengths | Best For
code-search-babbage-code-001 (OpenAI) | Optimized for code search (note: since deprecated by OpenAI) | Mixed natural language / code queries
text-embedding-3-large (OpenAI) | Strong general performance, understands code | Cross-language retrieval
voyage-code-2 (Voyage AI) | Specifically trained on code | Highest quality for pure code retrieval
nomic-embed-text (open source) | Good balance, runs locally | Privacy-sensitive codebases
mxbai-embed-large (open source) | Strong on structured text including code | Self-hosted setups
Setting up embeddings with LlamaIndex:

from llama_index.core import VectorStoreIndex, Document, Settings, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.voyageai import VoyageEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Option A: OpenAI general-purpose embeddings
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    dimensions=1024  # reduced from 3072 for cost/speed, minimal quality loss
)

# Option B (alternative): code-specialized Voyage embeddings. Pick one; assigning
# both in sequence would leave only the last assignment active.
# Settings.embed_model = VoyageEmbedding(
#     model_name="voyage-code-2",
#     voyage_api_key="your-key"
# )

def build_code_index(chunks: list[CodeChunk]) -> VectorStoreIndex:
    """Build a LlamaIndex vector index from code chunks."""

    client = chromadb.PersistentClient(path="./code-index")
    collection = client.get_or_create_collection("codebase")
    vector_store = ChromaVectorStore(chroma_collection=collection)

    documents = []
    for chunk in chunks:
        # Construct content with rich metadata prepended for better embeddings
        content = f"""File: {chunk.file_path}
Symbol: {chunk.symbol_name} ({chunk.symbol_type})
{f'Description: {chunk.docstring}' if chunk.docstring else ''}

{chunk.content}"""

        doc = Document(
            text=content,
            metadata={
                "file_path": chunk.file_path,
                "symbol_name": chunk.symbol_name,
                "symbol_type": chunk.symbol_type,
                "start_line": chunk.start_line,
                "end_line": chunk.end_line,
            }
        )
        documents.append(doc)

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )

    return index

Tip: Use dimensions=1024 (or even dimensions=512) with OpenAI's text-embedding-3 models instead of the default 3072. OpenAI's v3 models support Matryoshka representation learning, meaning smaller dimension vectors retain most of the semantic quality at dramatically lower storage and computation cost. Run a quality benchmark on your own codebase before committing to a dimension.
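
A dimension benchmark can be very small. Here is a sketch, assuming a handful of hand-written (query, expected file) pairs for your own codebase (eval_pairs below is that hypothetical list) and the official OpenAI Python client:

from openai import OpenAI
import numpy as np

client = OpenAI()

def recall_at_k(eval_pairs: list[tuple[str, str]], chunks: list[CodeChunk],
                dims: int, k: int = 5) -> float:
    """Fraction of queries whose expected file appears in the top-k retrieved chunks."""
    texts = [f"# file: {c.file_path}\n{c.content}" for c in chunks]
    emb = client.embeddings.create(model="text-embedding-3-large", input=texts, dimensions=dims)
    chunk_vecs = np.array([e.embedding for e in emb.data])

    hits = 0
    for query, expected_file in eval_pairs:
        q = client.embeddings.create(model="text-embedding-3-large", input=[query], dimensions=dims)
        q_vec = np.array(q.data[0].embedding)
        scores = chunk_vecs @ q_vec  # embeddings are unit-normalized, so dot product = cosine
        top = np.argsort(scores)[::-1][:k]
        if any(chunks[i].file_path == expected_file for i in top):
            hits += 1
    return hits / len(eval_pairs)

# Compare dimensions before committing:
# for d in (512, 1024, 3072):
#     print(d, recall_at_k(eval_pairs, chunks, dims=d))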


Hybrid Retrieval: Combining Semantic and Keyword Search

Pure embedding (semantic) search struggles with exact identifier matching — if you query for getUserByEmail, an embedding model may retrieve semantically similar functions like findUserByUsername rather than the exact function you want. Pure keyword (BM25) search misses semantic similarity.

Hybrid retrieval combines both approaches:

from llama_index.core import VectorStoreIndex
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

def build_hybrid_retriever(index: VectorStoreIndex, chunks: list[CodeChunk]):
    """
    Build a hybrid retriever combining semantic (vector) and keyword (BM25) search.
    """

    # Semantic retriever
    vector_retriever = index.as_retriever(similarity_top_k=10)

    # Keyword retriever
    # Note: BM25 pulls nodes from the index docstore. If your nodes live only in an
    # external vector store (e.g. Chroma), keep them in the docstore as well
    # (store_nodes_override=True when building the index) so BM25 has text to search.
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore,
        similarity_top_k=10
    )

    # Fusion retriever — combines and deduplicates results
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        retriever_weights=[0.6, 0.4],  # weight semantic slightly higher
        similarity_top_k=5,            # final number of chunks to return
        num_queries=1,                 # do not generate query variations
        mode="reciprocal_rerank",      # RRF fusion algorithm
    )

    return hybrid_retriever


retriever = build_hybrid_retriever(index, chunks)

query = "how does user authentication work with JWT tokens"
results = retriever.retrieve(query)

for node in results:
    print(f"Score: {node.score:.3f} | {node.metadata['symbol_name']} in {node.metadata['file_path']}")
    print(node.text[:300])
    print("---")

Reranking for higher precision:

After initial retrieval, reranking uses a more powerful cross-encoder model to reorder the candidates:

from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.core.postprocessor import SimilarityPostprocessor

def build_retrieval_pipeline(index: VectorStoreIndex):
    """
    Full retrieval pipeline: hybrid retrieval → reranking → similarity filtering.
    """
    hybrid_retriever = build_hybrid_retriever(index, [])

    # Reranker: uses a cross-encoder to score query-document relevance
    reranker = CohereRerank(
        api_key="your-cohere-key",
        top_n=5,                        # keep top 5 after reranking
        model="rerank-english-v3.0"
    )

    # Similarity cutoff: drop low-relevance chunks to keep the final context lean
    # (this filters by score, not by token count)
    similarity_filter = SimilarityPostprocessor(similarity_cutoff=0.7)

    query_engine = RetrieverQueryEngine(
        retriever=hybrid_retriever,
        node_postprocessors=[reranker, similarity_filter]
    )

    return query_engine

Tip: Implement a "retrieval quality dashboard" that logs query, retrieved chunks, and whether the AI's response correctly used those chunks. Review 10 samples per week. This feedback loop identifies where your retrieval is failing — which is almost always more valuable than tuning embedding parameters blindly.
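
A minimal version of that dashboard is an append-only JSONL log written alongside each query (the field names below are only a suggestion, and the "did the answer use the chunks" flag is filled in by a reviewer, not computed automatically):

import json
import time

def log_retrieval(query: str, results, answer: str, log_path: str = "retrieval_log.jsonl") -> None:
    """Append one retrieval event for later manual review."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {
                "symbol": node.metadata.get("symbol_name"),
                "file": node.metadata.get("file_path"),
                "score": node.score,
            }
            for node in results
        ],
        "answer": answer,
        "answer_used_retrieved_chunks": None,  # set manually during weekly review
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")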


Graph-Enhanced Retrieval: Following Dependency Edges

Pure embedding retrieval is context-blind — it retrieves based on text similarity and ignores the dependency graph. Graph-enhanced retrieval combines vector retrieval with graph traversal, ensuring that retrieved functions are enriched with their dependencies.

import networkx as nx

class GraphEnhancedRetriever:
    """
    Retriever that augments vector search results with dependency graph traversal.
    Ensures retrieved code includes its direct dependencies.
    """

    def __init__(self, vector_index: VectorStoreIndex, dependency_graph: dict):
        self.vector_index = vector_index
        self.graph = nx.DiGraph()

        # Build the graph from dependency data
        for source, targets in dependency_graph.items():
            for target in targets:
                self.graph.add_edge(source, target)

        self.vector_retriever = vector_index.as_retriever(similarity_top_k=5)

    def retrieve(self, query: str, enrich_depth: int = 1, max_tokens: int = 6000) -> list[dict]:
        """
        Retrieve relevant chunks and enrich with dependency graph context.
        """
        # Step 1: Vector retrieval
        initial_results = self.vector_retriever.retrieve(query)
        retrieved_symbols = {r.metadata['symbol_name'] for r in initial_results}

        # Step 2: Graph enrichment — add direct dependencies of retrieved symbols
        enriched_symbols = set(retrieved_symbols)
        for symbol in retrieved_symbols:
            if symbol in self.graph:
                for dep in list(self.graph.successors(symbol))[:3]:  # limit to 3 deps
                    enriched_symbols.add(dep)

        # Step 3: Fetch chunks for enriched symbol set
        enriched_chunks = self._fetch_chunks_by_symbols(enriched_symbols)

        # Step 4: Enforce token budget — rank by relevance and cut
        ranked = self._rank_by_relevance(query, enriched_chunks, initial_results)
        return self._apply_token_budget(ranked, max_tokens)

    def _rank_by_relevance(self, query, chunks, vector_results):
        """Score chunks: initial vector results rank higher than graph-added chunks."""
        vector_symbols = {r.metadata['symbol_name']: r.score for r in vector_results}

        scored = []
        for chunk in chunks:
            vector_score = vector_symbols.get(chunk['symbol_name'], 0.0)
            # Graph-added chunks get a base score of 0.5 (below direct retrievals)
            score = vector_score if vector_score > 0 else 0.5
            scored.append((score, chunk))

        # Sort by score only; sorting the tuples directly would compare dicts on score ties
        return [chunk for _, chunk in sorted(scored, key=lambda pair: pair[0], reverse=True)]

    def _apply_token_budget(self, chunks, max_tokens):
        """Keep chunks in order until token budget is exhausted."""
        result = []
        token_count = 0
        for chunk in chunks:
            chunk_tokens = len(chunk['content'].split()) * 1.3  # rough word-to-token estimate
            if token_count + chunk_tokens <= max_tokens:
                result.append(chunk)
                token_count += chunk_tokens
        return result

    def _fetch_chunks_by_symbols(self, symbols: set[str]) -> list[dict]:
        """Fetch stored chunks by symbol name from the index."""
        # Implementation depends on your vector store
        # This is a conceptual placeholder
        return []

Tip: Graph-enhanced retrieval dramatically reduces "missing context" errors where the AI generates code that calls undefined functions. Before implementing the full graph retrieval system, do a quick manual test: take 5 recent AI coding errors, check whether the missing context would have been caught by 1-hop dependency enrichment. If the answer is yes for 3+ out of 5, graph enrichment is worth building.
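
That spot check is a few lines once the dependency graph exists (the error_cases list below is a hypothetical record of what was retrieved versus what was missing in each failure):

import networkx as nx

def one_hop_would_have_helped(graph: nx.DiGraph, retrieved: set[str], missing: str) -> bool:
    """True if the missing symbol is a direct dependency of something that was retrieved."""
    return any(missing in graph.successors(s) for s in retrieved if s in graph)

# error_cases = [({"OrderService.create"}, "validate_cart"), ...]  # hypothetical
# caught = sum(one_hop_would_have_helped(dep_graph, r, m) for r, m in error_cases)
# print(f"{caught}/{len(error_cases)} failures covered by 1-hop enrichment")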


Building a Code RAG Pipeline with LangChain

For teams already using LangChain, here is a complete code RAG implementation:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from pathlib import Path

class CodebaseRAG:
    """
    Complete RAG pipeline for codebase-aware AI assistance.
    Supports Python and TypeScript out of the box.
    """

    LANGUAGE_MAP = {
        '.py': Language.PYTHON,
        '.ts': Language.TS,
        '.tsx': Language.TS,
        '.js': Language.JS,
        '.jsx': Language.JS,
        '.java': Language.JAVA,
        '.cpp': Language.CPP,
        '.go': Language.GO,
    }

    def __init__(self, codebase_path: str, persist_dir: str = "./rag-index"):
        self.codebase_path = Path(codebase_path)
        self.persist_dir = persist_dir
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.vectorstore = None

    def index_codebase(self, force_rebuild: bool = False):
        """Build or load the vector index for the codebase."""

        if not force_rebuild:
            try:
                existing = Chroma(
                    persist_directory=self.persist_dir,
                    embedding_function=self.embeddings
                )
                # Chroma happily opens an empty collection, so only reuse the
                # index if it actually contains chunks
                if existing._collection.count() > 0:
                    self.vectorstore = existing
                    print(f"Loaded existing index with {existing._collection.count()} chunks")
                    return
            except Exception:
                pass

        all_chunks = []

        for filepath in self.codebase_path.rglob('*'):
            if filepath.suffix not in self.LANGUAGE_MAP:
                continue
            if any(skip in str(filepath) for skip in ['node_modules', '.git', 'dist', '__pycache__']):
                continue

            try:
                content = filepath.read_text(encoding='utf-8', errors='ignore')
                language = self.LANGUAGE_MAP[filepath.suffix]

                splitter = RecursiveCharacterTextSplitter.from_language(
                    language=language,
                    chunk_size=1500,    # ~375 tokens, good for function-level chunks
                    chunk_overlap=100,  # small overlap to avoid cutting function signatures
                )

                chunks = splitter.create_documents(
                    texts=[content],
                    metadatas=[{"source": str(filepath.relative_to(self.codebase_path))}]
                )
                all_chunks.extend(chunks)

            except Exception as e:
                print(f"Skipping {filepath}: {e}")

        print(f"Indexing {len(all_chunks)} chunks from codebase...")

        self.vectorstore = Chroma.from_documents(
            documents=all_chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_dir
        )

        print(f"Index built: {len(all_chunks)} chunks indexed")

    def query(self, question: str, k: int = 5) -> str:
        """Query the codebase with a natural language question."""

        PROMPT = PromptTemplate(
            template="""You are an expert software engineer with deep knowledge of this codebase.

Use the following code context to answer the question. Be specific and reference actual 
function names, file paths, and patterns from the code.

Context:
{context}

Question: {question}

Answer (include file paths and code snippets where relevant):""",
            input_variables=["context", "question"]
        )

        retriever = self.vectorstore.as_retriever(
            search_type="mmr",           # Maximum Marginal Relevance — reduces redundancy
            search_kwargs={"k": k, "fetch_k": 20}
        )

        chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True
        )

        result = chain.invoke({"query": question})

        # Print source attribution
        print("\nSources used:")
        for doc in result["source_documents"]:
            print(f"  - {doc.metadata['source']}")

        return result["result"]


rag = CodebaseRAG('./src')
rag.index_codebase()

answer = rag.query("How does the payment service handle Stripe webhook validation?")

answer = rag.query("What validations are applied to the checkout request DTO?")

answer = rag.query("Which modules are involved when a customer places an order?")

Incremental index updates:

def update_index_for_changed_files(rag: CodebaseRAG, changed_files: list[str]):
    """Update only the changed files in the index, avoiding full rebuild."""

    for filepath in changed_files:
        # Delete existing chunks for this file
        rag.vectorstore._collection.delete(
            where={"source": filepath}
        )

        # Re-index the file
        content = (rag.codebase_path / filepath).read_text()
        suffix = Path(filepath).suffix
        language = CodebaseRAG.LANGUAGE_MAP.get(suffix)

        if language:
            splitter = RecursiveCharacterTextSplitter.from_language(
                language=language, chunk_size=1500, chunk_overlap=100
            )
            chunks = splitter.create_documents(
                texts=[content],
                metadatas=[{"source": filepath}]
            )
            rag.vectorstore.add_documents(chunks)
            print(f"Re-indexed {filepath}: {len(chunks)} chunks")

Tip: Use MMR (Maximum Marginal Relevance) retrieval instead of simple similarity search. MMR balances relevance with diversity — it avoids returning 5 nearly identical chunks from the same file when a query matches one function strongly. In practice, MMR-retrieved contexts produce better AI responses because the model gets varied perspectives rather than redundant repetition of the same code.
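
Under the hood, MMR is a greedy loop that trades query relevance against similarity to what has already been selected. An illustrative version of the selection step follows (LangChain's retriever does this for you; lambda_mult=0.6 is an arbitrary choice here):

import numpy as np

def mmr_select(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5, lambda_mult: float = 0.6) -> list[int]:
    """Greedily pick docs maximizing relevance minus redundancy with already-selected docs."""
    # Assumes unit-normalized vectors, so dot products are cosine similarities
    query_sims = doc_vecs @ query_vec
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            redundancy = max((float(doc_vecs[i] @ doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * float(query_sims[i]) - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected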