
When an AI agent is working on your codebase, it needs two fundamentally different kinds of knowledge: structural knowledge (how the codebase is organized, what exists where, how components relate to each other) and content knowledge (the actual code inside a specific file). Naive inclusion conflates these two needs and satisfies both by dumping full file contents — which is enormously wasteful because structural knowledge can be conveyed at a fraction of the token cost.

This topic covers the techniques and tools that provide rich structural context without the token overhead of full file inclusion: repo maps, file summaries, symbol indexes, and layered structural representations.


What Is a Repo Map and Why Does It Work

A repo map is a compact representation of your codebase's structure that answers the question: "What exists in this codebase and where?" without requiring the model to read every line of code.

The concept was pioneered and popularized by aider, the AI pair programming tool, which generates a repo map automatically before each interaction. aider's repo map uses tree-sitter to parse every file in the repository and extract only the signatures: class names, function names, method signatures, exported symbols, and the file paths they live in.

A repo map for a typical Express.js API might look like:

src/
  api/
    routes/
      users.ts
        - GET /users → getUserList(req, res)
        - POST /users → createUser(req, res)
        - GET /users/:id → getUserById(req, res)
        - DELETE /users/:id → deleteUser(req, res)
      auth.ts
        - POST /auth/login → login(req, res)
        - POST /auth/refresh → refreshToken(req, res)
  services/
    UserService.ts
      class UserService
        + constructor(db: Database, cache: Redis)
        + findById(id: string): Promise<User>
        + create(data: CreateUserDto): Promise<User>
        + delete(id: string): Promise<void>
    AuthService.ts
      class AuthService
        + constructor(userService: UserService, jwt: JWTConfig)
        + authenticate(email: string, password: string): Promise<Token>
        + validateToken(token: string): Promise<Payload>
  models/
    User.ts
      interface User { id, email, passwordHash, createdAt, role }
    Token.ts
      interface Token { accessToken, refreshToken, expiresIn }

This map covers the entire meaningful structure of a 12-file, 1,500-line codebase in roughly 500 tokens. The full file contents would cost 6,000–9,000 tokens. The map delivers 90% of the structural understanding at 6–8% of the cost.

Why models respond well to repo maps: Language models are trained on enormous quantities of code documentation, README files, and API references — all of which describe code structure in exactly this format. A repo map is written in a language the model already speaks fluently.

Tip: Configure aider with --map-tokens 1024 to limit the repo map to 1,024 tokens on large codebases. Aider's map algorithm will automatically select the most relevant files based on your current task. For very large repos, --map-tokens 2048 gives richer structural context while remaining far below the cost of full inclusion.


Generating Repo Maps: Tools and Techniques

Several tools generate repo maps, each with different depth and customizability.

Using aider's built-in repo map:

Aider generates its repo map using a sophisticated algorithm that weights files by their importance to the current task (based on imports, references, and recency). You can inspect what it generates:

aider --show-repo-map

aider --map-tokens 1500 --model claude-sonnet-4-5
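The weighting idea can be sketched in a few lines. This is not aider's actual algorithm (which ranks a graph of symbol definitions and references); it is a toy version that scores each file by how many other files import it:

```python
from collections import Counter

def rank_files(imports: dict[str, list[str]]) -> list[str]:
    """Rank files by in-degree: how often other files import them.

    `imports` maps each file to the files it imports — a toy stand-in
    for the reference graph a real tool would build with tree-sitter.
    """
    in_degree = Counter()
    for src, targets in imports.items():
        in_degree[src] += 0  # ensure every file appears, even if unreferenced
        for target in targets:
            in_degree[target] += 1
    return [path for path, _ in in_degree.most_common()]

graph = {
    "api/users.ts": ["services/UserService.ts"],
    "api/auth.ts": ["services/AuthService.ts", "services/UserService.ts"],
    "services/AuthService.ts": ["services/UserService.ts"],
    "services/UserService.ts": ["models/User.ts"],
    "models/User.ts": [],
}
print(rank_files(graph)[0])  # UserService has the highest in-degree
```

Files at the top of this ranking are the ones most worth keeping in the map when the token budget forces truncation.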

Generating custom repo maps with ctags:

Universal Ctags generates a rich symbol index that can be reformatted into a map:

brew install universal-ctags  # macOS
apt install universal-ctags   # Debian/Ubuntu

ctags -R --fields=+n --extras=+q --output-format=json -o tags.json ./src

python3 << 'EOF'
import json
from collections import defaultdict

with open('tags.json') as f:
    tags = [json.loads(line) for line in f]

by_file = defaultdict(list)
for tag in tags:
    # keep only real tags (not pseudo-tags) of symbol kinds worth mapping
    if tag.get('_type') == 'tag' and tag.get('kind') in (
            'function', 'class', 'method', 'interface', 'type'):
        by_file[tag['path']].append(f"  {tag['kind']}: {tag['name']}")

for path, symbols in sorted(by_file.items()):
    print(f"\n{path}")
    for sym in symbols:
        print(sym)
EOF

Generating repo maps with tree-sitter:

Tree-sitter provides language-aware parsing that produces more accurate maps than ctags for modern languages:

from pathlib import Path

from tree_sitter import Language, Parser
import tree_sitter_typescript  # pip install tree-sitter tree-sitter-typescript

# Wrap the compiled TypeScript grammar (py-tree-sitter >= 0.22 API)
TS_LANGUAGE = Language(tree_sitter_typescript.language_typescript())

def extract_typescript_signatures(filepath: str) -> list[str]:
    """Extract function/class signatures from a TypeScript file."""
    # This is a simplified version; production use should handle
    # multi-line signatures, decorators, and other edge cases
    signatures = []

    content = Path(filepath).read_text()

    parser = Parser()
    parser.language = TS_LANGUAGE  # older versions: parser.set_language(TS_LANGUAGE)
    tree = parser.parse(bytes(content, 'utf8'))

    def visit(node, depth=0):
        if node.type in ('function_declaration', 'method_definition',
                         'class_declaration', 'interface_declaration',
                         'export_statement'):
            # Take the first line of the node as its signature
            first_line = content[node.start_byte:].split('\n')[0].strip()
            signatures.append('  ' * depth + first_line)
        for child in node.children:
            visit(child, depth + (1 if node.type == 'class_body' else 0))

    visit(tree.root_node)
    return signatures

def build_repo_map(root: str, extensions: tuple[str, ...] = ('.ts', '.tsx')) -> str:
    """Build a full repo map from a directory.

    Note: the extractor above is TypeScript-specific; other languages
    need their own grammar and declaration node types.
    """
    lines = []
    for path in Path(root).rglob('*'):
        if path.suffix in extensions and '.git' not in path.parts:
            relative = path.relative_to(root)
            sigs = extract_typescript_signatures(str(path))
            if sigs:
                lines.append(f"\n{relative}")
                lines.extend(sigs[:20])  # cap at 20 symbols per file
    return '\n'.join(lines)

map_content = build_repo_map('./src')
print(f"Repo map: {len(map_content.split())} words")
print(map_content[:2000])

Tip: Build your repo map generation as a pre-commit hook or CI step that writes the map to a repo-map.txt or STRUCTURE.md file. This way every AI session can ingest an always-fresh map without recomputing it, and the file can be checked into version control for team-wide benefit.
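A minimal sketch of such a hook script, using a crude regex instead of a full parser (the regex, the file extensions, and the repo-map.txt filename are all assumptions; a real hook would reuse a tree-sitter or ctags based generator like those above):

```python
import re
from pathlib import Path

# Crude signature matcher for demo purposes only: catches
# `class`/`interface`/`function`/`def` declaration lines.
SIG_RE = re.compile(r'^\s*(?:export\s+)?(?:class|interface|function|def)\s+\w+')

def write_repo_map(root: str, out: str = "repo-map.txt") -> int:
    """Write a signature-level map of `root` to `out`; returns line count."""
    lines = []
    for path in sorted(Path(root).rglob('*')):
        if path.suffix not in ('.ts', '.tsx', '.py') or '.git' in path.parts:
            continue
        sigs = [line.strip() for line in
                path.read_text(errors='ignore').splitlines()
                if SIG_RE.match(line)][:20]  # cap symbols per file
        if sigs:
            lines.append(str(path.relative_to(root)))
            lines.extend('  ' + sig for sig in sigs)
    Path(out).write_text('\n'.join(lines) + '\n')
    return len(lines)
```

Called from a pre-commit hook or CI step, this keeps the checked-in map fresh without any manual work.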


File Summaries: Describing Files Without Showing Them

Where repo maps convey structure, file summaries convey intent. A file summary answers: "What does this file do and why does it exist?" in 2–5 sentences. Together with a repo map, summaries give an AI agent enough understanding to navigate a codebase intelligently.

Format of effective file summaries:

src/services/UserService.ts
Purpose: Core business logic for user lifecycle management (CRUD, authentication handoff).
Exports: UserService class (singleton via IoC), CreateUserDto, UpdateUserDto types.
Dependencies: Database (TypeORM entity manager), Redis (session cache), AuthService (password hashing delegation).
Key constraints: All mutations go through this service — do not write directly to the User entity elsewhere.

This summary costs ~80 tokens but communicates the file's role, boundaries, and key constraints — information that prevents the most common AI mistakes (violating architectural boundaries, duplicating logic that already exists).

Generating file summaries at scale:

You can generate file summaries using AI itself, then store them as a permanent artifact:

import anthropic
from pathlib import Path
import json

client = anthropic.Anthropic()

SUMMARY_PROMPT = """You are analyzing a source code file to produce a concise summary for an AI coding assistant.

File path: {filepath}
File content:
{content}

Produce a JSON summary with these fields:
- purpose: 1-2 sentences on what this file does
- exports: key classes, functions, types exported
- dependencies: key external/internal dependencies
- constraints: architectural rules or invariants that should not be violated
- typical_use: when an engineer would need to read/modify this file

Keep each field under 30 words. Return only valid JSON."""

def summarize_file(filepath: str) -> dict:
    content = Path(filepath).read_text()
    if len(content) > 8000:
        # For large files, summarize the first 4K and last 2K characters
        content = content[:4000] + "\n... [truncated] ...\n" + content[-2000:]

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": SUMMARY_PROMPT.format(filepath=filepath, content=content)
        }]
    )

    return json.loads(response.content[0].text)

def generate_summaries(src_root: str, output_file: str = "file-summaries.json"):
    summaries = {}
    for path in Path(src_root).rglob('*.ts'):
        if '.git' not in str(path) and 'node_modules' not in str(path):
            relative = str(path.relative_to(src_root))
            print(f"Summarizing: {relative}")
            summaries[relative] = summarize_file(str(path))

    with open(output_file, 'w') as f:
        json.dump(summaries, f, indent=2)

    print(f"Generated {len(summaries)} summaries → {output_file}")

generate_summaries('./src', './docs/file-summaries.json')

Using summaries in prompts:

def get_relevant_summaries(query: str, summaries: dict, top_k: int = 5) -> str:
    """
    Simple keyword-based relevance selection.
    In production, replace with embedding similarity (see topic-05).
    """
    query_terms = set(query.lower().split())

    scored = []
    for filepath, summary in summaries.items():
        summary_text = json.dumps(summary).lower()
        score = sum(1 for term in query_terms if term in summary_text)
        scored.append((score, filepath, summary))

    scored.sort(key=lambda item: item[0], reverse=True)  # rank by score only
    top = scored[:top_k]

    result = "Relevant file summaries:\n"
    for score, filepath, summary in top:
        result += f"\n{filepath}:\n"
        result += f"  Purpose: {summary['purpose']}\n"
        result += f"  Exports: {summary['exports']}\n"

    return result

Tip: Regenerate file summaries on a schedule (weekly) or as part of your CI pipeline when files change by more than 20%. Stale summaries are worse than no summaries because they actively mislead the model. A simple git diff check can identify which files have changed enough to warrant re-summarization.
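That 20% churn check can be sketched as follows. The helper parses `git diff --numstat` output; the 0.2 threshold and the use of current line count as the file-size baseline are both arbitrary choices you can tune:

```python
import subprocess
from pathlib import Path

def needs_resummary(added: int, deleted: int, total_lines: int,
                    threshold: float = 0.2) -> bool:
    """True if churn (added + deleted lines) exceeds `threshold`
    of the file's current size."""
    if total_lines == 0:
        return True
    return (added + deleted) / total_lines > threshold

def changed_files_to_resummarize(since: str = "HEAD~1") -> list[str]:
    """List files whose churn since `since` warrants a fresh summary."""
    out = subprocess.run(
        ["git", "diff", "--numstat", since],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = []
    for line in out.splitlines():
        added, deleted, path = line.split('\t')
        if added == '-':       # binary files report '-' counts
            continue
        p = Path(path)
        if not p.exists():     # file was deleted: drop its summary instead
            continue
        total = len(p.read_text(errors='ignore').splitlines())
        if needs_resummary(int(added), int(deleted), total):
            flagged.append(path)
    return flagged
```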


Structural Context Layers: Combining Maps and Summaries

The most effective approach combines three layers of structural context, each at a different level of granularity:

Layer 1 — Architecture overview (50–150 tokens)

A 3–5 sentence description of the codebase's major architectural components and their relationships. This lives in CLAUDE.md or a designated context file (covered in Topic 4), but it is worth noting here as the structural context foundation.

Architecture: 3-tier NestJS API. Modules in src/modules/ each follow 
controller → service → repository pattern. Shared utilities in src/common/. 
All external API clients in src/integrations/. Database access only through 
TypeORM repositories — never raw SQL or direct entity manager calls outside repositories.

Layer 2 — Repo map (300–1500 tokens)

The file-and-symbol map described above. This gives the model navigational awareness — it knows what exists and where without reading every file.

Layer 3 — File summaries for relevant files (50–100 tokens per file)

For the 3–8 files most likely relevant to the current task, include their summaries. This is different from including their full content — summaries give purpose and constraint information, not implementation details.

Full file content is added only for the 1–3 files that the AI is actually being asked to modify or that contain directly referenced code.

A combined context builder:

def build_optimal_context(
    query: str,
    repo_root: str,
    architecture_doc: str,
    summaries: dict,
    max_tokens: int = 8000
) -> str:
    """
    Build a layered context that maximizes relevance within a token budget.
    """
    sections = []
    token_count = 0

    # Layer 1: Architecture overview (~100 tokens, always include)
    sections.append(f"## Architecture\n{architecture_doc}\n")
    token_count += 100

    # Layer 2: Repo map (up to 1500 tokens)
    repo_map = build_repo_map(repo_root)
    map_tokens = min(1500, max_tokens // 4)
    truncated_map = truncate_to_tokens(repo_map, map_tokens)
    sections.append(f"## Codebase Structure\n{truncated_map}\n")
    token_count += map_tokens

    # Layer 3: Relevant file summaries (up to 500 tokens)
    relevant_summaries = get_relevant_summaries(query, summaries, top_k=5)
    sections.append(f"## Relevant Files\n{relevant_summaries}\n")
    token_count += 500

    # Remaining budget: actual file content for the most targeted files
    remaining = max_tokens - token_count
    relevant_files = identify_target_files(query, summaries, top_k=2)
    for filepath in relevant_files:
        content = Path(repo_root, filepath).read_text()
        file_tokens = count_tokens(content)
        if file_tokens <= remaining:
            sections.append(f"## {filepath}\n```\n{content}\n```\n")
            remaining -= file_tokens

    return "\n".join(sections)

Tip: Think of your context as a pyramid: wide structural overview at the top, narrowing to specific file content at the bottom. The model uses the wide layers to orient itself and navigate, then uses the narrow layers to produce correct, idiomatic code. Always include all three layers, even if the specific file content layer is small.


Structural Context in Specific AI Tools

Each major AI coding tool has its own way to provide structural context:

Cursor:

Cursor's @codebase command performs semantic search across your codebase. To guide it effectively, create a cursor-summary.md in your repo root:


## Stack
NestJS 10, TypeScript 5.3, PostgreSQL 15 with TypeORM, Redis 7, Jest.

## Key Directories
- `src/modules/` — Feature modules (each: controller, service, repository, dto, entity)
- `src/common/` — Shared decorators, guards, interceptors, filters
- `src/integrations/` — Third-party API clients (Stripe, SendGrid, S3)
- `src/config/` — Environment configuration with validation schemas

## Critical Patterns
- Services are injected via NestJS DI — never instantiated directly
- Entities live in `*.entity.ts` files — never modify their shape without a migration
- DTOs use class-validator decorators — always validate at the controller boundary
- All async operations use async/await — no raw Promise chains

GitHub Copilot:

Copilot's context comes primarily from open files and the current file. Optimize by:
- Keeping only relevant files open in your editor tabs
- Writing descriptive // Purpose: comments at the top of complex files
- Using rich JSDoc/TSDoc comments on exported functions

/**
 * @file UserService — manages the full user lifecycle
 * @module users
 * @description All user mutations MUST go through this service.
 * Direct repository access outside this service is a bug.
 */
export class UserService {
  // ...
}

Claude Code:

Claude Code reads project context from CLAUDE.md (covered in depth in Topic 4) and supports @path/to/file syntax for precise file inclusion. Combine this with a minimal repo map kept in CLAUDE.md itself.
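A sketch of what such a CLAUDE.md fragment might look like (section name and paths are illustrative, reusing the Express.js example from earlier):

```markdown
## Codebase Structure

- src/api/routes/ — Express route handlers (users.ts, auth.ts)
- src/services/ — UserService (user CRUD), AuthService (JWT auth)
- src/models/ — User and Token interfaces

When modifying user logic, pull in full content with @src/services/UserService.ts.
```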

Tip: Test your structural context by asking your AI tool a navigation question: "Which file would I modify to change how users are authenticated?" A well-structured context should enable the model to answer this correctly without reading any file content. If it cannot, your structural context needs more detail.


Maintaining Structural Context Over Time

Repo maps and file summaries are only valuable if they stay current. A stale map pointing to deleted files or missing new services is worse than no map.

Automation strategies:

#!/bin/bash

echo "Updating repo map..."
aider --show-repo-map > docs/repo-map.txt  # stdout only; stderr would pollute the map

echo "Checking for new/modified files since last summary update..."
git diff --name-only HEAD~1 | grep -E '\.(ts|py|java|go)$' | while read filepath; do
    echo "File changed: $filepath — consider regenerating its summary"
done

echo "Context files updated."
name: Update Codebase Context
on:
  push:
    branches: [main]
    paths:
      - 'src/**/*.ts'
      - 'src/**/*.py'

jobs:
  update-context:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate repo map
        run: |
          pip install aider-chat  # aider is a Python package; or use ctags
          aider --show-repo-map > docs/repo-map.txt
      - name: Commit updated context
        run: |
          git add docs/repo-map.txt
          git diff --staged --quiet || git commit -m "chore: update repo map"
          git push

Tip: Treat your repo map and file summaries as first-class documentation artifacts, not throwaway AI aids. Check them into version control, review changes to them in PRs, and include updating them in your "definition of done" for feature work. A team that maintains good structural context files will onboard new engineers (both human and AI) dramatically faster.