
Hands-On System Design Session


This capstone takes you through a complete system design session using AI at every stage — from a blank-page requirements dump to a fully documented, stress-tested architecture ready for engineering review.

What You Will Build

You will design a URL shortener service at scale. The requirements are not trivial:

  • Store 1 billion shortened URLs
  • Handle 10,000 write requests per second (URL creation)
  • Handle 100,000 read requests per second (URL redirects)
  • 99.99% availability for redirects (reads are more critical than writes)
  • P99 redirect latency under 50 milliseconds globally
  • URLs expire after a configurable duration (default 1 year, max 10 years)
  • Analytics: click count, country, referrer, device type per shortened URL
  • Custom aliases supported (user-defined short codes)
  • Abuse prevention: block malicious destination URLs
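Before designing anything, it is worth sanity-checking what these numbers imply. The sketch below runs the back-of-envelope arithmetic; the 500-byte average record size is an assumption for illustration, not a figure from the requirements.

```python
# Back-of-envelope capacity check for the stated requirements.
# AVG_URL_RECORD_BYTES is an assumed figure (short code + destination URL + metadata).

TOTAL_URLS = 1_000_000_000
AVG_URL_RECORD_BYTES = 500
PEAK_READS_PER_SEC = 100_000
PEAK_WRITES_PER_SEC = 10_000
SECONDS_PER_DAY = 86_400

# Primary store size for URL metadata, before indexes and replication
metadata_gb = TOTAL_URLS * AVG_URL_RECORD_BYTES / 1e9
print(f"URL metadata: ~{metadata_gb:.0f} GB")

# Click events generated per day if peak read traffic were sustained
events_per_day = PEAK_READS_PER_SEC * SECONDS_PER_DAY
print(f"Click events/day at sustained peak: {events_per_day:,}")

# New URLs created per day if peak write traffic were sustained
urls_per_day = PEAK_WRITES_PER_SEC * SECONDS_PER_DAY
print(f"New URLs/day at sustained peak: {urls_per_day:,}")
```

Half a terabyte of metadata fits comfortably in one PostgreSQL instance, but billions of click events per day clearly do not, which foreshadows the analytics split made in later phases.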

By the end of this capstone, you will have produced: a domain model, an entity relationship diagram, a component design, a data model, an API design, two ADRs for key decisions, a C4 Context diagram, and a system overview document. Every step includes a copy-paste ready prompt.

Learning tip: Do not skip ahead to later steps. Each step builds on the output of the previous one. Running steps out of order produces incoherent results because the AI loses the context accumulated in earlier steps.

Phase 1: Requirements Discovery

Before designing anything, you must surface the requirements that are not in the brief above. Use the following prompt to start the discovery session.

Step 1: Run the requirements interview

I am designing a URL shortener service with the following initial requirements:
- 1 billion stored URLs
- 10,000 URL creation requests per second
- 100,000 redirect requests per second
- 99.99% availability for redirects
- P99 redirect latency under 50ms globally
- Configurable URL expiration (default 1 year, max 10 years)
- Per-URL analytics: click count, referrer, country, device type
- Custom aliases (user-defined short codes)
- Abuse prevention for malicious destination URLs

Before we design anything, interview me to surface requirements I have not specified. Ask about:
- Multi-tenancy and authentication model
- URL ownership and deletion rules
- Analytics latency requirements (real-time vs. eventual)
- Rate limiting and quota model
- Compliance and data residency requirements
- The short code character set and length
- Redirect type (301 vs. 302) and caching implications
- Geographic distribution requirements

Ask the questions grouped by category. Wait for my answers before moving on.

Expected output: A series of clarifying questions that reveal unstated requirements. For this exercise, answer the questions with:

  • Single-tenant with API key auth
  • URL owners can delete their URLs; deleted URLs return 404
  • Analytics are eventual (up to 5 minutes delay acceptable)
  • 1,000 URL creations per day per API key
  • No strict data residency requirements; global distribution for reads
  • Short codes: 7 alphanumeric characters (a-z, 0-9), case-insensitive
  • 302 redirects (no browser caching to maintain accurate analytics)
  • Multi-region active-active for reads; single-region primary for writes

Learning tip: Notice how the 302 vs. 301 decision has a direct impact on the analytics architecture. 301 redirects are cached by browsers, meaning you lose visibility into repeat visits from the same user. Decisions in one requirement category have non-obvious dependencies in others.

Phase 2: Domain Modeling

Step 2: Identify domain entities

Based on the requirements we have just established for the URL shortener service, identify the domain entities. For each entity:
- Name and brief description
- Key attributes (name and type)
- Whether it is an Entity (has persistent identity) or a Value Object
- Its most important business rules (constraints that must always be true)

Also identify the key domain events (past-tense, things that happen in the system).

Do not generate a database schema yet. Work at the conceptual domain model level.

Expected domain entities include: ShortUrl (entity), CustomAlias (value object or entity depending on design), ApiKey (entity), ClickEvent (entity or value object), Organization (entity if multi-tenant, skipped for single-tenant), AbuseReport (entity).

Key domain events include: UrlCreated, UrlDeleted, UrlExpired, RedirectServed, AbuseReported.
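To make the Entity vs. Value Object distinction concrete, here is a minimal sketch of the ShortUrl entity enforcing its own business rules. The class and constant names are illustrative, not part of any prescribed implementation; the invariants (7-character code, 10-year expiry cap) come from the requirements above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import re

MAX_LIFETIME = timedelta(days=3650)      # 10-year cap from the requirements
DEFAULT_LIFETIME = timedelta(days=365)   # default 1-year expiry
CODE_PATTERN = re.compile(r"^[a-z0-9]{7}$")  # 7 chars, case-insensitive set

@dataclass
class ShortUrl:
    """Domain entity: persistent identity via short_code, self-enforced invariants."""
    short_code: str
    destination_url: str
    created_at: datetime
    expires_at: datetime

    def __post_init__(self):
        if not CODE_PATTERN.match(self.short_code.lower()):
            raise ValueError("short_code must be 7 alphanumeric characters")
        if not (self.created_at < self.expires_at <= self.created_at + MAX_LIFETIME):
            raise ValueError("expiry must fall within (created_at, created_at + 10 years]")

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
url = ShortUrl("abc1234", "https://example.com", now, now + DEFAULT_LIFETIME)
```

Putting the constraints in the entity's constructor means an invalid ShortUrl cannot exist in memory at all, which is the "constraints that must always be true" property the prompt asks for.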

Step 3: Generate a Mermaid ERD

Generate a Mermaid entity relationship diagram for the URL shortener domain model. Include:
- All entities identified above
- Key attributes and data types
- All relationships with cardinality notation
- Foreign key references

Use snake_case for table and column names. Include short_code as the unique business key on the SHORT_URL table. Use UUIDs for all surrogate primary keys. Omit created_at/updated_at for clarity; we will add those in the physical data model.

Expected output (example):

erDiagram
    API_KEY {
        uuid id PK
        string key_hash
        string name
        int daily_write_quota
        boolean is_active
    }
    SHORT_URL {
        uuid id PK
        string short_code UK
        string destination_url
        uuid api_key_id FK
        timestamp expires_at
        boolean is_deleted
        boolean is_blocked
    }
    CLICK_EVENT {
        uuid id PK
        uuid short_url_id FK
        string country_code
        string referrer
        string device_type
        timestamp clicked_at
    }
    API_KEY ||--o{ SHORT_URL : "owns"
    SHORT_URL ||--o{ CLICK_EVENT : "receives"

Learning tip: Notice that CLICK_EVENT is append-only and will grow at 100,000 rows per second at peak. This table cannot live in the same OLTP database as SHORT_URL. Spotting this kind of data volume mismatch in the domain model phase prevents a painful migration later.

Phase 3: Component Design

Step 4: Generate the system component design

Now design the system components for the URL shortener. The scale requirements are:
- 10K writes/sec, 100K reads/sec
- 1B total URLs stored
- Sub-50ms P99 redirect latency globally
- Eventual analytics (5-minute delay acceptable)
- 99.99% redirect availability
- Single-region write primary, multi-region read replicas

Design the component architecture. For each component:
- Name and responsibility (one sentence)
- The technology you recommend and why
- Estimated load it will handle
- Key failure modes

Address these specific design questions:
1. How will you generate unique 7-character short codes at 10K/sec without collisions?
2. How will you serve redirects at sub-50ms globally with a single-region database?
3. How will you handle the analytics write volume without impacting redirect latency?
4. How will you implement abuse prevention without adding latency to the redirect path?

Describe the architecture as a numbered component list, then describe the data flow for redirect and creation operations separately.

Expected key components:
- API Gateway / Load Balancer
- URL Creation Service
- Redirect Service (read-optimized, with aggressive caching)
- Short Code Generator (pre-generation pool or hash-based)
- Multi-region CDN / edge cache layer
- Primary database (PostgreSQL or similar) for URL metadata
- Distributed cache (Redis) for hot URL lookups
- Analytics ingestion pipeline (Kafka or similar async queue)
- Analytics processing service (consuming from queue, writing to analytics store)
- Analytics read store (ClickHouse or similar columnar store for aggregation queries)
- Abuse check service (async, does not block redirect path)
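Design question 1 has several viable answers; one of the alternatives the later ADR considers is random generation with collision retry. The sketch below shows why it is plausible at this scale: with ~1 billion codes claimed out of ~78 billion possible, a random draw collides only about 1.3% of the time. A Python set stands in for the database uniqueness check.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 chars, per the requirements
CODE_LENGTH = 7                                    # 36**7 ≈ 78 billion combinations

def random_code() -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))

def generate_unique_code(existing: set[str], max_attempts: int = 5) -> str:
    """Random generation with retry; `existing` stands in for a DB uniqueness check.

    At ~1B used codes out of 78B, each attempt collides with probability ~1.3%,
    so retries are rare, but a hard cap guards against pathological saturation.
    """
    for _ in range(max_attempts):
        code = random_code()
        if code not in existing:
            return code
    raise RuntimeError("could not generate a unique code; namespace may be saturated")

taken = {"abc1234"}
code = generate_unique_code(taken)
```

The trade-off against the pre-generation pool chosen in Step 8 is that every retry here costs a database round-trip on the write path, whereas the pool moves that cost into a background job.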

Step 5: Stress-test the component design

Stress-test the component design with these scenarios. For each, describe the failure mode and the mitigation already present in the design, or identify a gap:

1. The Redis cache cluster loses a node. Impact on redirect latency? How quickly does the cache warm?
2. The analytics Kafka cluster falls behind by 30 minutes. Impact on users?
3. A single short code receives 1 million redirect requests in 10 minutes (a viral link). What breaks?
4. A customer deletes a URL. How quickly is the cached entry invalidated globally?
5. The short code generation service is under-provisioned at 15K writes/sec.
6. What is the blast radius if the primary PostgreSQL database becomes unavailable?

Learning tip: Step 5 will almost always find at least one gap that Step 4 missed. In this case, the cache invalidation for deleted URLs and the viral link thundering herd are the two issues most commonly overlooked in URL shortener designs.
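The viral-link scenario is usually mitigated with request coalescing (sometimes called "singleflight"): when many concurrent requests miss the cache for the same key, only one of them loads from the database while the rest wait for its result. The sketch below is an in-process illustration of the idea, with hypothetical names; a production system would apply it at the cache or service layer.

```python
import threading

class SingleFlightCache:
    """Request coalescing: concurrent misses for one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader          # e.g. a database lookup
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                     # only one caller loads; the rest wait here
            if key not in self._cache:
                self._cache[key] = self._loader(key)
            return self._cache[key]

calls = []
def slow_db_lookup(code):
    calls.append(code)                 # record how often the "database" is hit
    return f"https://example.com/{code}"

cache = SingleFlightCache(slow_db_lookup)
threads = [threading.Thread(target=cache.get, args=("viral12",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent requests for the same viral code produce exactly one database lookup, which is the property that keeps a cache-miss storm from overwhelming the primary.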

Phase 4: Data Model

Step 6: Design the physical data model

Design the physical data model for the URL shortener. We are using:
- PostgreSQL 15 for URL metadata (SHORT_URL and API_KEY tables)
- Redis for the redirect cache
- ClickHouse for analytics (CLICK_EVENT)

For PostgreSQL:
- Full CREATE TABLE statements with all columns including created_at, updated_at
- All indexes needed for the following queries:
  a. Look up a SHORT_URL by short_code (redirect path — must be extremely fast)
  b. List all SHORT_URLs owned by an API_KEY with pagination
  c. Check if a short_code is available for custom alias registration
  d. Find all expired SHORT_URLs for cleanup jobs
- Explain your index choices

For Redis:
- The cache key structure for URL lookups
- The value structure stored in cache
- The TTL strategy

For ClickHouse:
- The CLICK_EVENT table definition with appropriate engine (MergeTree family)
- The primary key / sorting key choice and why
- The partitioning strategy

Expected output highlights: A short_code index that is the most critical index in the system. Redis key structure url:{short_code} storing a JSON blob or hash of destination_url and metadata. ClickHouse with ReplicatedMergeTree partitioned by toYYYYMM(clicked_at) and sorted by (short_url_id, clicked_at) for efficient per-URL analytics queries.
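The Redis key and TTL strategy can be sketched as pure functions. The baseline TTL, the jitter refinement (which spreads out mass expiry), and the negative-caching flag are assumptions layered on top of the expected output above; a plain dict stands in for Redis.

```python
import json
import random

BASE_TTL_SECONDS = 3600          # assumed baseline cache TTL

def cache_key(short_code: str) -> str:
    return f"url:{short_code}"   # key structure from the expected output above

def cache_ttl(remaining_lifetime_s: int, jitter: float = 0.1) -> int:
    """TTL never outlives the URL itself; jitter avoids synchronized expiry."""
    ttl = min(BASE_TTL_SECONDS, remaining_lifetime_s)
    return max(1, int(ttl * (1 - jitter * random.random())))

def cache_value(destination_url: str, is_deleted: bool = False) -> str:
    # Caching deletions ("negative caching") lets deleted URLs 404 without a DB hit
    return json.dumps({"dest": destination_url, "deleted": is_deleted})

fake_redis = {}
fake_redis[cache_key("abc1234")] = cache_value("https://example.com")
```

Capping the TTL at the URL's remaining lifetime means an expired URL can never be served from a stale cache entry, which removes one whole class of consistency bugs.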

Phase 5: API Design

Step 7: Generate the API specification

Design the REST API for the URL shortener service. Cover these endpoints:
1. Create a short URL
2. Get short URL details (metadata, not redirect)
3. Delete a short URL
4. List short URLs for the authenticated API key
5. Get analytics for a short URL (click count, top countries, top referrers for a date range)
6. The redirect endpoint itself

For each endpoint provide:
- HTTP method and path
- Request body or query parameters (with types and validation rules)
- Success response (status code and body)
- Error responses (status codes and when they occur)
- Rate limiting behavior

Use REST conventions. The redirect endpoint should be designed for maximum CDN cacheability — or explain why it cannot be cached.

Expected key design decisions:
- POST /urls for creation returning 201 with the full short URL object
- GET /{short_code} for redirect — this is the hot path, separate domain preferred (e.g., sho.rt/{code})
- Redirect returns 302 with Cache-Control: no-store because analytics require every hit
- Analytics endpoint is on the API subdomain, not the redirect subdomain, to separate concerns

Learning tip: Separating the redirect domain (sho.rt) from the API domain (api.sho.rt) is a critical performance decision. The redirect domain can be optimized entirely for CDN and edge caching, while the API domain handles authenticated management traffic with different latency and throughput requirements.

Phase 6: Architecture Decision Records

Step 8: Generate ADR for the short code generation strategy

Generate an Architecture Decision Record for the short code generation strategy we chose. The decision is:

We will use a pre-generation pool approach: a background service pre-generates batches of unique 7-character short codes and stores them in a "code pool" table in PostgreSQL. The URL creation service atomically claims a code from this pool. Custom aliases bypass the pool and are inserted directly with a uniqueness constraint check.

Context:
- 10,000 URL creation requests per second
- 7-character alphanumeric codes (a-z, 0-9), 36^7 = ~78 billion possible codes
- We need uniqueness guarantees without race conditions
- Alternatives considered: hash-based (MD5 of destination URL, truncated), random generation with retry, counter-based with Base62 encoding, distributed ID generator (Snowflake)

Format as a complete ADR with: Title, Status (Accepted), Context, Decision, Options Considered (with pros/cons for each), Consequences (positive and negative), and When to Revisit.

Step 9: Generate ADR for the analytics architecture

Generate an Architecture Decision Record for the analytics architecture decision.

Decision: Analytics click events will be written asynchronously to Kafka by the Redirect Service, consumed by an Analytics Processing Service, and stored in ClickHouse for querying. Analytics are eventually consistent with up to 5 minutes of delay.

Context:
- 100,000 redirects per second = 100,000 click events per second at peak
- Analytics must not add latency to the redirect path (sub-50ms P99 requirement)
- Analytics queries: click counts per URL per day, top 10 countries, top 10 referrers
- Eventual consistency up to 5 minutes is acceptable per business requirements
- Team has experience with PostgreSQL and Redis; no current Kafka or ClickHouse expertise
- Alternatives: write directly to PostgreSQL, write to PostgreSQL async via outbox pattern, write to BigQuery, write to a time-series database

Format as a complete ADR. Include a specific section on the operational complexity cost and what investments are needed to operate this stack safely.

Learning tip: The operational complexity note in this ADR is important. Kafka and ClickHouse are powerful but require dedicated operational expertise. An honest ADR acknowledges this cost and makes it visible to decision-makers who may not realize it. "We accept X trade-off" is a much safer decision than "we didn't consider X."
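The shape of the asynchronous pipeline in this ADR can be miniaturized in a few lines, with `queue.Queue` standing in for Kafka and a Counter standing in for ClickHouse. The point of the sketch is the separation of concerns: the redirect path only enqueues and returns, so analytics volume never adds latency to the hot path.

```python
import queue
import threading
from collections import Counter

events = queue.Queue()     # stands in for the Kafka topic
click_counts = Counter()   # stands in for the ClickHouse aggregate store

def record_click(short_code: str, country: str) -> None:
    """Hot path: enqueue and return immediately; no lookup latency added."""
    events.put({"code": short_code, "country": country})

def consume(n_events: int) -> None:
    """Analytics consumer: drain events and update aggregates asynchronously."""
    for _ in range(n_events):
        ev = events.get()
        click_counts[ev["code"]] += 1
        events.task_done()

record_click("abc1234", "US")
record_click("abc1234", "DE")
record_click("zzz9999", "US")

worker = threading.Thread(target=consume, args=(3,))
worker.start()
worker.join()
```

Everything that makes the real stack operationally expensive (partitioning, consumer group rebalancing, backpressure, replay) lives in the gap between this toy and production, which is exactly the cost the ADR's operational complexity section must make visible.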

Phase 7: C4 Context Diagram

Step 10: Generate the C4 Context diagram

Generate a Mermaid C4 Context diagram for the URL shortener system.

System name: URL Shortener Platform
Description: Allows API clients to create shortened URLs and serves high-speed redirects to end users globally

Users:
- API Client (developer or application): creates and manages shortened URLs via REST API
- End User: clicks a shortened link and is redirected to the destination

External systems:
- Abuse Check Database (third-party): used to validate destination URLs against known malicious URL lists (e.g., Google Safe Browsing API)
- Analytics Dashboard (internal): consumes analytics data via read API to render customer-facing reports

Internal system boundaries to show:
- URL Management API (handles creation, deletion, listing)
- Redirect Service (serves the actual redirects at high speed)
- Analytics Service (stores and serves click analytics)

Use proper Mermaid C4Context syntax. Include Boundary() wrappers and label all relationships with the communication type.

Expected Mermaid output:

C4Context
  title System Context — URL Shortener Platform

  Person(apiClient, "API Client", "Developer or application creating and managing shortened URLs")
  Person(endUser, "End User", "Person clicking a shortened link")

  Enterprise_Boundary(platform, "URL Shortener Platform") {
    System(urlApi, "URL Management API", "Create, list, delete short URLs. REST over HTTPS.")
    System(redirectService, "Redirect Service", "Resolves short codes and issues 302 redirects. Globally distributed.")
    System(analyticsService, "Analytics Service", "Stores click events and serves aggregated analytics.")
  }

  System_Ext(safeBrowsing, "Google Safe Browsing API", "Validates destination URLs against known malicious URL lists")
  System_Ext(analyticsDashboard, "Analytics Dashboard", "Internal BI tool consuming analytics for customer reports")

  Rel(apiClient, urlApi, "Creates/manages URLs", "REST/HTTPS")
  Rel(endUser, redirectService, "Clicks short link", "HTTP/HTTPS")
  Rel(redirectService, urlApi, "Looks up short code", "Internal gRPC")
  Rel(urlApi, safeBrowsing, "Validates destination URL", "HTTPS")
  Rel(redirectService, analyticsService, "Publishes click event", "Async/Kafka")
  Rel(analyticsDashboard, analyticsService, "Reads analytics data", "REST/HTTPS")

Phase 8: System Overview Document

Step 11: Generate the system overview

Generate a system overview document for the URL shortener platform based on everything we have designed in this session. Structure it as follows:

1. **Purpose and Scope** (2-3 sentences)
2. **Key Technical Decisions** (bulleted list of 4-5 decisions with one-sentence rationale, referencing ADRs where applicable)
3. **Component Inventory** (table: Component | Technology | Responsibility | Scale Target)
4. **Critical Data Flows** — describe in numbered steps:
   a. URL creation flow
   b. URL redirect flow
   c. Analytics ingestion flow
5. **Operational Characteristics** (availability target, expected traffic, deployment model)
6. **Known Limitations and Technical Debt** (honest list of 3-4 items)

Write in plain, direct technical prose. Avoid bullet points except where specified. Target audience: a new senior backend engineer joining the team.

Expected key items in Known Limitations: The operational overhead of Kafka and ClickHouse being new to the team. The cold start problem for the redirect cache after a deployment or cache flush. The eventual consistency window in analytics. The current lack of URL ownership transfer or team-based access control.

Learning tip: The "Known Limitations" section is the most important section of a system overview document. It tells a new engineer what they should not try to fix without a full design review, and it signals intellectual honesty about the current state of the system.

Reviewing the Full Design Output

After completing all eleven steps, you should have:

  • A list of clarified requirements (from Phase 1)
  • A domain entity model and Mermaid ERD (Phase 2)
  • A component design with stress test results (Phase 3)
  • PostgreSQL, Redis, and ClickHouse physical data models (Phase 4)
  • A REST API specification (Phase 5)
  • Two Architecture Decision Records (Phase 6)
  • A C4 Context diagram in Mermaid (Phase 7)
  • A system overview document (Phase 8)

This is a complete design package. In a professional context, this output — produced in a few hours with AI assistance — would normally take one to two weeks of collaborative design work. The quality of the output depends on the quality of your engagement: answering the AI's questions honestly, validating each output before using it as input to the next step, and applying your own judgment to override AI suggestions that do not fit your context.

Learning tip: After the design session, spend 15 minutes writing a "decisions I would make differently" note. Note which AI suggestions you overrode and why, and which AI-found issues you had not considered. This reflection compounds your learning much faster than simply completing the exercise.

Key Takeaways

  • A full system design session with AI — from blank page to complete documentation — can be completed in hours rather than days when you structure the conversation in phases.
  • The quality of each design phase depends on the completeness of the previous phase; do not skip requirements discovery because it will cause rework in every downstream step.
  • Analytics and OLTP data have fundamentally different access patterns and should be stored in purpose-built databases from the start; retrofitting this separation is expensive.
  • ADRs for the two or three most consequential decisions are sufficient for most systems; not every decision needs documentation, only the ones where the rationale will not be obvious six months later.
  • The system overview document's "Known Limitations" section is the honest representation of the current design state and the most valuable section for new team members.