
πŸ›οΈ The System Design Playbook - Part 1 πŸ“–

A deeply synthesized, opinionated reference distilled from five canonical sources: donnemartin/system-design-primer Β· ByteByteGoHq/system-design-101 Β· karanpratapsingh/system-design Β· ashishps1/awesome-system-design-resources Β· binhnguyennus/awesome-scalability

Use it as: a study guide for interviews, a checklist for design reviews, and a vocabulary for cross-team discussions.


Table of Contents

  1. πŸ“– How to Use This Playbook
  2. 🧠 The System Design Mindset
  3. πŸ”‘ Core Mental Models
  4. 🎯 The Interview Framework (RAPID-S)
  5. πŸ”’ Back-of-Envelope Math
  6. 🌐 Networking Fundamentals
  7. 🌍 DNS, CDN, and Proxies
  8. βš–οΈ Load Balancing & API Gateways
  9. πŸ—„οΈ Databases: Pick Your Engine
  10. πŸ”€ Replication, Sharding, Federation
  11. πŸ”’ Consistency, Transactions & Isolation
  12. ⚑ Caching
  13. πŸ“¨ Asynchronous Communication
  14. πŸ”Œ API Design
  15. πŸ—οΈ Architectural Patterns
  16. πŸ•ΈοΈ Distributed Systems Primitives
  17. πŸ›‘οΈ Reliability & Resilience Patterns
  18. πŸ“Š Observability, SLA/SLO/SLI
  19. πŸ” Security
  20. πŸ“ˆ Capacity Planning & Scaling Playbook
  21. 🏭 Data Engineering & Analytics
  22. πŸš€ Deployment, Release & Schema Evolution
  23. πŸ“‹ Tradeoffs Cheat Sheet
  24. πŸ’‘ Interview Problem Templates
  25. 🌟 Real-World Case Studies
  26. ⚠️ Anti-Patterns to Avoid
  27. πŸ“š Must-Read Papers & Further Reading

1. πŸ“– How to Use This Playbook

There are three audiences:

  • Interview candidate. Read sections 2–5 cold, drill section 22, then revisit section 21 the night before.
  • Engineer in a design review. Open the relevant chapter (cache, queue, db) plus section 21 and challenge each tradeoff explicitly.
  • Tech lead writing an RFC. Use section 4 as the document spine; sections 17, 18, 24 for the "Risks" section.

Reading rule: Every concept here has a counter-concept. If a passage feels like an absolute, you have not read carefully enough β€” find the tradeoff sentence.


2. 🧠 The System Design Mindset

System design is the art of making a small set of large, hard-to-reverse decisions explicit. It is rarely about choosing the "best" component; it is about choosing the component whose failure modes you can tolerate.

A good design:

  • Scales with growth without full rewrites at each 10x.
  • Fails gracefully rather than catastrophically β€” partial loss is preferable to total loss.
  • Lets independent teams move in parallel without cross-team handoffs blocking releases.
  • Makes tradeoffs explicit β€” every choice should have a paragraph saying what we gave up.

Three habits that separate senior from staff designers:

  1. Quantify before you draw. No box on the diagram should exist without an estimated QPS, latency budget, or storage size attached.
  2. Name the failure modes. For every component, ask: "what happens when this is slow / down / wrong?" If you cannot answer, you have not designed it.
  3. Defer the exotic. Reach for the boring tool (Postgres, Redis, Nginx, Kafka) until measurements force the exotic one. Instagram's three rules: use proven tech, don't reinvent, keep it simple.

3. πŸ”‘ Core Mental Models

3.1 The Six Axes Every Design Lives On

| Axis | Left extreme | Right extreme | Drives choice of |
|---|---|---|---|
| Consistency vs Availability | Strong consistency (CP) | High availability (AP) | Database, replication strategy |
| Latency vs Throughput | Optimize p99 of one request | Maximize req/sec aggregate | Sync vs batched, queueing |
| Read-heavy vs Write-heavy | Cache + replicas | Shard + partition + queue | Storage + access pattern |
| Monolith vs Microservices | Single deployable | Many fine-grained services | Org structure + deployment cadence |
| Sync vs Async | In-line response | Decoupled, eventual | Coupling + tolerance to lag |
| Stateless vs Stateful | Scales linearly | Sharding complexity required | Where you put the hard problem |

3.2 CAP and PACELC

CAP (Brewer): in a network partition, a distributed system can only guarantee two of three: Consistency, Availability, Partition tolerance. Since partitions are inevitable in distributed systems, the practical choice is CP or AP.

  • CP (consistency + partition tolerance): HBase, MongoDB (default), Spanner, Zookeeper. Reject requests during partitions to preserve correctness.
  • AP (availability + partition tolerance): Cassandra, DynamoDB (default), CouchDB. Accept stale reads during partitions; reconcile later.
  • CA without P: only single-node systems. Postgres, MySQL on one box. Not a real distributed-system choice.

PACELC extends CAP with normal-operation behavior: "if Partitioned, choose A or C; Else, choose Latency or Consistency." Examples: Spanner is PC/EC (consistent always, pays latency); Cassandra is PA/EL (favors availability + low latency).

Practical rule: Most "we need strong consistency" claims are really "we need linearizability for one specific operation." Design that one operation around a sequencer (single shard, leader, lock, distributed transaction) and let the rest be eventually consistent.

3.3 ACID vs BASE

| | ACID | BASE |
|---|---|---|
| Atomicity / Basic availability | Transaction is all-or-nothing | System keeps responding even if degraded |
| Consistency / Soft state | Constraints hold post-tx | State may change without input |
| Isolation / Eventual consistency | Concurrent tx behave as serial | Nodes converge over time |
| Durability | Committed writes persist | (implicit) |
| Use when | Money, inventory, identity | Feeds, search, analytics, leaderboards |

3.4 Performance vs Scalability β€” Distinct Problems

  • Performance problem: the system is slow for one user.
  • Scalability problem: the system is fine for one user but degrades as you add load.

You can have a fast non-scalable system (single beefy box) or a scalable slow system (loosely-coupled microservices with bad cache hit rate). You usually want both, but you fix them with different techniques.

3.5 Latency vs Throughput vs Bandwidth

  • Latency: time to do one thing (ms).
  • Throughput: things per unit time (QPS, MB/s).
  • Bandwidth: maximum throughput a channel could carry.

Little's Law: concurrency = throughput Γ— latency. If a service handles 1000 req/s with 100 ms latency, it has 100 in-flight requests on average. This is the back-of-envelope formula for thread/connection pool sizing.
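A two-line sanity check, as a Python sketch (the numbers are the example's above, not universal constants):

```python
def avg_inflight(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = throughput x latency."""
    return throughput_rps * latency_s

# 1,000 req/s at 100 ms latency -> 100 requests in flight on average,
# so a thread/connection pool much smaller than 100 will queue.
print(avg_inflight(1000, 0.100))  # 100.0
```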


4. 🎯 The Interview Framework (RAPID-S)

A 6-step structure that fits a 45-minute design interview, adapted from system-design-primer and reinforced by ByteByteGo.

| Step | Time | Output |
|---|---|---|
| **R**equirements | 5 min | Functional + non-functional list, scale numbers |
| **A**PI | 5 min | Endpoints, request/response shapes |
| **P**lumbing (HLD) | 10 min | Boxes-and-arrows diagram |
| **I**nternals (LLD) | 15 min | Schema, indexes, partition keys, algorithms |
| **D**eep dives | 5 min | One or two areas the interviewer steers you to |
| **S**cale + reliability | 5 min | Bottlenecks, failure modes, observability |

4.1 Step 1 β€” Requirements

Ask before assuming. Functional ("what does it do?") and non-functional ("how well?"):

  • DAU / MAU, peak QPS (often 5x average), read/write ratio.
  • p50 and p99 latency budgets.
  • Durability β€” how much data loss is acceptable (RPO)?
  • Availability target β€” three nines? four?
  • Geographic distribution β€” single region vs global?
  • Consistency requirement β€” strong on which entities?

State assumptions explicitly: "I'll assume 100M DAU, 10:1 read:write, p99 < 200 ms, eventual consistency on feed but strong on payments."

4.2 Step 2 β€” APIs first

Defining the public contract first forces clarity. For each endpoint specify method, path, params, response, idempotency. This anchors the rest of the design.

4.3 Step 3 β€” High-Level Design

Draw 5–7 boxes. Typical: client β†’ CDN β†’ LB β†’ API gateway β†’ service(s) β†’ cache β†’ primary DB + replicas + queue + worker. Justify each box; remove any you cannot justify.

4.4 Step 4 β€” Low-Level Design

This is where you earn the title. Per service: data model with PK/SK, indexes, partition key, hot-key strategy, cache key, TTL. Per algorithm: name it (consistent hash, geohash, bloom filter, top-k via count-min sketch).

4.5 Step 5 β€” Deep Dives

Expect interviewer to pick the weakest area. Common targets: hot partition handling, idempotency for retries, exactly-once semantics, schema migration without downtime.

4.6 Step 6 β€” Bottlenecks & Reliability

Walk every box and ask: what fails when this is slow / dies / lies? Add timeouts, retries with jitter, circuit breakers, rate limits, fallbacks, dead-letter queues. State your monitoring (RED + USE), alerts, and runbook headings.


5. πŸ”’ Back-of-Envelope Math

In a 45-minute design interview, you have ~5 minutes to size the system. The goal is not precision β€” it's getting within an order of magnitude in seconds, then defending the assumption. The numbers below are the toolbox; this chapter shows how to wield them.

The same math runs the design review: when someone proposes a new dependency, a new cache layer, or a 10Γ— scale-up, an engineer who can compute the consequence on a napkin out-arguments three engineers who can't.

5.1 Powers of Two (memorize)

Computers count in powers of 2; capacity, addressing, and memory come in 2ⁿ. The convenient coincidence: 2¹⁰ β‰ˆ 10Β³, so every step of 2¹⁰ lines up with a factor of 10Β³ and you can convert between binary and decimal in your head.

| Power | Approx | Name | Where you see it |
|---|---|---|---|
| 2^10 | 10^3 | thousand (KB) | Packet, small file |
| 2^20 | 10^6 | million (MB) | Image, document |
| 2^30 | 10^9 | billion (GB) | Per-host RAM, HD video |
| 2^40 | 10^12 | trillion (TB) | Database, single dataset |
| 2^50 | 10^15 | quadrillion (PB) | Datacenter-scale storage |
| 2^60 | 10^18 | exabyte (EB) | Hyperscaler totals |

Bit-budget shortcuts that come up constantly:

  • A signed 32-bit int holds ~2.1 Γ— 10⁹. User IDs, tweet IDs, and bigint counters all hit this ceiling β€” that's why you'll find production migrations from int β†’ bigint in every old codebase.
  • A signed 64-bit int holds ~9.2 Γ— 10¹⁸ β€” effectively infinite for any counter you'll ever build.
  • A 64-bit nanosecond timestamp covers ~292 years from 1970.
  • UUIDv4 = 128 bits = 16 bytes binary, ~36 chars hex, ~22 chars base64.

Typical record sizes (memorize the order of magnitude):

| Item | Size |
|---|---|
| Boolean, int8, char | 1 B |
| int32, float32, IPv4 | 4 B |
| int64, float64, timestamp | 8 B |
| UUID (binary) | 16 B |
| SHA-256 hash | 32 B |
| Tweet text | ~140 B |
| URL | ~100 B |
| JSON user record | 0.5–2 KB |
| Web image (compressed) | 50–500 KB |
| Phone photo (full) | 1–5 MB |
| HD video (per minute) | ~30 MB |
| 4K video (per minute) | ~200 MB |

These prevent the most common interview mistake: estimating storage off by 1000Γ— because you mixed up KB and MB.

5.2 Latency Numbers Every Programmer Should Know

Originally compiled by Jeff Dean and updated by Peter Norvig. The values below are the modern, rounded version. Memorize them β€” every capacity argument descends from this table.

| Operation | Time | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | "free" |
| Branch mispredict | 5 ns | Flush the pipeline |
| L2 cache reference | 7 ns | 14Γ— L1 |
| Mutex lock/unlock | 25 ns | Uncontended; contention is much worse |
| Main memory reference | 100 ns | 200Γ— L1 |
| Compress 1 KB with Zippy / Snappy | 10 Β΅s | |
| Send 1 KB over 1 Gbps | 10 Β΅s | Network bandwidth, not latency |
| Read 4 KB random from SSD | 150 Β΅s | NVMe is faster (10–50 Β΅s) |
| Read 1 MB sequential from memory | 250 Β΅s | |
| Round-trip within same datacenter | 500 Β΅s (0.5 ms) | One AZ-to-AZ hop |
| Read 1 MB sequential from SSD | 1 ms | |
| Disk seek | 10 ms | Why databases hate random I/O |
| Read 1 MB sequential from disk | 20 ms | 20Γ— SSD |
| Cross-region (intra-continent) | 10–60 ms | |
| Cross-continent round-trip | ~150 ms | Speed of light through fiber |

Time-scaled to human terms (intuition pump). If 1 ns = 1 second:

| Operation | Human-scale |
|---|---|
| L1 hit | 0.5 s (a heartbeat) |
| Memory access | ~2 minutes |
| SSD random read | ~1.5 days |
| Same-DC round trip | ~6 days |
| 1 MB from disk | ~8 months |
| Cross-continent round trip | ~5 years |

This is why crossing layers β€” process β†’ host β†’ datacenter β†’ region β€” is the dominant design concern. Each boundary is 10–100Γ— slower than the one before.

Operational implications:

  • Never block a user request on a cross-region call unless you absolutely must. 150 ms is a non-negotiable speed-of-light tax that blows most p99 budgets.
  • Disk seeks are the enemy. Sequential I/O is ~100Γ— faster than random. This is the reason LSM-trees, log-structured storage, and append-only logs win for write-heavy workloads.
  • A network call costs roughly the same as 1 MB of memory work. A chatty service that issues 50 RPCs per page-render burns 50 Γ— 0.5 ms = 25 ms in network alone, before any actual work.
  • Memory bandwidth dominates within a process. Allocating millions of small objects is often slower than fewer big ones, because cache misses, not CPU work, are the bottleneck.
  • Compression is essentially free at 10 Β΅s per KB compared to network I/O β€” always compress payloads crossing the network.

Typical p99 latency budget for a 200 ms web request:

| Component | Budget |
|---|---|
| TLS handshake + LB + ingress | 5–10 ms |
| App server processing | 20–30 ms |
| 1–3 cache lookups | 1–5 ms |
| 1–2 database queries | 20–50 ms |
| 1–2 downstream RPCs | 10–30 ms each |
| Response serialization + egress | 5 ms |
| Headroom for tail / GC / retries | the rest |

If any single component eats > 50 ms, scrutinize it. The discipline of budgeting latency before building catches more performance bugs than any profiler.

5.3 Time, Throughput, and Storage Quick Reference

Time conversions to memorize:

  • 1 day = 86,400 s β‰ˆ 10⁡ s
  • 1 month β‰ˆ 2.6 Γ— 10⁢ s
  • 1 year β‰ˆ 3.15 Γ— 10⁷ s β‰ˆ 32 M s

Throughput conversions:

  • QPS = daily_requests Γ· 86,400. 1 M requests/day β‰ˆ 12 QPS average.
  • Peak QPS β‰ˆ 2–10Γ— average, depending on workload. Consumer apps spike hard at evenings and weekends; B2B SaaS spikes at business hours; ad systems are flatter. Default to 5Γ— when you don't know.
  • Bandwidth = QPS Γ— payload_size. 1,000 QPS Γ— 100 KB = 100 MB/s = 800 Mbps.
  • Daily ingest = QPS Γ— payload Γ— 86,400.

Storage growth:

  • Annual storage = avg_QPS Γ— bytes_per_record Γ— 86,400 Γ— 365 Γ— replication_factor
  • 5-year retention with 3Γ— replication = 15Γ— the year-1 raw number.
  • Rule of thumb: a 1 KB record at 1,000 QPS sustained for a year Γ— 3 replicas β‰ˆ 100 TB.

Worked example β€” Twitter sizing.

  • 500 M DAU, each posts 0.2 tweets/day and reads 100 tweets/day.
  • Writes: 500 M Γ— 0.2 = 100 M tweets/day β†’ ~1,200 write QPS avg, ~6,000 peak.
  • Reads: 500 M Γ— 100 = 50 B reads/day β†’ ~580 K read QPS avg, ~3 M peak. Read:write = 500:1 β€” read-dominated, cache aggressively.
  • Per tweet: ~1 KB with metadata. Daily ingest = 100 GB. 5 years Γ— 3 replicas β‰ˆ 550 TB. Storage fits on one cluster, so storage isn't the dominant constraint β€” read QPS and fan-out are.

This is the right shape of an interview answer: numbers anchored, ratio called out, and the constraint named.
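The same sizing reproduced as a napkin-math sketch in Python β€” all inputs are the assumptions stated above, not measurements:

```python
DAY_S = 86_400
dau = 500e6
writes_per_day = dau * 0.2        # 100 M tweets/day
reads_per_day = dau * 100         # 50 B reads/day
tweet_bytes = 1_000               # ~1 KB with metadata

write_qps = writes_per_day / DAY_S            # ~1,200 avg; ~6,000 at 5x peak
read_qps = reads_per_day / DAY_S              # ~580 K avg; ~3 M at 5x peak
daily_ingest_gb = writes_per_day * tweet_bytes / 1e9    # ~100 GB/day
storage_5y_tb = daily_ingest_gb * 365 * 5 * 3 / 1e3     # 3x replication, ~550 TB

print(f"write {write_qps:,.0f} QPS, read {read_qps:,.0f} QPS, "
      f"r:w {read_qps / write_qps:,.0f}:1, 5y storage {storage_5y_tb:,.0f} TB")
```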

Read-to-write ratios (rough priors for common system types):

| System | Read : Write |
|---|---|
| Social feed (Twitter, Instagram, TikTok) | 100:1 to 1000:1 |
| Document collab (Notion, Google Docs) | 5:1 to 20:1 |
| E-commerce browse vs purchase | ~100:1 |
| Banking / ledger | ~1:1 |
| Logging / metrics / event ingest | 1:100 (write-heavy) |
| Search (queries vs reindex) | ~100:1 |

Read:write ratio is the most important early signal for the design. Read-heavy β†’ cache + replicas + denormalize. Write-heavy β†’ partition + queue + LSM-tree.

5.4 Availability in Numbers

| Availability | Annual downtime | Monthly | Daily |
|---|---|---|---|
| 99% (2-9s) | 3.65 days | 7.2 h | 14.4 min |
| 99.9% (3-9s) | 8.77 h | 43.8 min | 1.44 min |
| 99.95% | 4.38 h | 21.9 min | 43.2 s |
| 99.99% (4-9s) | 52.6 min | 4.32 min | 8.6 s |
| 99.999% (5-9s) | 5.26 min | 25.9 s | 0.86 s |
| 99.9999% (6-9s) | 31.5 s | 2.6 s | 0.09 s |

Each additional 9 costs roughly 10Γ— more in engineering hours, infrastructure, and operational complexity. Industry reality:

  • Most consumer products live at 99.9–99.95%.
  • Tier-1 SaaS commits to 99.95–99.99%.
  • Payment networks aim for 99.99%.
  • Telephone networks were the canonical 99.999% (~5 min/year).
  • 6-9s is mythological for any single system; you only get there by composing redundant systems and counting carefully.

Series vs parallel β€” the math that drives architecture.

When components are in series (every one must be up), availabilities multiply and total goes down:

A_total = A1 Γ— A2 Γ— A3 Γ— …

A typical request path: LB (99.99%) β†’ App (99.95%) β†’ Cache (99.99%) β†’ DB (99.95%) β†’ External API (99.9%). Total: 0.9999 Γ— 0.9995 Γ— 0.9999 Γ— 0.9995 Γ— 0.999 = **99.78%** β€” worse than the worst single component.

Lesson 1. Adding a dependency always lowers your availability. Each external service is an availability tax. This is one of the strongest arguments against gratuitous microservice splits β€” every hop is a 9 you didn't earn.

When components are in parallel (any one up keeps the system up), failure probabilities multiply and total goes up:

A_total = 1 βˆ’ (1βˆ’A1) Γ— (1βˆ’A2) Γ— (1βˆ’A3) Γ— …

Two 99% replicas: 1 βˆ’ 0.01Β² = 99.99%. Three: 1 βˆ’ 0.01Β³ = 99.9999%. Redundancy compounds exponentially β€” but only if failures are independent.

Lesson 2. A redundant cluster is only as good as the correlation of its failures. Two replicas in the same rack share PDU and switch failures; two regions share a deploy pipeline; all replicas share a software bug. Audit shared dependencies, not just replica counts. The truly correlated failures (a bad deploy, a poisoned cache key) are what take down "highly available" systems.

Composite reasoning β€” what you actually compute in a design review:

A_system = A_series_path Γ— A_redundant_groups

A 3-replica DB cluster (effective 99.9999%) behind an LB (99.99%) behind an app tier (99.95%): 0.999999 Γ— 0.9999 Γ— 0.9995 β‰ˆ **99.94%** β€” roughly 5 hours downtime/year. To improve it, fix the weakest link (the 99.95% app tier here) rather than piling on more DB replicas β€” those bought you a 9 that another tier is already throwing away.
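The two formulas as a Python sketch, reproducing the numbers above:

```python
import math

def series(*avail):
    """Every component must be up: availabilities multiply."""
    return math.prod(avail)

def parallel(*avail):
    """Any replica up keeps the group up: failure probabilities multiply."""
    return 1 - math.prod(1 - a for a in avail)

request_path = series(0.9999, 0.9995, 0.9999, 0.9995, 0.999)  # ~0.9978
db_cluster = parallel(0.99, 0.99, 0.99)                        # ~0.999999
system = series(db_cluster, 0.9999, 0.9995)                    # ~0.9994
print(f"{request_path:.4f}  {db_cluster:.6f}  {system:.4f}")
```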

Error budget. If your SLO is 99.9%, you have 0.1% Γ— 30 days β‰ˆ 43 min/month of allowed downtime. That budget is spent on: deploys, experiments, planned maintenance, and unplanned outages. Burn it intentionally on shipping; preserve it during incidents. (See Β§18.3 for the operational practice.)


6. 🌐 Networking Fundamentals

6.1 OSI Model (the practical version)

| Layer | Name | Examples | When you care |
|---|---|---|---|
| 7 | Application | HTTP, gRPC, DNS, SMTP | Always |
| 6 | Presentation | TLS, compression | Auth + perf |
| 5 | Session | RPC sessions | Rarely |
| 4 | Transport | TCP, UDP, QUIC | LB algorithms, sockets |
| 3 | Network | IP, ICMP | Routing, VPC, subnets |
| 2 | Data link | Ethernet, MAC | DC engineers |
| 1 | Physical | Cables, wifi | Hardware |

Practical takeaway: L4 vs L7 load balancing, TLS at L6, CDN at L7. Most senior engineers live in L7, occasionally drop to L4 for performance, and only touch L3 for VPC/peering.

6.2 TCP vs UDP vs QUIC

| | TCP | UDP | QUIC (HTTP/3) |
|---|---|---|---|
| Connection | Handshake (3-way) | None | TLS + handshake combined (1 RTT, 0-RTT resumption) |
| Reliability | Guaranteed in-order | None | Guaranteed |
| Congestion control | Yes | No | Yes (better than TCP) |
| Head-of-line blocking | Yes | N/A | No (per-stream) |
| Use for | HTTP/1.1, HTTP/2, DBs, SSH | DNS, video, VoIP, gaming | HTTP/3, gRPC over QUIC |

Connection pooling: TCP handshake costs an RTT. Reusing connections (keep-alive, gRPC channels, DB connection pools) is the #1 micro-optimization for backend services.

6.3 IP Basics

  • IPv4: 32-bit, ~4.3 B addresses (exhausted; NAT + CIDR keep it alive).
  • IPv6: 128-bit, effectively unlimited.
  • Static vs dynamic: services use static; clients use DHCP-assigned dynamic.
  • Public vs private: RFC1918 ranges (10.0.0.0/8, 172.16/12, 192.168/16) are private; NAT gateways translate to public.

7. 🌍 DNS, CDN, and Proxies

7.1 DNS

DNS resolves a domain name to an IP via a hierarchical lookup: stub resolver β†’ recursive resolver β†’ root β†’ TLD β†’ authoritative. Caching at every layer (browser, OS, resolver) is critical to performance.

Record types you must know:

  • A β€” domain β†’ IPv4
  • AAAA β€” domain β†’ IPv6
  • CNAME β€” alias to another name
  • MX β€” mail exchange
  • NS β€” authoritative nameservers
  • TXT β€” arbitrary text (SPF, DKIM, domain verification)
  • PTR β€” reverse lookup

TTL: the cache duration. Low TTL (60s) enables fast failover but increases lookup load. High TTL (24h) is efficient but slow to propagate changes. Production rule: low TTL on records you will fail over (api.example.com), high TTL on stable records (www.example.com).

Routing strategies via DNS:

  • Weighted round-robin (canary deploys).
  • Latency-based (Route 53).
  • Geolocation (compliance-driven).
  • Failover (active-passive).

7.2 CDN

A CDN caches static (and increasingly dynamic) content at geographically distributed PoPs. Reduces latency for the user and load on the origin.

| | Push CDN | Pull CDN |
|---|---|---|
| Trigger | You upload on change | CDN fetches on first miss |
| Storage | All content always present | Hot content cached |
| Best for | Low-traffic, infrequent updates | High-traffic, frequent changes |
| Stale risk | Until next push | Until TTL expires |

Cache key tips: include version in path or query (/v3/style.css, ?v=hash). Prefer immutable URLs + long TTLs over short TTLs + invalidation. Use stale-while-revalidate for the best of both worlds.

Edge compute (Cloudflare Workers, Lambda@Edge): A/B routing, request rewriting, light auth β€” anything that benefits from running close to the user.

7.3 Forward vs Reverse Proxy

  • Forward proxy sits in front of clients. Used for anonymity, content filtering, corporate egress, geo-bypass (VPN).
  • Reverse proxy sits in front of servers. Provides TLS termination, caching, compression, rate limiting, request rewriting, blue-green routing. Examples: Nginx, Envoy, HAProxy, Traefik.

A reverse proxy is often also a load balancer; the terms overlap when you have multiple backends. The distinction: load balancer's primary job is distribution; reverse proxy's primary job is interface unification + edge concerns.


8. βš–οΈ Load Balancing & API Gateways

8.1 Load Balancer Layers

L4 (transport): routes by IP + port. Cheap, fast, content-blind. Connection-level stickiness only. Use for: TCP services, gRPC (with care), MySQL/Redis frontends.

L7 (application): routes by HTTP path, host, header, cookie. Expensive, flexible. Can do: SSL termination, canary by header, JSON-based routing, request rewriting. Use for: web traffic, API gateways.

8.2 Algorithms

| Algorithm | Behavior | Best for |
|---|---|---|
| Round-robin | Rotate through backends | Homogeneous backends |
| Weighted round-robin | Bigger machines get more | Heterogeneous fleet |
| Least connections | Send to least-busy | Long-lived connections, websockets |
| Least response time | Send to fastest | Mixed workloads |
| IP hash / consistent hash | Same client β†’ same backend | Sticky cache, stateful sessions |
| Random / random-2-choices | Pick 2 random, choose lesser | Best general default at scale |

Power of 2 random choices outperforms round-robin under realistic latency variance.
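A minimal sketch of the algorithm; the `load` map of in-flight counts is a hypothetical stand-in for real backend metrics:

```python
import random

def pick_backend(load: dict[str, int]) -> str:
    """Power of two random choices: sample two backends at random,
    route the request to the less-loaded of the pair."""
    a, b = random.sample(list(load), 2)
    return a if load[a] <= load[b] else b

load = {"s1": 7, "s2": 3, "s3": 12}   # hypothetical in-flight request counts
print(pick_backend(load))             # "s2" two-thirds of the time; "s3" never
```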

8.3 Sticky Sessions vs Stateless

Sticky sessions tie a client to one backend. They make caching easier but break when that backend dies (session lost) or scales down. Prefer stateless services with session in Redis/JWT; use sticky only for stateful protocols (websockets) and even then expect to handle disconnects.

8.4 API Gateway

A specialized reverse proxy + L7 LB at the edge of a microservice cluster. Concerns it owns:

  • AuthN / AuthZ (JWT validation, mTLS)
  • Rate limiting and quotas
  • Request transformation (protocol bridging β€” REST β†’ gRPC)
  • Response aggregation (BFF pattern)
  • API versioning and routing
  • Observability (request logs, traces)
  • WAF / IP blocklist

Pitfall: the gateway can become a god-object. Keep business logic in services; gateway is for cross-cutting concerns.


9. πŸ—„οΈ Databases: Pick Your Engine

9.1 Decision Matrix

| Use case | Pick | Why |
|---|---|---|
| Money, inventory, identity, anything regulated | Postgres / MySQL | ACID, mature, strong constraints |
| Flexible JSON-shaped data, modest scale | Postgres (JSONB) or MongoDB | Document flexibility |
| Massive write volume, time-series, IoT | Cassandra, ScyllaDB, InfluxDB | Wide-column / TSDB |
| Sub-ms reads, ephemeral state | Redis | In-memory KV |
| Petabyte analytics | Snowflake, BigQuery, Redshift | Columnar OLAP |
| Full-text search | Elasticsearch / OpenSearch | Inverted index |
| Highly relational queries (recommendations, fraud) | Neo4j, JanusGraph | Graph traversal |
| Globally consistent + scale | Spanner, CockroachDB, YugabyteDB | Distributed SQL |

9.2 SQL (RDBMS)

Strengths: schema enforcement, joins, ACID transactions, decades of tooling, well-understood failure modes. Weaknesses: vertical scaling first, schema migrations under load, joins across shards are painful.

When stuck, try in this order before switching to NoSQL: index, denormalize, partition table, read replica, vertical scale, shard.

9.3 NoSQL Families

Key-Value (Redis, Memcached, DynamoDB, Riak)

  • O(1) get/put. No queries beyond key. Great for cache, session, leaderboard, rate limiter state.
  • Limitation: no rich query, easy to corrupt invariants by writing piecemeal.

Document (MongoDB, Couchbase, DynamoDB)

  • JSON/BSON values, queryable by field, secondary indexes.
  • Schemaless feels easy at first, painful at year 3 β€” invest in schema-on-read tooling.

Wide-Column (Cassandra, HBase, BigTable, ScyllaDB)

  • Row key + dynamic columns, sparse, sorted on disk.
  • Built for write-heavy time-series and event logs at PB scale.
  • Consistency tunable per query (R+W>N for strong reads).
  • Modeling rule: design tables per query, never normalize.

Graph (Neo4j, JanusGraph, Amazon Neptune)

  • First-class nodes + edges + properties. Cypher / Gremlin.
  • Killer app: many-hop relationship queries (friends-of-friends, fraud rings).

Time-Series (InfluxDB, TimescaleDB, Prometheus, Druid)

  • Optimized for (metric, timestamp, value, tags) ingestion + windowed aggregation + downsampling.

Search (Elasticsearch, OpenSearch, Solr)

  • Inverted index. Full-text + faceted search + ranking.
  • Not a primary store β€” index is rebuildable; use a real DB as source of truth.

9.4 SQL vs NoSQL β€” Selection Heuristic

Pick SQL when:

  • Schema is stable and relationships matter.
  • You need joins, multi-row transactions, or constraints.
  • Data fits comfortably on one large server (or a small cluster).

Pick NoSQL when:

  • Schema is flexible / multi-tenant.
  • Write rate exceeds what one master can absorb.
  • Access pattern is well-known and narrow (key lookup, time range).
  • Operating ACID across rows is not required.

The most expensive lesson teams learn: picking NoSQL because "we'll be web-scale" when they have 100K rows. Start SQL until measurements force change. (Pinterest, GitHub, Shopify all run massive Postgres/MySQL clusters.)

9.5 Storage Engines: B-Tree vs LSM-Tree

The choice of storage engine is the biggest single determinant of a database's read/write profile. Two families dominate.

B-Tree (Postgres, MySQL InnoDB, MongoDB WiredTiger, SQLite, Oracle)

  • In-place updates: writes mutate pages on disk via WAL + buffer pool.
  • ~2Γ— write amplification (page rewrite + WAL).
  • Read-optimized: O(log n) seek, page locality.
  • Mature ecosystem: indexing, MVCC, transactions, concurrency control built around it.

LSM-Tree (Cassandra, RocksDB, LevelDB, HBase, ScyllaDB, BigTable)

  • Append-only memtable β†’ flushed as immutable sorted files (SSTables) β†’ compacted in background.
  • Write-friendly: pure sequential I/O, no in-place updates.
  • Read amplification: a key may live across many SSTables β†’ bloom filter + per-file index narrow the search.
  • Space amplification + compaction CPU are the costs.

The amplification triangle. A storage engine optimizes at most two of: write amp, read amp, space amp. B-trees pay write amp for read perf; LSM-trees pay read+space amp for write perf.

| Workload | Pick |
|---|---|
| Read-heavy OLTP, joins, transactions | B-tree |
| Write-heavy time-series, event logs, telemetry | LSM-tree |
| Mixed but reads dominate the latency budget | B-tree |
| Append-mostly, batch-tolerant reads | LSM-tree |

Implication for design: when an interviewer says "10Γ— write rate vs read rate," that's an LSM signal even before they say "Cassandra."


10. πŸ”€ Replication, Sharding, Federation

10.1 Replication

Master-Slave (Primary-Replica)

  • One writer, many readers. Replicas serve read traffic and act as failover candidates.
  • Async replication: low write latency, replica lag, possible data loss on failover.
  • Semi-sync: wait for one replica ack β€” middle ground.
  • Sync: strong durability, write latency dominated by slowest replica.
  • Pitfall: read-your-writes anomalies β€” solve with sticky read-from-primary for a session window after a write, or version tokens.

Master-Master (Multi-Primary)

  • Both nodes accept writes. Requires conflict resolution (last-write-wins, vector clocks, CRDTs).
  • Higher availability for writes; harder correctness.

Quorum (R + W > N)

  • N replicas, write to W, read from R. If R+W>N you read at least one node that has the latest write.
  • Cassandra, Dynamo. Tune per-query for AP-vs-CP tradeoff.

10.2 Sharding (Horizontal Partitioning)

Splits data across nodes by a shard key. Five common strategies:

| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range | shard = f(range(key)) (e.g., A–F, G–M…) | Range queries fast | Hotspots if data skewed |
| Hash | shard = hash(key) % N | Even distribution | Range queries scatter; resharding rehashes everything |
| Consistent hash | Map nodes onto a ring, key β†’ next node clockwise | Minimal movement on add/remove | More complex |
| Directory | Lookup table from key β†’ shard | Maximum flexibility | Lookup service is SPOF; extra hop |
| Geographic | Shard by user region | Latency wins | Cross-region traffic harder |

Shard key selection β€” the most important decision:

  • Cardinality: millions of distinct values, not dozens.
  • Even access: no celebrity hot key (e.g., a global counter).
  • Query alignment: queries should be answerable from one shard whenever possible.
  • Mutability: key must not change.

Examples: (user_id, created_at) for chat messages, (tenant_id, doc_id) for SaaS, (date, event_id) for events.

Resharding is the hardest operational problem. Plan for it from day one β€” version your shard map, build a backfill pipeline, accept dual-writes during migration.

10.3 Federation (Functional Partitioning)

Split the database by domain, not by rows: users_db, orders_db, inventory_db. Each owned by one team.

  • Pro: clean ownership, independent schema evolution, smaller blast radius.
  • Con: cross-domain joins now require app-level fan-out or duplication.
  • Plays well with microservices (one DB per service).

10.4 Consistent Hashing

Place nodes at hashed positions on a 0…2^32 ring. A key maps to the first node clockwise from hash(key).

  • Adding a node moves only ~K/N keys (the slice between predecessor and new node).
  • Virtual nodes: each physical node owns many ring positions β€” smooths distribution and prevents hotspots when nodes differ in capacity.
  • Used by Memcached client-side, Cassandra, DynamoDB, Discord routing layer.

10.5 Replication + Sharding Combined

Real systems do both. Each shard is itself a replica set (e.g., 3-node Raft group). A 100-shard cluster is 300 nodes. The shard map says "key X lives on shard 7"; the replica set says "shard 7 is hosted by nodes A/B/C with A as leader."


11. πŸ”’ Consistency, Transactions & Isolation

11.1 Consistency Spectrum

From weakest to strongest:

  1. Eventual β€” replicas converge given no new writes.
  2. Read-your-writes β€” a client sees its own writes immediately.
  3. Monotonic reads β€” once seen, never see older.
  4. Causal β€” writes that are causally related are observed in order.
  5. Sequential β€” all clients agree on a single order.
  6. Linearizable β€” operations appear instantaneous and totally ordered (real-time).
  7. Strict serializable β€” linearizable + serializable across multi-key transactions.

Most user-facing systems need read-your-writes + monotonic. Linearizability is reserved for leader election, locking, and money.

11.2 Transaction Isolation Levels (SQL)

| Level | Dirty read | Non-repeatable read | Phantom read |
|---|---|---|---|
| Read uncommitted | possible | possible | possible |
| Read committed (default in Postgres, Oracle) | no | possible | possible |
| Repeatable read (default in MySQL InnoDB) | no | no | possible* |
| Snapshot isolation | no | no | no (but write skew possible) |
| Serializable | no | no | no |

* InnoDB's "repeatable read" is actually snapshot isolation in practice.

Anomalies to know:

  • Lost update β€” two read-modify-writes overwrite each other. Fix: SELECT FOR UPDATE, optimistic locking with version, atomic increment.
  • Write skew β€” two transactions read overlapping data, write disjoint data, both commit, breaking an invariant. Only serializable prevents.

11.3 Distributed Transactions

Two-Phase Commit (2PC)

  • Coordinator: PREPARE β†’ all participants vote β†’ if all yes, COMMIT.
  • Atomic, simple to reason about.
  • Blocking: if coordinator dies after PREPARE, participants are stuck holding locks.
  • Fine within one datacenter for short transactions; bad across services or WAN.

Three-Phase Commit (3PC)

  • Adds pre-commit phase to be non-blocking.
  • Theoretically nicer, rarely used in practice.

Saga Pattern (the modern answer)

  • A transaction = a sequence of local transactions, each with a compensating undo.
  • Two flavors:
    • Choreography: services emit events; downstream services react and emit their own.
    • Orchestration: a saga coordinator (state machine) drives the flow.
  • Choose orchestration for >3 steps or complex error paths.

TCC (Try-Confirm-Cancel)

  • Reservation-style: each service "tries" (reserves), then orchestrator either "confirms" or "cancels" all.
  • Stronger than saga (no observed in-between state) but more invasive on services.

Outbox Pattern (must-know companion)

  • Atomically write business state + event row in same DB transaction; a separate process publishes the event row to the bus.
  • Solves the "service updated DB but failed to publish event" problem without distributed transactions.

11.4 Consensus

Paxos / Multi-Paxos β€” the original. Hard to understand, hard to implement. Raft β€” the practical replacement. Used by etcd, Consul, CockroachDB, TiKV. ZAB β€” Zookeeper's variant.

You almost never implement consensus yourself. You use a library (etcd, Zookeeper, Consul) for: leader election, distributed locks, configuration, service discovery, group membership.

Consensus is expensive. Don't put it in the request hot path. Use it for control-plane decisions (who's leader, what's the shard map), then let data-plane traffic flow without consensus on every request.

11.5 Idempotency: A First-Class Design Concern

"At-least-once delivery + idempotent handler" is the practical pattern that replaces the unattainable "exactly once." It also defends against client retries, browser double-clicks, network timeouts, and message-bus redeliveries.

The canonical recipe:

  1. Client generates a UUID per logical operation; sends it as Idempotency-Key header (Stripe pattern).
  2. Server checks a dedup store (Redis, DB table) keyed by (tenant_id, idempotency_key):
    • Present + complete β†’ return the stored response verbatim.
    • Present + in-flight β†’ return 409 Conflict, or block-and-wait.
    • Absent β†’ mark in-flight, perform operation, store the response.
  3. TTL the dedup record (24 h–7 d typical).
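A minimal sketch of this recipe; a plain dict stands in for the dedup store, whereas production needs an atomic set-if-absent (e.g., Redis SETNX) plus a TTL:

```python
IN_FLIGHT = object()
store = {}   # stand-in for Redis, keyed by (tenant_id, idempotency_key)

def handle(tenant_id, idem_key, operation):
    k = (tenant_id, idem_key)
    if k in store:
        if store[k] is IN_FLIGHT:
            return 409, "in flight"     # or block and wait for the result
        return 200, store[k]            # replay the stored response verbatim
    store[k] = IN_FLIGHT                # production: atomic SETNX with TTL
    try:
        result = operation()
        store[k] = result               # store response for future retries
        return 201, result
    except Exception:
        del store[k]                    # let the client retry cleanly
        raise

print(handle("t1", "uuid-1", lambda: "charged $5"))   # (201, 'charged $5')
print(handle("t1", "uuid-1", lambda: "charged $5"))   # (200, ...) replayed
```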

Per-operation kind:

  • Create: dedup by client key.
  • Increment / counter: convert to "set value if event_id not seen" (event log + materialized counter), or use natively idempotent commands (SETNX, INCR with seen-set guard).
  • External call (charge card, send email): wrap in dedup table. Record provider's response so retry returns identical payload.
  • Stream processing: dedup by (producer_id, sequence_number) or unique event ID. Kafka transactional producer + offset commits give end-to-end exactly-once within Kafka.
  • HTTP PUT: semantically idempotent already β€” full replacement, repeatable.

Fencing tokens (for distributed locks): every write carries a monotonically increasing token (issued by lock service). Storage rejects writes with stale tokens. Defends against zombie clients holding expired locks (the classic Redis Redlock failure mode).

Hot-take: if your design has a POST without an idempotency-key story, the design has a bug.


12. ⚑ Caching

12.1 Layers (in order, from client to disk)

  1. Browser cache β€” HTTP cache headers, service workers.
  2. CDN β€” geographic edge.
  3. Reverse proxy / web server cache β€” Varnish, Nginx.
  4. Application cache β€” Redis, Memcached.
  5. Database query cache / buffer pool β€” Postgres shared_buffers.
  6. OS page cache β€” Linux page cache.

Each level is faster + smaller than the next. Cache hits compound: a 90% hit rate at three layers = 99.9% of requests never reach the DB.

12.2 Cache Patterns (Read)

Cache-aside (lazy loading) β€” most common.

```
GET key in cache?
  yes β†’ return cached
  no  β†’ read from DB β†’ write to cache β†’ return
```

  β€’ Pro: only requested data is cached. Resilient to cache failures.
  β€’ Con: cold-cache spikes. Stale data unless TTL or invalidation.

Read-through β€” same effect, but the cache library does the DB read on miss. App only talks to cache.

Refresh-ahead β€” cache proactively refreshes hot keys before TTL. Reduces tail latency for predictable hot keys.

12.3 Cache Patterns (Write)

| Pattern | Order | Pro | Con |
|---|---|---|---|
| Write-through | App β†’ cache β†’ DB (sync) | Fresh cache, no loss | Slow writes |
| Write-around | App β†’ DB; cache filled lazily on read | Fast writes | First read slow |
| Write-behind / write-back | App β†’ cache β†’ DB (async batch) | Fast writes, batchable | Risk of loss on cache crash |

12.4 Eviction Policies

| Policy | Behavior | Best for |
|---|---|---|
| LRU | Evict least recently used | General-purpose default |
| LFU | Evict least frequently used | Long-lived hot keys |
| FIFO | Evict oldest inserted | Simple, but rarely best |
| TTL | Evict on expiry | Time-bounded data |
| Random / 2-random | Pick random victim | Low-overhead approximation |

Production caches usually combine TTL + LRU.

12.5 Invalidation β€” "the second hardest problem in CS"

Strategies:

  • TTL β€” cheapest, eventually consistent, accept staleness.
  • Write-through β€” synchronous correctness, write cost.
  • Explicit invalidation on write β€” app deletes cache key after DB write. Race condition: if another process repopulates between your write and delete, you cache stale. Mitigations: delete-then-write order, double-delete with delay, bump version key.
  • Versioned keys β€” user:123:v42. Update a version pointer atomically; old keys age out.
  • Pub/sub invalidation β€” DB CDC stream broadcasts invalidations.

12.6 Common Pitfalls

  • Thundering herd: TTL expires under load, every request hits DB simultaneously. Fix: jittered TTL, single-flight (one request fills, others wait), early refresh.
  • Cache stampede on cold start: warm-up script before traffic shift; tiered caches.
  • Cache penetration: queries for non-existent keys bypass cache and hit DB. Fix: cache the "not found" result, or use a bloom filter.
  • Cache avalanche: mass simultaneous expiry. Fix: random jitter on TTL.
  • Hot key: one celebrity key overwhelms one shard. Fix: replicate across N keys, split the key, in-process LRU on app servers.

(... to be continued ...) Read part 2 here https://viblo.asia/p/the-system-design-playbook-part-2-y0VGwOx7VPA


If you found this helpful, let me know by leaving a πŸ‘ or a comment, and if you think this post could help someone, feel free to share it! Thank you very much! πŸ˜ƒ

