Scaling is one of the most fundamental concepts in system design. When your application grows and starts receiving more traffic, you need to scale your system to handle the increased load. There are two primary approaches to scaling: horizontal scaling (scaling out) and vertical scaling (scaling up).
Understanding when and how to use each approach is crucial for designing systems that can grow efficiently. This decision impacts not just performance, but also cost, reliability, and operational complexity.
Why This Matters
Every system design interview at FAANG companies will touch on scaling. Interviewers expect you to understand the trade-offs between horizontal and vertical scaling, and to make informed decisions based on the specific requirements of the system you're designing.
What is Vertical Scaling (Scaling Up)?
Vertical scaling, also known as "scaling up," involves adding more power (CPU, RAM, storage) to your existing server or machine. Instead of adding more machines, you upgrade the hardware of your current machine.
Vertical Scaling (Scaling Up)
Same server, more powerful hardware
How It Works
You start with a server that has, for example, 4 CPU cores and 16GB RAM
When you need more capacity, you upgrade to 8 CPU cores and 32GB RAM
Or upgrade to 16 CPU cores and 64GB RAM
The application code typically doesn't need to change
You're essentially making your single server more powerful
Real-World Example: AWS EC2 Instance Upgrade
You're running your application on an AWS EC2 t3.medium instance (2 vCPUs, 4GB RAM). As traffic increases, you upgrade to t3.xlarge (4 vCPUs, 16GB RAM), then to m5.2xlarge (8 vCPUs, 32GB RAM). This is vertical scaling - you're making the same instance more powerful.
Vertical Scaling: Instance Upgrade Journey
Key Insight: Vertical scaling is simple - just upgrade the hardware. No code changes, no architectural changes.
But you're limited by the most powerful hardware available, and you still have a single point of failure.
Advantages of Vertical Scaling
Simplicity: No code changes required in most cases. Your application continues to run on a single machine.
No Data Distribution: All data remains on one machine, so you don't need to worry about data partitioning or synchronization.
Lower Latency: No network calls between servers, so inter-process communication is faster.
Easier to Implement: Just upgrade hardware or move to a larger cloud instance.
Better for Stateful Applications: Applications that maintain state in memory work well with vertical scaling.
Disadvantages of Vertical Scaling
Hardware Limits: There's a physical limit to how powerful a single machine can be. You can't infinitely upgrade a server.
Single Point of Failure: If your one powerful server fails, your entire system goes down.
Downtime During Upgrades: Upgrading hardware often requires taking the server offline.
Hardware Cost Scaling: More powerful hardware becomes disproportionately more expensive. At scale, horizontal scaling with commodity hardware is typically more economical.
Limited Scalability: You can only scale as much as the most powerful available hardware allows.
What is Horizontal Scaling (Scaling Out)?
Horizontal scaling, also known as "scaling out," involves adding more machines or servers to your system. Instead of making one machine more powerful, you add more machines and distribute the load across them.
Horizontal Scaling (Scaling Out)
Multiple servers sharing the load
How It Works
You start with 1 server handling all requests
As traffic increases, you add a 2nd server, then a 3rd, 4th, and so on
Requests are distributed across all servers using a load balancer
Each server runs the same application code
You're essentially creating a cluster of servers
Real-World Example: Web Application Cluster
Your web application starts with 1 server. As users grow, you add 2 more servers behind a load balancer. Now you have 3 servers sharing the load. When traffic spikes, you can quickly add 5 more servers (total 8) to handle the surge. This is horizontal scaling - you're adding more machines rather than making one machine more powerful.
Scaling Journey: From 1 to 8 Servers
Key Insight: Each server maintains the same capacity (1K req/s), but total capacity grows linearly with the number of servers.
The load balancer automatically distributes traffic evenly, and you can add or remove servers without downtime.
Advantages of Horizontal Scaling
Nearly Unlimited Scalability: You can keep adding servers as needed (within cloud provider quotas), so capacity can grow far beyond what any single machine allows.
High Availability: If one server fails, others continue serving traffic. The system remains operational.
Cost Efficiency at Scale: Commodity hardware is typically more cost-effective at scale. 10 servers with 4GB RAM each often provide better price/performance than 1 server with 40GB RAM.
No Downtime: You can add or remove servers without taking the system offline.
Better Fault Tolerance: System can survive individual server failures.
Geographic Distribution: You can distribute servers across different regions for better performance.
Disadvantages of Horizontal Scaling
Increased Complexity: You need load balancers, service discovery, distributed state management, etc.
Data Distribution Challenges: Data needs to be partitioned or replicated across servers, which adds complexity.
Network Latency: Communication between servers happens over the network, which is slower than in-memory communication.
State Management: Stateless applications work best. Stateful applications require session management or external state stores.
Code Changes May Be Required: Applications may need to be refactored to work in a distributed environment.
Side-by-Side Comparison
Visual Comparison: Vertical vs Horizontal Scaling
Left: One server getting more powerful | Right: Multiple servers sharing load
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Definition | Adding more power to the existing machine | Adding more machines to the system |
| Scalability Limit | Limited by hardware maximums | Nearly unlimited (cloud limits) |
| Scaling Efficiency | Diminishing returns (hardware limits) | Linear scaling (add more servers) |
| Fault Tolerance | Single point of failure | High availability |
| Implementation Complexity | Simple (just upgrade hardware) | Complex (load balancing, distribution) |
| Downtime During Scaling | Usually requires downtime | No downtime (can add/remove live) |
| Data Management | All data on one machine | Data distributed across machines |
| Network Calls | Minimal (mostly in-memory) | Frequent (inter-server communication) |
| Best For | Small to medium applications, stateful apps | Large applications, stateless apps, high traffic |
| Cloud Examples | AWS: upgrade EC2 instance type; GCP: upgrade VM machine type | AWS: add more EC2 instances; GCP: add more VM instances |
When to Use Each Approach
The decision between horizontal and vertical scaling isn't always clear-cut. Here's a comprehensive guide to help you make the right choice based on your specific requirements.
Use Vertical Scaling When:
Small to Medium Traffic: Your application doesn't need to handle millions of requests per second. If you're serving less than 10,000 requests/second, vertical scaling might be sufficient.
Stateful Applications: Applications that maintain significant state in memory:
Gaming servers (player state, game world state)
Real-time analytics (in-memory aggregations)
In-memory databases (Redis for complex data structures)
Machine learning inference servers (model in memory)
Simple Architecture: You want to keep the system simple and avoid distributed systems complexity. Fewer moving parts = fewer failure points.
Database Servers (Write Masters): Many databases benefit from vertical scaling initially:
PostgreSQL master: Single powerful machine for writes
MySQL master: Strong consistency requires single node
MongoDB primary: Single primary for write operations
Budget Constraints: You have limited budget and can't invest in distributed infrastructure (load balancers, service discovery, monitoring).
Low Latency Requirements: Applications where network latency between servers would be problematic.
Single-Tenant Applications: Applications serving a single organization or use case where traffic is predictable.
Example: Single-Node Database
A PostgreSQL database for a small to medium application can start with vertical scaling. As data grows, you upgrade from 16GB RAM to 64GB RAM, then to 256GB RAM. This works well until you hit hardware limits or need high availability.
# PostgreSQL Vertical Scaling Journey (Typical Path)
Phase 1: Small Scale (Vertical Scaling)
Instance: t3.medium (4GB RAM) - 10K rows
Instance: t3.xlarge (16GB RAM) - 100K rows
Instance: m5.2xlarge (32GB RAM) - 1M rows
Instance: m5.4xlarge (64GB RAM) - 10M rows
Instance: m5.8xlarge (128GB RAM) - 50M rows
Phase 2: Hit Limits (Time to Scale Horizontally)
At this point, vertical scaling becomes impractical:
- Hardware limits reached
- Single point of failure risk
- Upgrade downtime becomes unacceptable
Phase 3: Hybrid Approach (Recommended)
- Keep powerful master for writes (vertical)
- Add read replicas for reads (horizontal)
- Consider database sharding for writes (horizontal)
- Or migrate to managed service (RDS, Aurora)
Use Horizontal Scaling When:
High Traffic: You need to handle millions of requests per second (like Twitter, Facebook, Netflix). Single server can't handle the load.
Stateless Applications: Web servers, API servers that don't maintain session state:
REST APIs
Microservices
Static file servers
API gateways
High Availability Required: System must remain operational even if servers fail. 99.9%+ uptime requirements.
Variable Traffic: Traffic patterns are unpredictable:
Spikes during events (Super Bowl, product launches)
Daily swings between business hours and overnight traffic
Example: Stateless Web Application with Auto-Scaling
A web application serving HTTP requests is typically stateless. You can easily add more web servers behind a load balancer. When traffic increases, you add 10 more servers. When traffic decreases, you remove servers to save costs. This is the classic horizontal scaling pattern.
# Horizontal Scaling with Auto-Scaling
Base Configuration:
- Application Servers: 2x t3.medium
- Load Balancer: 1x AWS Application Load Balancer
- Auto-scaling group: 2-20 instances
Normal Traffic (9 AM - 5 PM):
- Servers running: 4
- Handles: 400 requests/second
- Cost: $120/month (servers) + $20/month (ALB)
Peak Traffic (Product Launch):
- Auto-scales up to: 15 servers
- Handles: 1,500 requests/second
- Cost: $450/month (servers) + $20/month (ALB)
Low Traffic (Midnight - 6 AM):
- Auto-scales down to: 2 servers
- Handles: 100 requests/second
- Cost: $60/month (servers) + $20/month (ALB)
# Key Benefit: Pay only for what you use
Decision Framework
Start with Vertical Scaling if: Traffic < 1,000 requests/second, budget < $500/month, team < 5 engineers, simple architecture acceptable.
Move to Horizontal Scaling when: Traffic > 5,000 requests/second, need 99.9%+ uptime, traffic is variable, team can handle operational complexity.
Use Hybrid Approach: Most production systems use both - vertical for databases (write masters), horizontal for application servers.
Real-World Examples from Top Companies
Netflix - Horizontal Scaling at Massive Scale
Netflix uses horizontal scaling extensively. They run thousands of microservices across hundreds of thousands of servers. When a popular show releases, they can quickly scale up by adding more servers to handle the traffic spike. This would be impossible with vertical scaling.
Infrastructure: AWS EC2 instances across multiple regions (horizontal scaling)
Scale: Can scale from hundreds to thousands of servers in minutes using auto-scaling
Content Delivery: Distributes content delivery across multiple regions and CDN edge locations
Microservices: Each service scales independently based on demand
Cost Optimization: Uses spot instances and auto-scaling to optimize costs
Netflix Scaling Example
When "Stranger Things" Season 4 launched, Netflix's traffic spiked by 300%. Their horizontal scaling infrastructure automatically added thousands of servers across multiple regions to handle the load. Within 15 minutes, they scaled from ~50,000 servers to ~150,000 servers. This would be impossible with vertical scaling.
Instagram - Hybrid Approach
Instagram uses a hybrid approach, which is common for large-scale applications. Their application servers scale horizontally (thousands of servers), but their database infrastructure uses both approaches strategically:
Horizontal Scaling:
Application servers (Python Django): Thousands of servers behind load balancers
Cache servers: Redis clusters with hundreds of nodes
CDN: CloudFront edge locations globally
Vertical Scaling:
Database master nodes: Very powerful machines (256GB+ RAM, 32+ CPU cores)
Why? Write operations need strong consistency and low latency
Single master avoids distributed transaction complexity
Hybrid for Reads:
Database read replicas: Multiple replicas (horizontal) for read scaling
Each replica is also vertically scaled (powerful machines)
Key Insight: Hybrid is Common
Most large-scale systems use a hybrid approach. Application servers scale horizontally, but databases often use vertical scaling for write masters and horizontal scaling (read replicas) for reads. This balances performance, consistency, and scalability.
Google Search - Horizontal Scaling at Global Scale
Google's search infrastructure is one of the largest horizontally scaled systems in the world:
Scale: Millions of servers distributed globally across hundreds of data centers
Architecture: Each data center has thousands of servers handling different functions
Traffic: Can handle billions of search queries per day (3.5+ billion searches daily)
Scaling Strategy: Uses horizontal scaling for both compute and storage
Geographic Distribution: Servers in every major region for low latency
Fault Tolerance: Can lose entire data centers without service interruption
Small SaaS Application - Vertical to Horizontal Journey
A typical small SaaS application follows this scaling journey:
Phase 1 - Vertical Scaling (Months 0-6):
Starts on a single server (e.g., DigitalOcean droplet with 2GB RAM, $12/month)
Upgrades to 4GB RAM ($24/month) as users grow
Upgrades to 8GB RAM ($48/month) when traffic increases
Simple, cost-effective, no code changes needed
Phase 2 - Hybrid (Months 6-12):
Application server: Still vertical (16GB RAM, $96/month)
Database: Separate server, vertical scaling (8GB RAM, $48/month)
Cache: Add Redis on separate server (2GB RAM, $24/month)
Phase 3 - Horizontal Migration (Months 12+):
Make application stateless (move sessions to Redis)
Add load balancer (AWS ALB, $20/month)
Deploy 3 application servers (3x 4GB = $72/month)
Add database read replicas for read scaling
Total: ~$140/month with high availability
How to Discuss Scaling in System Design Interviews
In FAANG interviews, you'll be asked about scaling strategies. Here's how to approach these discussions effectively.
Step-by-Step Interview Approach
1. Start with Requirements
# Questions to Ask Interviewer
Traffic Patterns:
- "What's the expected traffic? (QPS, concurrent users)"
- "Is traffic consistent or variable? (spikes, seasonal)"
- "What's the growth projection?"
Application Characteristics:
- "Is the application stateless or stateful?"
- "What are the read/write ratios?"
- "What are the latency requirements?"
Constraints:
- "Any budget constraints?"
- "What's the timeline for scaling?"
- "Any existing infrastructure?"
2. Analyze the Use Case
# Decision Framework
Choose Vertical Scaling When:
✅ Small to medium scale (< 10K requests/second)
✅ Stateful application (hard to make stateless)
✅ Simple architecture preferred
✅ Predictable traffic
✅ Single region deployment
Choose Horizontal Scaling When:
✅ Large scale (> 10K requests/second)
✅ Stateless or can be made stateless
✅ High availability required (99.9%+)
✅ Variable/unpredictable traffic
✅ Global/multi-region deployment
✅ Need auto-scaling
3. Discuss Trade-offs
Always mention trade-offs. Interviewers want to see you understand the implications:
Vertical Scaling: "Simple to implement, but hits hardware limits and creates single point of failure"
Horizontal Scaling: "More scalable and fault-tolerant, but requires stateless design and load balancing infrastructure"
4. Mention Hybrid Approach
Most real-world systems use a hybrid approach. Always mention this:
# Hybrid Approach Example
"Most large systems use a hybrid approach:
- Application servers: Horizontal scaling (stateless, behind load balancer)
- Database master: Vertical scaling (powerful machine for writes)
- Database replicas: Horizontal scaling (multiple read replicas)
- Cache layer: Horizontal scaling (Redis cluster)
- CDN: Horizontal scaling (edge locations globally)
This balances performance, consistency, and scalability."
Common Interview Questions
Question 1: "When would you choose vertical vs horizontal scaling?"
# Strong Answer Structure
1. Start with use case:
"It depends on the application characteristics and scale..."
2. Vertical scaling when:
- Small to medium scale
- Stateful application
- Simple architecture needed
- Predictable traffic
3. Horizontal scaling when:
- Large scale
- Stateless application
- High availability needed
- Variable traffic
4. Mention hybrid:
"In practice, most systems use both - vertical for database
masters, horizontal for application servers and reads."
5. Discuss trade-offs:
"Vertical is simpler but limited. Horizontal is more complex
but nearly unlimited scalability."
Question 2: "How would you migrate from vertical to horizontal scaling?"
# Migration Strategy Answer
1. Make application stateless:
"First, I'd move session state to external storage (Redis/database).
This is critical for horizontal scaling."
2. Add load balancer:
"Introduce a load balancer (ALB, Nginx) to distribute traffic."
3. Deploy multiple instances:
"Deploy application to multiple servers behind load balancer."
4. Scale database:
"Add read replicas for database reads, keep powerful master for writes."
5. Add caching:
"Implement caching layer (Redis) to reduce database load."
6. Monitor and optimize:
"Continuously monitor and add/remove servers based on traffic."
Question 3: "What are the challenges of horizontal scaling?"
# Challenges Answer
1. Stateless Design:
"Application must be stateless - no in-memory session state.
This may require refactoring."
2. Data Consistency:
"With multiple servers, ensuring data consistency becomes
challenging. Need to consider CAP theorem."
3. Load Balancing:
"Need load balancer infrastructure and choose right algorithm."
4. Monitoring Complexity:
"More servers means more monitoring, logging, alerting needed."
5. Network Overhead:
"Inter-server communication adds latency and network overhead."
6. Database Bottleneck:
"Application servers can scale, but database becomes bottleneck.
Need read replicas, caching, or sharding."
Red Flags to Avoid
❌ "Always use horizontal scaling" - Not true, depends on use case
❌ "Vertical scaling is never good" - Wrong, it's simpler for small scale
❌ "Cost is the only factor" - Missing complexity, operational overhead
❌ "One size fits all" - Not considering application characteristics
Key Points to Emphasize
✅ Start simple: Begin with vertical scaling, migrate when needed
✅ Hybrid approach: Most systems use both strategies
✅ Stateless first: Critical for horizontal scaling
✅ Database scaling: Don't forget database when scaling application
✅ Trade-offs: Always discuss pros and cons
✅ Real-world examples: Mention how companies actually do it
Migration Strategies
Most applications start with vertical scaling and eventually migrate to horizontal scaling. Understanding migration strategies is important for system design interviews.
Phase 1: Start with Vertical Scaling
Launch application on a single server
Monitor performance and resource usage
Upgrade server as traffic grows (vertical scaling)
Keep it simple and cost-effective for initial growth
Phase 2: Identify Scaling Bottlenecks
Monitor CPU, memory, disk I/O, network I/O
Identify which resource is the bottleneck
Determine if vertical scaling can still solve the problem
Evaluate complexity vs benefits of horizontal scaling
Phase 3: Migrate to Horizontal Scaling
Stateless First: Make application stateless (move session storage to Redis/database)
Monitor and Optimize: Continuously monitor and add/remove servers as needed
Migration Example: E-commerce Application
Phase 1: Start with single server (4GB RAM) handling 100 requests/second Phase 2: Upgrade to 16GB RAM server handling 500 requests/second Phase 3: Add load balancer + 3 servers (each 4GB RAM) handling 2000 requests/second Phase 4: Scale to 10 servers handling 10,000 requests/second Phase 5: Add database read replicas, caching layer, CDN
Trade-offs Analysis
Performance
Vertical: Lower latency (no network calls)
Horizontal: Network latency between servers
Winner: Depends on use case
Reliability
Vertical: Single point of failure
Horizontal: High availability
Winner: Horizontal scaling
Scalability
Vertical: Limited by hardware
Horizontal: Nearly unlimited
Winner: Horizontal scaling
Scaling Efficiency
Vertical: Diminishing returns, hardware limits
Horizontal: Linear scaling, nearly unlimited
Winner: Horizontal scaling (for large scale)
Complexity
Vertical: Simple implementation
Horizontal: Complex distributed system
Winner: Vertical scaling
Flexibility
Vertical: Fixed capacity
Horizontal: Dynamic scaling
Winner: Horizontal scaling
Common Mistakes to Avoid
❌ Mistake 1: Always Choosing Horizontal Scaling
Many engineers assume horizontal scaling is always better. However, for small applications, vertical scaling is simpler and more cost-effective. Don't over-engineer.
❌ Mistake 2: Ignoring State Management
When migrating to horizontal scaling, forgetting to make the application stateless is a common mistake. Session data stored in memory will be lost when requests hit different servers.
❌ Mistake 3: Not Considering Database Scaling
Scaling application servers horizontally is useless if the database becomes the bottleneck. Always consider database scaling (read replicas, sharding) when scaling application servers.
Load Balancing
Load balancing is a critical component of distributed systems architecture. When you have multiple servers handling requests, a load balancer acts as a traffic director, distributing incoming requests across available servers to ensure optimal resource utilization, maximize throughput, minimize response time, and avoid overloading any single server.
Understanding load balancing is essential for designing scalable systems. Every FAANG interview will touch on load balancing when discussing horizontal scaling, high availability, and distributed architectures. This topic covers everything from basic concepts to advanced algorithms and real-world implementations.
Why Load Balancing Matters
Without load balancing, you can't effectively scale horizontally. A load balancer is the gateway that makes multiple servers work as one cohesive system. It's the difference between a well-architected distributed system and a collection of independent servers.
What is Load Balancing?
Load balancing is the process of distributing network traffic or application requests across multiple servers (also called backend servers, upstream servers, or server pool). The component that performs this distribution is called a load balancer.
Load Balancer Architecture
Load balancer distributes client requests across multiple backend servers
Key Functions of a Load Balancer
Request Distribution: Routes incoming requests to available servers based on configured algorithms
Health Monitoring: Continuously checks server health and removes unhealthy servers from rotation
Session Management: Handles session persistence (sticky sessions) when required
SSL Termination: Can handle SSL/TLS encryption/decryption, offloading this from application servers
High Availability: Provides redundancy - if one server fails, traffic routes to healthy servers
Why Load Balancing is Essential
1. Prevents Server Overload
Without load balancing, all requests might hit a single server, causing it to become overwhelmed while other servers remain idle. A load balancer ensures even distribution of load.
Example: E-commerce Site During Sale
During Black Friday, an e-commerce site receives 10,000 requests per second. Without a load balancer, all requests hit Server 1, causing it to crash. With a load balancer distributing across 10 servers, each server handles ~1,000 requests/second, maintaining performance.
2. Enables Horizontal Scaling
Load balancing is the foundation of horizontal scaling. You can add or remove servers dynamically, and the load balancer automatically adjusts traffic distribution.
3. Provides High Availability
If a server fails, the load balancer detects it (via health checks) and stops sending traffic to it. Users experience no downtime because other servers continue handling requests.
4. Improves Performance
By distributing load, each server operates at optimal capacity. Response times improve because no single server is overwhelmed.
5. Enables Geographic Distribution
Load balancers can route traffic to servers in different geographic regions, reducing latency for users worldwide.
Load Balancing Algorithms - Deep Dive
The algorithm a load balancer uses determines how requests are distributed. Each algorithm has specific use cases, advantages, and trade-offs. Understanding these is crucial for system design interviews.
1. Round Robin
How it works: Requests are distributed sequentially across servers in rotation. Server 1 gets request 1, Server 2 gets request 2, Server 3 gets request 3, then back to Server 1 for request 4, and so on.
Round Robin Algorithm
How Round Robin Works
The load balancer maintains a simple counter that cycles through the server list. When a request arrives, it selects the server at the current index, then increments the index (wrapping around to 0 when it reaches the end). This ensures each server receives requests in a predictable, rotating pattern.
Example Flow: With 3 servers, Request 1 goes to Server 1, Request 2 to Server 2, Request 3 to Server 3, then Request 4 cycles back to Server 1, and the pattern continues. This works perfectly when all servers have identical capacity and requests take similar processing time.
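To make the counter mechanism concrete, here is a minimal round-robin selector in Python; the server names are hypothetical, and a production balancer would add locking and health awareness.
# Round Robin selector (illustrative Python sketch)
class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = servers
        self.index = 0  # counter that cycles through the server list

    def get_next_server(self):
        server = self.servers[self.index]
        self.index = (self.index + 1) % len(self.servers)  # wrap back to 0 at the end
        return server

lb = RoundRobinBalancer(["server1", "server2", "server3"])
print([lb.get_next_server() for _ in range(4)])
# ['server1', 'server2', 'server3', 'server1']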
Advantages:
Simple to implement and understand
Ensures even distribution when servers have equal capacity
No server state tracking required
Works well for stateless applications
Disadvantages:
Doesn't consider server load or capacity
Can overload a slow server if it receives requests at the same rate as fast servers
Doesn't account for request processing time differences
Not ideal when servers have different capabilities
When to Use:
All servers have identical capacity and performance
Requests are similar in processing time
Stateless applications
Simple use cases where even distribution is sufficient
2. Least Connections
How it works: The load balancer tracks the number of active connections to each server and routes new requests to the server with the fewest active connections.
Least Connections: Dynamic Load Balancing
Key Insight: The load balancer maintains a real-time count of active connections per server. When a new request arrives, it automatically routes to the server with the fewest active connections, ensuring optimal load distribution even when requests have varying processing times.
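A minimal sketch of that bookkeeping in Python; the connection counts and server names are hypothetical, and a real balancer updates the counts as connections open and close.
# Least Connections selector (illustrative Python sketch)
active_connections = {"server1": 12, "server2": 4, "server3": 9}

def get_next_server():
    # Route to the server with the fewest active connections right now
    return min(active_connections, key=active_connections.get)

def on_connection_open(server):
    active_connections[server] += 1

def on_connection_close(server):
    active_connections[server] -= 1

chosen = get_next_server()   # "server2" (only 4 active connections)
on_connection_open(chosen)   # count stays incremented for the life of the connection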
Advantages:
Adapts to varying request processing times
Better for long-lived connections (WebSockets, database connections)
Automatically handles servers with different processing speeds
More intelligent than round robin for stateful connections
Disadvantages:
Requires tracking connection state (more memory overhead)
More complex implementation
Doesn't consider server CPU/memory load, only connection count
May not be optimal if connections have very different processing requirements
3. Weighted Round Robin
How it works: Similar to round robin, but each server is assigned a weight (priority). Servers with higher weights receive more requests proportionally.
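One simple way to implement this is to expand each server into the rotation as many times as its weight, as in the Python sketch below; the weights are hypothetical, and production balancers typically use a smoother interleaving.
# Weighted Round Robin via an expanded rotation (illustrative sketch)
from itertools import cycle

weights = {"server1": 5, "server2": 3, "server3": 1}  # server1 gets 5 of every 9 requests

rotation = cycle([name for name, w in weights.items() for _ in range(w)])

def get_next_server():
    return next(rotation)

print([get_next_server() for _ in range(9)])
# ['server1', 'server1', 'server1', 'server1', 'server1', 'server2', 'server2', 'server2', 'server3']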
4. IP Hash
How it works: The load balancer hashes the client's IP address and uses that hash to determine which server handles the request. The same IP always routes to the same server (unless the server pool changes).
Advantages:
Provides session persistence (sticky sessions) without additional configuration
Simple to implement
Deterministic routing (same IP = same server)
Useful for caching scenarios
Disadvantages:
Uneven distribution if IP addresses aren't uniformly distributed
Problems with NAT (multiple users behind one IP)
Server removal/addition causes hash redistribution
Can create hotspots if many requests come from same IP
When to Use:
When you need session persistence
Applications with server-side caching based on client
When client IP distribution is relatively uniform
Stateful applications requiring sticky sessions
5. Least Response Time
How it works: Routes requests to the server with the lowest average response time. The load balancer continuously monitors response times from each server.
# Least Response Time Implementation (Python sketch)
servers = [
    {"id": "server1", "avg_response_time_ms": 50, "active_requests": 10},
    {"id": "server2", "avg_response_time_ms": 30, "active_requests": 5},
    {"id": "server3", "avg_response_time_ms": 80, "active_requests": 15},
]

def get_next_server():
    # Find the server with the minimum average response time
    return min(servers, key=lambda s: s["avg_response_time_ms"])

# get_next_server() → server2 (30 ms is the lowest)
# After each response completes, update that server's avg_response_time_ms
Advantages:
Automatically adapts to server performance
Routes to fastest-responding servers
Handles varying server loads dynamically
Optimizes user experience (lowest latency)
Disadvantages:
Requires continuous monitoring and metrics collection
More complex implementation
Response time measurements can be noisy
May cause oscillation if response times fluctuate
When to Use:
When response time is critical
Servers with varying performance characteristics
Real-time applications requiring low latency
When you have good monitoring infrastructure
6. Consistent Hashing
Consistent hashing minimizes key redistribution when servers are added or removed. Unlike simple hashing (`hash(key) % num_servers`), which remaps all keys when server count changes, consistent hashing only remaps ~1/n keys (where n = number of servers).
How It Works
Consistent hashing uses a hash ring - a circle with hash values from 0 to 2^32-1. Both servers and keys are hashed and placed on this ring. To find which server handles a key:
Hash the key to get a position on the ring
Move clockwise from that position
The first server encountered handles that key
Consistent Hashing - Hash Ring
Keys hash to ring positions, route to next server clockwise. Removing a server only affects keys between that server and the previous one.
Virtual Nodes
Without virtual nodes, servers might cluster on the ring, causing uneven distribution. Virtual nodes solve this: each physical server is represented by multiple positions on the ring (typically 150-200 per server). This ensures even distribution.
Example: With 3 servers and 150 virtual nodes each, each server gets ~33% of the ring (150/450 positions). Industry standard is 150-200 virtual nodes per server (used by DynamoDB, Cassandra).
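A compact hash-ring sketch in Python using only the standard library; the MD5 hash and the virtual-node count here are illustrative choices, not what any particular system uses.
# Consistent hashing ring with virtual nodes (illustrative sketch)
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=150):
        self.ring = {}            # ring position -> physical server
        self.sorted_keys = []     # sorted positions for clockwise lookup
        for server in servers:
            for i in range(vnodes):
                pos = self._hash(f"{server}#{i}")
                self.ring[pos] = server
                bisect.insort(self.sorted_keys, pos)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

    def get_server(self, key):
        pos = self._hash(key)
        # First ring position clockwise from the key (wrapping around to 0)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["server1", "server2", "server3"])
print(ring.get_server("user:12345"))  # the same key always maps to the same server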
Advantages:
Minimal Redistribution: Adding/removing a server only remaps ~1/n keys (vs 100% with simple hashing)
Scalability: Works efficiently with hundreds of servers
Fault Tolerance: Failed servers' keys automatically move to next server
Industry Proven: Used by DynamoDB, Cassandra, Memcached, Riak
Disadvantages:
More complex than simple hashing (requires sorted ring structure)
Virtual nodes required for even distribution (adds memory overhead)
Distribution is probabilistic, not perfectly even
When to Use:
Distributed caching (Redis, Memcached)
Database sharding (Cassandra, DynamoDB)
CDN edge server selection
Load balancing with frequent server changes
7. Rendezvous Hashing (Highest Random Weight)
Rendezvous hashing is an alternative to consistent hashing. Instead of a hash ring, it calculates a weight for each server-key pair and selects the server with the highest weight.
How It Works
For a given key, calculate `hash(server + key)` for each server. The server with the highest hash value handles that key. This is deterministic - the same key always goes to the same server.
Adding a server only remaps the keys for which the new server now has the highest weight (roughly 1/(n+1) of them), and those keys move only to the new server; nothing shuffles between existing servers. This makes it attractive for caches, since most entries stay put and avoid invalidation.
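A minimal highest-random-weight selector in Python; MD5 is used here purely for illustration as the combining hash.
# Rendezvous hashing: the server with the highest hash(server + key) wins
import hashlib

def weight(server, key):
    return int(hashlib.md5(f"{server}:{key}".encode()).hexdigest(), 16)

def get_server(servers, key):
    # O(n) per lookup, but no ring structure or virtual nodes required
    return max(servers, key=lambda s: weight(s, key))

servers = ["server1", "server2", "server3"]
print(get_server(servers, "user:12345"))
# Adding "server4" moves only the keys for which server4 now has the highest weight;
# every other key keeps its current server.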
Rendezvous vs Consistent Hashing
Rendezvous: Simpler (no ring, no virtual nodes), O(n) lookup, good for < 50 servers
Consistent: More complex (ring + virtual nodes), O(log n) lookup, better for 100+ servers
When to Use:
Small to medium clusters (< 50 servers)
Frequent server additions (existing keys stay put)
Memory-constrained environments
Simpler implementation preferred
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different OSI model layers. Understanding the difference is crucial for system design interviews and real-world implementations.
Layer 4 (L4) Load Balancing - Transport Layer
Also known as: Network Load Balancing, Connection-level load balancing
How it works: Makes routing decisions based on information from the transport layer (TCP/UDP). It looks at source IP, destination IP, source port, and destination port. It does NOT inspect the application data (HTTP headers, URLs, etc.).
# Layer 4 Load Balancing Decision
Packet Information:
Source IP: 192.168.1.100
Source Port: 54321
Dest IP: 10.0.0.1 (Load Balancer)
Dest Port: 80
Protocol: TCP
# Load balancer makes decision based ONLY on:
# - Source IP/Port
# - Destination IP/Port
# - Protocol
# Does NOT see:
# - HTTP method (GET, POST)
# - URL path (/api/users, /api/products)
# - HTTP headers
# - Request body
Characteristics:
Speed: Very fast - minimal processing overhead
Transparency: Backend servers see original client IP (if using direct server return)
Protocol Agnostic: Works with any TCP/UDP protocol (HTTP, HTTPS, FTP, SMTP, etc.)
Simple: Less complex, fewer features
No Content Awareness: Cannot route based on URL, headers, or content
Use Cases:
Simple round-robin or least-connections distribution
When application-level routing isn't needed
Examples:
AWS Network Load Balancer (NLB)
HAProxy in TCP mode
F5 BIG-IP (LTM in Layer 4 mode)
Linux Virtual Server (LVS)
Layer 7 (L7) Load Balancing - Application Layer
Also known as: Application Load Balancing, Content-based load balancing, HTTP(S) load balancing
How it works: Makes routing decisions based on application-layer data. For HTTP/HTTPS, it can inspect URLs, HTTP headers, cookies, and request content. It terminates the connection and creates a new one to the backend server.
# Layer 7 Load Balancing Decision
HTTP Request:
Method: GET
URL: /api/users/123
Headers: Cookie: session_id=abc123
User-Agent: Mozilla/5.0
Body: (empty for GET)
# Load balancer can make decisions based on:
# - URL path (/api/users → user-service, /api/products → product-service)
# - HTTP method (GET → read-replica, POST → master)
# - Headers (Cookie → sticky session, Authorization → specific backend)
# - Query parameters
# - Request body content
# Can also:
# - Modify requests (add headers, rewrite URLs)
# - Terminate SSL/TLS
# - Perform content-based routing
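To make those rules concrete, here is a small Python sketch of path- and method-based routing; the service pools and rules are hypothetical, not a real load balancer configuration.
# Content-based (L7) routing rules (illustrative sketch)
ROUTES = [
    {"prefix": "/api/users",    "pool": "user-service"},
    {"prefix": "/api/products", "pool": "product-service"},
]

def route_request(method, path):
    for rule in ROUTES:
        if path.startswith(rule["prefix"]):                   # URL-prefix routing
            suffix = "read" if method == "GET" else "write"   # method-based split
            return f"{rule['pool']}-{suffix}"
    return "default-pool"

print(route_request("GET", "/api/users/123"))   # user-service-read
print(route_request("POST", "/api/products"))   # product-service-write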
Characteristics:
Intelligent Routing: Can route based on content, URL, headers
Content Modification: Can modify requests/responses (header injection, URL rewriting)
Higher Overhead: More processing required (parsing HTTP, inspecting content)
Backend Sees LB IP: Backend servers see load balancer IP, not client IP (unless X-Forwarded-For header is used)
Use Cases:
Microservices architecture (route /api/users to user-service, /api/orders to order-service)
Content-based routing (route based on URL patterns)
When you need SSL termination at the load balancer
API gateway functionality
When you need to inspect or modify HTTP content
Cookie-based session persistence
Examples:
AWS Application Load Balancer (ALB)
NGINX (as reverse proxy/load balancer)
HAProxy in HTTP mode
F5 BIG-IP (LTM in Layer 7 mode)
Traefik
Kong API Gateway
Comparison Table
| Aspect | Layer 4 (L4) | Layer 7 (L7) |
|---|---|---|
| OSI Layer | Transport Layer (4) | Application Layer (7) |
| Decision Based On | IP addresses, ports, protocol | URL, HTTP headers, content |
| Performance | Very fast (lower latency) | Slower (higher latency due to processing) |
| Protocol Support | Any TCP/UDP protocol | Primarily HTTP/HTTPS |
| SSL Termination | No (pass-through) | Yes (can terminate SSL) |
| Content Awareness | No | Yes |
| Routing Flexibility | Limited (IP/port based) | High (content-based routing) |
| Client IP Visibility | Yes (with DSR) | No (requires X-Forwarded-For) |
| Use Case | High throughput, simple distribution | Microservices, content routing, API gateway |
| Examples | AWS NLB, HAProxy TCP mode | AWS ALB, NGINX, HAProxy HTTP mode |
Hybrid Approach: L4 + L7
Many production systems use both layers. L4 load balancer distributes traffic to L7 load balancers, which then perform content-based routing to application servers. This provides both high throughput (L4) and intelligent routing (L7).
Health Checks and Failover
Health checks are critical for load balancer reliability. They continuously monitor backend servers and automatically remove unhealthy servers from the rotation, ensuring users only hit healthy servers.
How Health Checks Work
The load balancer periodically sends health check requests to each backend server. Based on the response, it determines if the server is healthy or unhealthy.
# Health Check Flow
1. Load Balancer sends HTTP GET /health to Server 1
2. Server 1 responds with:
- Status: 200 OK (healthy)
- Response time: 10ms
- Body: {"status": "healthy", "database": "connected"}
3. Load Balancer evaluates:
- Response code: 200 (✓)
- Response time: 10ms < 50ms threshold (✓)
- Response body contains "healthy" (✓)
→ Server 1 marked as HEALTHY
4. If Server 1 fails health check:
- Response: 500 Internal Server Error
- OR: Timeout (no response in 5 seconds)
- OR: Response time > 50ms
→ Server 1 marked as UNHEALTHY
→ Removed from rotation
→ Traffic stops routing to Server 1
2. TCP Health Checks
How it works: The load balancer attempts to open a TCP connection to the server's port.
Success Criteria: Connection established successfully
Use Case: Simple connectivity check, faster than HTTP
Limitation: Doesn't verify the application is actually working
3. Custom Health Checks
Can use any protocol or method
Script-based health checks
External monitoring integration
Health Check Configuration Parameters
| Parameter | Description | Typical Values |
|---|---|---|
| Interval | How often to check | 10-60 seconds |
| Timeout | Max time to wait for response | 2-10 seconds |
| Unhealthy Threshold | Consecutive failures to mark unhealthy | 2-3 failures |
| Healthy Threshold | Consecutive successes to mark healthy | 2-3 successes |
| Path | Health check endpoint | /health, /healthz, /ping |
| Expected Status | HTTP status code for healthy | 200 OK |
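The threshold parameters translate into a small state machine per target. Below is a Python sketch of that logic using the threshold values from the table; the class shape and values are illustrative.
# Health check state tracking per target (illustrative sketch)
UNHEALTHY_THRESHOLD = 2   # consecutive failures before removal from rotation
HEALTHY_THRESHOLD = 2     # consecutive successes before re-adding

class TargetHealth:
    def __init__(self):
        self.healthy = True
        self.fail_streak = 0
        self.success_streak = 0

    def record_check(self, passed):
        if passed:
            self.success_streak += 1
            self.fail_streak = 0
            if not self.healthy and self.success_streak >= HEALTHY_THRESHOLD:
                self.healthy = True    # put the target back into rotation
        else:
            self.fail_streak += 1
            self.success_streak = 0
            if self.healthy and self.fail_streak >= UNHEALTHY_THRESHOLD:
                self.healthy = False   # stop routing traffic to this target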
Failover Mechanisms
1. Automatic Failover
When a server fails health checks, the load balancer automatically removes it from rotation. No manual intervention required.
2. Graceful Degradation
If all servers in a pool become unhealthy, the load balancer can:
Return 503 Service Unavailable
Route to backup servers (if configured)
Serve cached responses (if available)
Display maintenance page
3. Circuit Breaker Pattern
Advanced load balancers implement circuit breakers: if a server fails repeatedly, it's marked as "circuit open" and not checked for a cooldown period, reducing unnecessary load.
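A sketch of that cooldown behavior in Python; the failure limit and cooldown period are illustrative values.
# Circuit breaker for a repeatedly failing backend (illustrative sketch)
import time

FAILURE_LIMIT = 5
COOLDOWN_SECONDS = 30

class Circuit:
    def __init__(self):
        self.failures = 0
        self.open_until = 0.0   # while "open", the backend is skipped entirely

    def allow_traffic(self):
        return time.time() >= self.open_until

    def record_failure(self):
        self.failures += 1
        if self.failures >= FAILURE_LIMIT:
            self.open_until = time.time() + COOLDOWN_SECONDS  # open the circuit
            self.failures = 0

    def record_success(self):
        self.failures = 0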
Real-World Example: AWS ALB Health Checks
AWS Application Load Balancer health check configuration:
# AWS ALB Health Check Configuration
Health Check Protocol: HTTP
Health Check Port: traffic-port (uses same port as target)
Health Check Path: /health
Interval: 30 seconds
Timeout: 5 seconds
Healthy Threshold: 2 consecutive successes
Unhealthy Threshold: 2 consecutive failures
Success Codes: 200
# Behavior:
# - Checks every 30 seconds
# - If 2 consecutive checks fail → mark unhealthy
# - If 2 consecutive checks succeed → mark healthy
# - Unhealthy targets receive no traffic
# - Health checks continue even when unhealthy
Load Balancer Types
Load balancers come in different forms: hardware appliances, software solutions, and cloud-managed services. Each has its place in system architecture.
1. Hardware Load Balancers
Definition: Physical appliances dedicated to load balancing. These are specialized devices optimized for high-performance traffic distribution.
Examples:
F5 BIG-IP: Industry-leading hardware load balancer, supports L4 and L7
2. Software Load Balancers
Definition: Load balancing software that runs on standard servers or VMs. Examples include NGINX, HAProxy, Traefik, and Envoy.
3. Cloud-Managed Load Balancers
Definition: Load balancing offered as a fully managed service by cloud providers.
Examples:
AWS Elastic Load Balancing:
Classic Load Balancer (CLB): Legacy, supports both L4 and L7 (being phased out)
Gateway Load Balancer: For deploying third-party virtual appliances
Google Cloud Load Balancers:
Global HTTP(S) Load Balancer: L7, global distribution
Global TCP/UDP Load Balancer: L4, global distribution
Regional Load Balancer: Regional distribution
Internal Load Balancer: For internal traffic within VPC
Azure Load Balancers:
Azure Load Balancer: L4, regional
Application Gateway: L7, web application firewall
Traffic Manager: DNS-based global load balancing
Advantages:
Fully managed (no infrastructure to maintain)
Automatically scalable
High availability built-in
Integrated with cloud services
Pay-as-you-go model (scales with usage)
Global distribution capabilities
Disadvantages:
Vendor lock-in
Ongoing costs (can be expensive at scale)
Less control over configuration
Limited customization compared to self-hosted
When to Use:
Cloud-native applications
When you want to minimize operations overhead
Microservices on cloud platforms
When you need global distribution
Serverless architectures
AWS ALB vs NLB: When to Use Each
| Feature | Application Load Balancer (ALB) | Network Load Balancer (NLB) |
|---|---|---|
| Layer | Layer 7 (Application) | Layer 4 (Network) |
| Performance | High (millions of requests/sec) | Ultra-high (millions of connections/sec) |
| Added Latency | Higher (milliseconds of request processing) | Lower (sub-millisecond in typical cases) |
| Content-Based Routing | Yes (URL, headers, host) | No |
| SSL Termination | Yes | Yes |
| Use Case | Microservices, HTTP/HTTPS apps | High-performance TCP/UDP, gaming |
| Cost | $0.0225 per ALB-hour + $0.008 per LCU-hour | $0.0225 per NLB-hour + $0.006 per LCU-hour |
Session Persistence (Sticky Sessions)
Session persistence, also called "sticky sessions," ensures that requests from the same client are always routed to the same backend server. This is necessary for stateful applications that store session data on the server.
Why Session Persistence is Needed
In a stateless application, any server can handle any request. But if your application stores session data (like shopping cart, user preferences, temporary data) in server memory, you need to ensure the same user always hits the same server.
Problem Without Session Persistence
User adds item to cart on Server 1 (cart stored in Server 1's memory). User makes another request, but load balancer routes to Server 2. Server 2 doesn't have the cart data → user sees empty cart!
Session Persistence Methods
1. Cookie-Based Session Persistence
Load balancer sets a cookie containing server identifier. Subsequent requests include this cookie, and the load balancer routes to the specified server.
# Cookie-Based Session Persistence Flow
1. First Request:
Client → Load Balancer → Server 1
Server 1 responds, Load Balancer sets cookie:
Set-Cookie: AWSALB=server1_hash; Path=/
2. Subsequent Requests:
Client → Load Balancer (includes cookie: AWSALB=server1_hash)
Load Balancer reads cookie → routes to Server 1
# If Server 1 becomes unhealthy:
# Load balancer removes cookie, routes to healthy server
# New session starts on different server
2. IP Hash-Based Persistence
Uses client IP address to determine which server handles requests. Same IP always routes to same server (as discussed in algorithms section).
3. Application-Controlled Session Affinity
Application embeds server identifier in responses (URLs, forms). Load balancer reads this and routes accordingly.
Trade-offs of Session Persistence
Advantages:
Enables stateful applications
Server-side session storage possible
Can optimize caching per user
Disadvantages:
Uneven load distribution (some servers may get more traffic)
Server failure causes session loss
Difficult to scale (can't easily add/remove servers)
Breaks horizontal scaling benefits
Best Practice: Make Applications Stateless
Instead of using sticky sessions, modern best practice is to make applications stateless:
Store session data in external store (Redis, database, Memcached)
Any server can handle any request
True horizontal scaling
No session loss on server failure
# Stateless Application Pattern
# Session data stored in Redis (external), not in server memory
import redis

class ShoppingCart:
    def __init__(self, session_id):
        self.session_id = session_id
        self.redis = redis.Redis()

    def add_item(self, item):
        # Store in Redis, not server memory
        self.redis.hset(f"cart:{self.session_id}", item.id, item.data)

    def get_items(self):
        # Retrieve from Redis
        return self.redis.hgetall(f"cart:{self.session_id}")

# Any server can handle any request
# No sticky sessions needed
# Horizontal scaling works perfectly
Real-World Examples
Netflix - Multi-Layer Load Balancing
Netflix uses a sophisticated multi-layer load balancing architecture:
DNS Load Balancing: Routes users to nearest data center
L4 Load Balancers: Distribute traffic across regions
L7 Load Balancers: Route to specific microservices
Service Mesh: Envoy proxies handle inter-service communication
This architecture handles billions of requests per day across thousands of microservices, integrated with AWS services (CloudWatch, Auto Scaling) for monitoring and capacity management.
Google - Global Load Balancing
Google's infrastructure uses global load balancing:
Anycast IP addresses route to nearest data center
Multi-region load balancing for redundancy
Intelligent routing based on latency and capacity
Handles millions of queries per second
Common Mistakes to Avoid
❌ Mistake 1: Using Sticky Sessions Unnecessarily
Many developers use sticky sessions as a quick fix, but this breaks horizontal scaling. Always prefer stateless applications with external session storage.
❌ Mistake 2: Not Configuring Health Checks Properly
Health checks that are too aggressive can mark healthy servers as unhealthy. Health checks that are too lenient can route traffic to failing servers. Find the right balance.
❌ Mistake 3: Single Load Balancer (No Redundancy)
A single load balancer is a single point of failure. Always use multiple load balancers in different availability zones with health checks and failover.
❌ Mistake 4: Wrong Algorithm for Use Case
Using round robin for long-lived connections or least connections for simple HTTP requests. Match the algorithm to your use case.
Interview Tips
How to Discuss Load Balancing in Interviews
Start with Requirements: Ask about traffic patterns, connection types, session requirements
Choose the Right Algorithm: Explain why you chose a specific algorithm based on requirements
Layer Selection: Discuss L4 vs L7 and when to use each
Health Checks: Always mention health checks and failover mechanisms
Redundancy: Discuss multiple load balancers for high availability
Session Management: Explain stateless design vs sticky sessions trade-offs
Key Points to Emphasize
✅ Load balancing enables horizontal scaling
✅ Health checks are critical for reliability
✅ Stateless applications are preferred over sticky sessions
✅ L4 for performance, L7 for intelligent routing
✅ Always have redundant load balancers
✅ Choose algorithm based on use case
Summary
Load balancing is fundamental to distributed systems. Key takeaways:
Purpose: Distribute traffic across multiple servers for scalability and availability
Algorithms: Round robin, least connections, weighted, IP hash, consistent hashing - each has specific use cases
L4 vs L7: L4 for performance, L7 for content-based routing
Health Checks: Essential for automatic failover and reliability
Types: Hardware, software, or cloud-managed - choose based on requirements
Best Practice: Stateless applications with external session storage
In the next section, we'll explore Database Scaling, which builds on load balancing concepts.
📝 Quiz: Test Your Understanding
Question 1: What is the main difference between Layer 4 and Layer 7 load balancing?
a) L4 is faster, L7 is slower
b) L4 is for HTTP, L7 is for TCP
c) L4 routes based on IP/port, L7 routes based on content/URL
d) There is no difference
Question 2: Which load balancing algorithm is best for long-lived connections like WebSockets?
a) Round Robin
b) Least Connections
c) IP Hash
d) Weighted Round Robin
Question 3: What is the main advantage of consistent hashing?
a) It's the fastest algorithm
b) It provides sticky sessions
c) It's the simplest to implement
d) Minimal redistribution when servers are added/removed
Question 4: Why are health checks critical for load balancers?
a) They automatically remove unhealthy servers from rotation
b) They improve server performance
c) They reduce latency
d) They enable sticky sessions
Question 5: What is the best practice for session management in horizontally scaled applications?
a) Use sticky sessions with IP hash
b) Store sessions in server memory
c) Make applications stateless with external session storage
Database Scaling
Database scaling is one of the most critical and challenging aspects of system design. As applications grow, databases often become the primary bottleneck. Understanding how to scale databases effectively is essential for building systems that can handle millions of users and billions of requests.
This topic covers the fundamental strategies for scaling databases: read replicas for scaling read operations, sharding for distributing data across multiple databases, replication strategies for high availability, and the CAP theorem for understanding trade-offs in distributed databases.
1. Read Replicas
Read replicas are copies of the primary (master) database that handle read operations. The primary database handles all write operations, and changes are replicated to read replicas asynchronously.
How Read Replicas Work
All writes go to the primary database
Primary database replicates changes to read replicas (asynchronously)
Read operations are distributed across read replicas
This allows horizontal scaling of read capacity
Read Replicas Architecture
Writes go to primary, reads distributed across replicas. Replication happens asynchronously.
Advantages:
Horizontal Read Scaling: Add more replicas to handle more read traffic
Geographic Distribution: Place replicas in different regions for lower latency
High Availability: If primary fails, promote a replica to primary
Offload Primary: Reduces load on primary database
Backup: Replicas can serve as backups
Disadvantages:
Replication Lag: Replicas may have stale data (eventual consistency)
Storage Cost: Each replica requires full database storage
Write Bottleneck: All writes still go through primary
Complexity: Need to handle replication failures and lag
Replication Modes: Synchronous vs Asynchronous
Understanding the replication mode is critical for system design interviews.
# Synchronous Replication
Process:
1. Client writes to primary
2. Primary writes to WAL
3. Primary sends to replica
4. Primary WAITS for replica acknowledgment
5. Primary commits transaction
6. Primary responds to client
Trade-offs:
✅ Zero data loss (if primary fails, data is on replica)
❌ Higher latency (network round-trip)
❌ Lower availability (if replica fails, write blocks)
❌ Lower throughput
Use Case: Financial systems, critical data
# Asynchronous Replication
Process:
1. Client writes to primary
2. Primary writes to WAL
3. Primary commits transaction
4. Primary responds to client (immediately)
5. Primary sends to replica (asynchronously, in background)
6. Replica applies changes
Trade-offs:
✅ Lower latency (no wait for replica)
✅ Higher availability (replica failure doesn't block writes)
✅ Higher throughput
❌ Possible data loss (if primary fails before replication)
❌ Replication lag (replicas may be behind)
Use Case: Most web applications, read scaling
Replication Lag: The Critical Problem
Replication lag is the time between when data is written to the primary and when it appears on replicas. This creates the "read-after-write" consistency problem.
# Replication Lag Scenarios
Scenario 1: Normal Operation
T0: Write to primary (balance = 1000)
T1: Primary commits (50ms)
T2: Replica receives update (100ms)
T3: Replica applies update (150ms)
Lag: 150ms (acceptable for most use cases)
Scenario 2: High Load
T0: Write to primary
T1: Primary commits (50ms)
T2: Replica receives update (500ms) ← Network congestion
T3: Replica applies update (2000ms) ← Replica overloaded
Lag: 2000ms (2 seconds - problematic!)
Scenario 3: Network Partition
T0: Write to primary
T1: Primary commits
T2: Replica cannot receive (network down)
Lag: ∞ (infinite - replica is stale until partition heals)
# Measuring Lag
MySQL: SHOW SLAVE STATUS → Seconds_Behind_Master
PostgreSQL: SELECT * FROM pg_stat_replication → replay_lag
MongoDB: rs.printSlaveReplicationInfo() → lag time
Read-After-Write Consistency Solutions
Interviewers frequently ask: "How do you ensure a user sees their own write immediately after writing?"
# Solution 1: Sticky Sessions (Session Affinity)
Approach:
- Route user's reads to same replica that will receive their writes
- Use load balancer with session affinity
- User always reads from replica that's "caught up" with their writes
Implementation:
User writes to primary → Primary replicates to Replica 1
User reads → Route to Replica 1 (same replica)
Replica 1 has the write (or will have it soon)
Trade-offs:
✅ Simple to implement
❌ Load imbalance (some replicas get more traffic)
❌ Doesn't work if user writes from different locations
# Solution 2: Read from Primary After Write
Approach:
- Track recent writes per user (last N seconds)
- If user wrote recently, read from primary
- Otherwise, read from replica
Implementation:
User writes → Mark user as "recent writer" (TTL: 1 second)
User reads → Check if "recent writer"
- If yes: Read from primary
- If no: Read from replica
Trade-offs:
✅ Guarantees read-after-write consistency
❌ Increases load on primary
❌ More complex application logic
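A sketch of Solution 2 using a short-lived per-user flag in Redis; the key name and TTL are illustrative and should roughly match your observed replication lag.
# Read-after-write routing: recent writers read from the primary (sketch)
import redis

r = redis.Redis()
RECENT_WRITE_TTL = 1  # seconds

def record_write(user_id):
    # Called right after the user writes to the primary
    r.set(f"recent_write:{user_id}", 1, ex=RECENT_WRITE_TTL)

def choose_database(user_id):
    # Recent writers read from the primary; everyone else hits a replica
    return "primary" if r.exists(f"recent_write:{user_id}") else "replica"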
# Solution 3: Timeline Consistency
Approach:
- Track replication lag per replica
- Only read from replicas with lag < threshold
- If all replicas lagged, read from primary
Implementation:
Monitor: Replica 1 lag = 50ms, Replica 2 lag = 200ms, Replica 3 lag = 500ms
Threshold: 100ms
Route reads to: Replica 1 only (or primary if all lagged)
Trade-offs:
✅ Distributes load across replicas
✅ Maintains consistency
❌ Some replicas may be underutilized
❌ Requires lag monitoring
Real-World Example: Instagram
Instagram uses read replicas extensively with a sophisticated routing strategy:
Feed Reads: Route to read replicas (eventual consistency acceptable)
Post Writes: Write to primary, then read from primary for next 1 second
Profile Updates: Write to primary, read from primary for 5 seconds (critical data)
Analytics: Read from dedicated analytics replicas (can be minutes behind)
This allows Instagram to scale read capacity to billions of users while maintaining consistency where it matters.
2. Database Sharding
Sharding (also called horizontal partitioning) splits a database into smaller, independent pieces called "shards". Each shard contains a subset of the data and can be stored on a separate database server.
Why Sharding?
When a single database can't handle the load (either due to size or query volume), sharding distributes data and queries across multiple databases. This allows horizontal scaling of both reads and writes.
Sharding Strategies
1. Range-Based Sharding
Data is partitioned based on a range of values (e.g., user IDs 1-1000 in shard 1, 1001-2000 in shard 2).
Pros: Simple to implement, efficient range queries
Cons: Uneven key distribution can create hotspots
2. Hash-Based Sharding
Data is partitioned by applying a hash function to the sharding key (e.g., hash(user_id) % number_of_shards) to pick a shard.
Pros: Even data distribution across shards
Cons: Hard to add/remove shards (requires rehashing all data)
3. Directory-Based Sharding
A lookup service (directory) maps keys to shards. This allows flexible shard assignment.
Pros: Flexible, easy to rebalance
Cons: Single point of failure (directory), additional lookup overhead
Database Sharding
Shard router determines which shard handles each request based on sharding key.
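A sketch of hash-based shard routing in Python; the shard names are hypothetical, and the modulo approach shown is the simplest form (see the rehashing caveat above).
# Hash-based shard router (illustrative sketch)
import hashlib

SHARDS = ["users_shard_0", "users_shard_1", "users_shard_2", "users_shard_3"]

def shard_for(user_id):
    # Hash the sharding key and map it onto one of the shards
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for(12345))  # every query for this user goes to the same shard
# Caveat: changing len(SHARDS) remaps most keys; consistent hashing avoids this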
Challenges with Sharding:
Cross-Shard Queries: Joins across shards are expensive or impossible
Rebalancing: Adding/removing shards requires data migration
Hotspots: Uneven data distribution can overload specific shards
Complexity: Application logic must handle shard routing
When to Use:
Database size exceeds single server capacity
Write throughput exceeds single server capacity
Need to scale both reads and writes horizontally
Data can be partitioned without frequent cross-shard queries
3. Vertical vs Horizontal Partitioning
Vertical Partitioning
Splits a table by columns - different columns stored in different tables or databases. Useful when some columns are accessed more frequently than others.
Use Case: When you have wide tables with columns accessed at different frequencies
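For contrast with the sharded layout shown below, here is a sketch of the same users table split by columns; the column groupings are hypothetical.
# Vertically partitioned (by columns)
Table users_core:    id, username, email, password_hash   (hot columns, read on every request)
Table users_profile: id, bio, avatar_url, preferences     (cold columns, read occasionally)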
Horizontal Partitioning (Sharding)
Splits a table by rows - different rows stored in different databases. This is what we call "sharding".
# Horizontally partitioned (sharded)
Shard 1: Users with id 1-1,000,000
Shard 2: Users with id 1,000,001-2,000,000
Shard 3: Users with id 2,000,001-3,000,000
Use Case: When table size or query volume exceeds single server capacity
4. Replication: Deep Technical Dive
Understanding how replication actually works at the database engine level is critical for system design interviews. This section covers the internal mechanisms, algorithms, and real-world implementations.
4.1 How Replication Works: The Fundamentals
Write-Ahead Log (WAL) - The Foundation
Most databases use a Write-Ahead Log (WAL) for replication. Before any data change is written to the actual data files, it's first written to a log file. This log becomes the source of truth for replication.
# Write-Ahead Log Process
1. Client sends: UPDATE users SET balance = 1000 WHERE id = 123
2. Primary Database:
a) Write to WAL: "UPDATE users SET balance = 1000 WHERE id = 123" (log entry #1001)
b) Apply change to data file (in-memory buffer)
c) Acknowledge to client: "OK"
d) Flush data to disk (asynchronously)
3. Replication Process:
a) Replica reads WAL entry #1001
b) Replica applies same change: UPDATE users SET balance = 1000 WHERE id = 123
c) Replica updates its own data file
d) Replica acknowledges: "Applied log entry #1001"
# Key Point: WAL ensures durability and enables replication
# If primary crashes, WAL can be replayed to recover state
Replication Methods: Statement-Based vs Row-Based
Statement-Based Replication (SBR)
The master logs the actual SQL statement, and replicas execute the same statement.
# Master receives:
UPDATE users SET balance = balance + 100 WHERE id = 123
# Master logs to binlog:
"UPDATE users SET balance = balance + 100 WHERE id = 123"
# Replica receives and executes:
UPDATE users SET balance = balance + 100 WHERE id = 123
# Problem: Non-deterministic functions cause issues
# Example: NOW(), RAND(), UUID() produce different values when the replica re-executes the statement
# e.g. UPDATE users SET last_login = NOW() stores a different timestamp on the replica
# Divergence also compounds: if the replica's data already differs (master balance=1000, replica=900),
# re-executing "balance = balance + 100" gives master=1100, replica=1000 (WRONG!)
Pros: Compact log size, human-readable
Cons: Non-deterministic functions cause inconsistencies, slower on replicas
Row-Based Replication (RBR)
The master logs the actual row changes (before/after values), and replicas apply the exact changes.
# Master receives:
UPDATE users SET balance = balance + 100 WHERE id = 123
# Master logs to binlog (row-based):
Row Change Event:
Table: users
Action: UPDATE
Before: {id: 123, balance: 1000}
After: {id: 123, balance: 1100}
# Replica receives and applies:
Directly set balance = 1100 (doesn't re-execute the calculation)
# Result: Guaranteed consistency, even with non-deterministic functions
Pros: Guaranteed consistency, works with any function
Cons: Larger log size, less human-readable
Mixed Replication
In MIXED mode, MySQL logs statements by default but automatically switches to row-based logging for non-deterministic statements. (Note: since MySQL 5.7.7 the default binlog_format is ROW.)
4.2 MySQL Replication: Deep Dive
MySQL Binary Log (binlog) Architecture
MySQL uses a binary log (binlog) to record all changes. The replication process involves multiple components:
# MySQL Replication Components
1. Master Components:
- binlog: Binary log file (sequential, append-only)
- binlog index: Tracks all binlog files
- Dump thread: Sends binlog events to replicas
2. Replica Components:
- Relay log: Temporary storage for events from master
- I/O thread: Connects to master, reads binlog, writes to relay log
- SQL thread: Reads relay log, applies changes to replica database
# Replication Flow:
Master: Write → binlog → Dump Thread → Network → Replica I/O Thread → Relay Log → SQL Thread → Replica DB
MySQL Replication Process Step-by-Step
# Detailed MySQL Replication Process
Step 1: Initial Setup (One-time)
- Master: Enable binlog (log_bin = ON)
- Replica: Configure master connection info
- Replica: Take snapshot of master data (mysqldump or XtraBackup)
- Replica: Load snapshot into replica database
- Replica: Record master's binlog position (e.g., "binlog.000001:1542")
Step 2: Continuous Replication
a) Replica I/O Thread connects to master
b) Replica requests: "Send me events from binlog.000001:1542"
c) Master Dump Thread reads binlog from position 1542
d) Master sends events to replica I/O thread
e) Replica I/O thread writes events to relay log
f) Replica SQL thread reads relay log
g) Replica SQL thread applies changes to replica database
h) Replica updates position: "binlog.000001:2000"
i) Process repeats
Step 3: Handling Failures
- If I/O thread disconnects: Reconnects and resumes from last position
- If SQL thread fails: Stops applying, I/O thread continues receiving
- If master fails: Replica can be promoted (manual or automatic)
MySQL Replication Lag: Causes and Measurement
Replication lag is the delay between when a change is written to the master and when it appears on the replica. This is critical to understand in interviews.
# Causes of Replication Lag
1. Network Latency:
- Master and replica in different data centers
- Example: 50ms network latency = minimum 50ms lag
2. Replica Load:
- Replica handling too many read queries
- SQL thread can't keep up with I/O thread
- Solution: Add more replicas, reduce read load per replica
3. Single-Threaded SQL Thread:
- MySQL replica SQL thread is single-threaded (until MySQL 5.6)
- Master can write in parallel, replica applies sequentially
- Solution: Parallel replication (MySQL 5.6+), use row-based replication
4. Large Transactions:
- Single transaction with millions of rows
- Replica must apply entire transaction atomically
- Solution: Break into smaller transactions
# Measuring Replication Lag
MySQL provides:
SHOW SLAVE STATUS\G
Key metrics:
- Seconds_Behind_Master: Estimated lag in seconds
- Read_Master_Log_Pos: Last binlog position read
- Exec_Master_Log_Pos: Last binlog position applied
- Read_Master_Log_Pos - Exec_Master_Log_Pos: binlog bytes received but not yet applied (position-based lag)
4.3 PostgreSQL Replication: Deep Dive
PostgreSQL Write-Ahead Log (WAL)
PostgreSQL uses WAL files (16MB segments) for replication. The replication mechanism is different from MySQL.
# PostgreSQL WAL Architecture
1. WAL Segments:
- Each segment: 16MB (default)
- Naming: 000000010000000000000001 (24 hex digits)
- Format: 8 hex digits timeline ID + 8 hex digits log sequence number + 8 hex digits segment number
2. WAL Records:
- Each change creates a WAL record
- Record contains: transaction ID, table OID, row data, operation type
- Records are written sequentially
3. Replication Methods:
a) WAL Shipping (File-Based):
- Master archives WAL files
- Replica copies WAL files
- Replica replays WAL files
- High latency (file-based)
b) Streaming Replication (Real-Time):
- Replica connects to master
- Master streams WAL records as they're written
- Replica applies in real-time
- Low latency (streaming)
PostgreSQL Streaming Replication Process
# PostgreSQL Streaming Replication
Step 1: Setup
- Master: wal_level = replica (or logical)
- Master: max_wal_senders = 10 (allows 10 replicas)
- Replica: Configure the connection to the primary (recovery.conf before PostgreSQL 12; primary_conninfo plus a standby.signal file in 12+)
- Replica: Take base backup (pg_basebackup)
Step 2: Streaming Process
a) Replica connects to master via replication protocol
b) Replica sends: "Start streaming from WAL position X"
c) Master identifies WAL position X in its WAL files
d) Master streams WAL records starting from position X
e) Replica receives WAL records into receive buffer
f) Replica applies WAL records to its database
g) Replica sends acknowledgment: "Applied up to position Y"
h) Master tracks replica's progress
i) Process continues in real-time
Step 3: Synchronous vs Asynchronous
- Asynchronous (default): Master doesn't wait for replica
- Synchronous: Master waits for replica acknowledgment
- synchronous_standby_names = 'replica1'
- Trade-off: Lower latency vs higher durability
4.4 Consensus Algorithms: Raft and Paxos
For multi-master replication and leader election, databases use consensus algorithms to ensure all nodes agree on the same state.
Raft Algorithm: Leader Election and Log Replication
Raft is a consensus algorithm designed to be understandable. It's used by etcd, Consul, and many distributed databases.
# Raft Algorithm Overview
Raft has three key components:
1. Leader Election
2. Log Replication
3. Safety (ensuring consistency)
# Raft States
- Leader: Handles all client requests, replicates to followers
- Follower: Receives updates from leader, votes in elections
- Candidate: Temporary state during leader election
# Leader Election Process
Step 1: Election Timeout
- Each follower has random election timeout (150-300ms)
- If follower doesn't hear from leader, becomes candidate
Step 2: Candidate Requests Votes
- Candidate increments its term (election term)
- Candidate votes for itself
- Candidate sends RequestVote RPC to all other nodes
- Request includes: candidate's term, last log index, last log term
Step 3: Followers Vote
- Follower votes "yes" if:
a) Candidate's term >= follower's term
b) Candidate's log is at least as up-to-date as follower's
- Follower votes "no" if already voted for another candidate
Step 4: Majority Wins
- Candidate becomes leader if receives majority votes
- Leader sends heartbeat to all followers
- If no majority: election timeout, new election
# Log Replication Process
Step 1: Client Request
- Client sends write request to leader
- Leader appends entry to its log (not yet committed)
Step 2: Replication
- Leader sends AppendEntries RPC to all followers
- Followers append entry to their logs
- Followers send acknowledgment
Step 3: Commitment
- Leader waits for majority acknowledgments
- Leader commits entry (applies to state machine)
- Leader sends commit message to followers
- Followers commit and apply to state machine
- Leader responds to client: "Success"
# Safety Guarantees
- Election Safety: At most one leader per term
- Log Matching: If two logs have entry with same index and term, they're identical
- Leader Completeness: Committed entries from previous terms are present in new leader's log
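A minimal sketch of the follower's vote-granting rule described above, assuming a simple in-memory state object with current_term, voted_for, and log fields (hypothetical; real implementations also persist state and handle RPC plumbing):
def handle_request_vote(state, term, candidate_id, last_log_index, last_log_term):
    """Decide whether this follower grants its vote (simplified Raft rule)."""
    if term < state.current_term:
        return False  # Reject candidates from an older term
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None  # New term: this node has not voted yet
    # Candidate's log must be at least as up-to-date as ours (compare term first, then index)
    our_last_term = state.log[-1].term if state.log else 0
    our_last_index = len(state.log)
    log_ok = (last_log_term, last_log_index) >= (our_last_term, our_last_index)
    # Grant at most one vote per term
    if log_ok and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id
        return True
    return False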
Paxos Algorithm: The Classic Consensus Protocol
Paxos is the foundational consensus algorithm, used by Google's Chubby, Amazon's DynamoDB (modified), and many distributed systems.
# Paxos Algorithm (Simplified)
Paxos has three roles:
1. Proposers: Propose values
2. Acceptors: Accept proposals
3. Learners: Learn the chosen value
# Basic Paxos Process
Phase 1: Prepare
a) Proposer sends Prepare(n) to majority of acceptors
- n = proposal number (unique, increasing)
b) Acceptor responds:
- If n > highest proposal seen: Promise not to accept proposals < n
- Include highest proposal number accepted (if any)
c) Proposer collects promises from majority
Phase 2: Accept
a) Proposer sends Accept(n, v) to majority
- v = value (if no previous value, use proposer's value)
- If previous value exists, use that value
b) Acceptor accepts if n >= highest proposal number promised
c) Acceptor responds with acceptance
d) Proposer collects acceptances from majority
Phase 3: Learn
- Once majority accepts, value is "chosen"
- Learners learn the chosen value
- All nodes eventually learn the same value
# Key Properties
- Safety: Only one value can be chosen
- Liveness: Eventually a value is chosen (if no failures)
- Fault Tolerance: Works with up to (N-1)/2 failures (N = total nodes)
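A minimal sketch of an acceptor's Prepare/Accept handling under the rules above (in-memory only; a real Paxos acceptor must persist its promises to disk):
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # Highest proposal number promised
        self.accepted_n = -1        # Highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise not to accept proposals numbered < n
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", None, None)

    def accept(self, n, value):
        # Phase 2: accept if we haven't promised a higher proposal number
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return "accepted"
        return "rejected"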
4.5 DynamoDB Replication: Deep Dive
Amazon DynamoDB uses a unique replication model based on the Dynamo paper. Understanding this is critical for interviews.
DynamoDB Architecture
# DynamoDB Replication Model
1. Data Partitioning:
- Table partitioned across multiple nodes using consistent hashing
- Each partition has 3 replicas (by default)
- Replicas stored in different availability zones
2. Replication Strategy:
- Synchronous replication to 2 of 3 replicas
- Asynchronous replication to 3rd replica
- Quorum-based reads and writes
3. Quorum Reads:
- Read from R replicas (R = read quorum, typically 2)
- Wait for R responses
- Return most recent version (based on vector clock)
- If versions conflict, return all versions (client resolves)
4. Quorum Writes:
- Write to W replicas (W = write quorum, typically 2)
- Wait for W acknowledgments
- Asynchronously replicate to remaining replicas
- Rule: R + W > N (where N = total replicas, typically 3)
- This ensures read-write overlap (at least one node sees both)
# Example: N=3, R=2, W=2
- Write: Update replicas 1 and 2
- Read: Read from replicas 2 and 3
- Overlap: Replica 2 sees both read and write (ensures consistency)
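An illustrative sketch of the quorum rule (R + W > N) - not DynamoDB's actual implementation - assuming each replica exposes hypothetical put/get methods and stores (version, value) pairs:
N, R, W = 3, 2, 2
assert R + W > N  # Guarantees at least one replica sees both the write and the read

def quorum_write(replicas, key, value, version):
    acks = 0
    for replica in replicas:
        if replica.put(key, value, version):  # Hypothetical replica API
            acks += 1
        if acks >= W:
            return True  # Remaining replicas catch up asynchronously
    return False

def quorum_read(replicas, key):
    # Collect R responses and return the value with the highest version
    responses = [replica.get(key) for replica in replicas[:R]]  # (version, value) pairs
    return max(responses, key=lambda r: r[0])[1]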
DynamoDB Vector Clocks and Conflict Resolution
DynamoDB uses vector clocks to track causality and resolve conflicts.
# Vector Clocks in DynamoDB
Vector Clock: [Node1: 5, Node2: 3, Node3: 7]
- Node1 has seen 5 events from itself
- Node1 has seen 3 events from Node2
- Node1 has seen 7 events from Node3
# Conflict Resolution
Scenario: Two clients update same item concurrently
Client A (via Node1):
Write: balance = 1000
Vector Clock: [Node1: 1, Node2: 0, Node3: 0]
Client B (via Node2):
Write: balance = 2000
Vector Clock: [Node1: 0, Node2: 1, Node3: 0]
Read (from Node1 and Node2):
Node1 returns: balance = 1000, VC = [1,0,0]
Node2 returns: balance = 2000, VC = [0,1,0]
Conflict detected: Neither vector clock is "happens-before" the other
DynamoDB returns both versions: [1000, 2000]
Application must resolve (last-write-wins, merge, etc.)
# Last-Write-Wins (DynamoDB Default)
- DynamoDB uses timestamp for conflict resolution
- If timestamps conflict, uses node ID
- Client can also provide custom conflict resolution
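A minimal sketch of the vector-clock comparison used to detect the conflict above, with clocks represented as dicts mapping node id to counter:
def happens_before(vc_a, vc_b):
    """True if vc_a <= vc_b component-wise, with at least one strictly smaller counter."""
    nodes = set(vc_a) | set(vc_b)
    return (all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
            and any(vc_a.get(n, 0) < vc_b.get(n, 0) for n in nodes))

def compare(vc_a, vc_b):
    if happens_before(vc_a, vc_b):
        return "a happened before b (b wins)"
    if happens_before(vc_b, vc_a):
        return "b happened before a (a wins)"
    return "concurrent - conflict, return both versions"

# The scenario above: neither clock dominates, so both versions are returned to the client
print(compare({"Node1": 1}, {"Node2": 1}))  # concurrent - conflict, return both versions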
DynamoDB Consistency Levels
# DynamoDB Read Consistency
1. Eventually Consistent Reads (Default):
- Read from any replica
- May return stale data
- Lower latency, lower cost
- Use case: Non-critical reads
2. Strongly Consistent Reads:
- Read from all replicas
- Wait for quorum
- Returns latest data
- Higher latency, higher cost
- Use case: Critical reads (account balance)
# Example:
Eventually Consistent:
- Read from replica 1 (might be 100ms behind)
- Latency: 10ms
- Cost: 1 read unit
Strongly Consistent:
- Read from replicas 1, 2, 3
- Wait for all responses
- Return latest
- Latency: 30ms (network to 3 replicas)
- Cost: 2 read units (2x more expensive)
4.6 MongoDB Replication: Replica Sets
MongoDB uses replica sets for replication. Understanding the oplog and election process is key.
MongoDB Replica Set Architecture
# MongoDB Replica Set
Replica Set Members:
1. Primary: One primary handles all writes
2. Secondaries: Multiple secondaries replicate from primary
3. Arbiter: Optional, votes in elections but doesn't store data
# Oplog (Operations Log)
- Capped collection (fixed size, circular)
- Stores all write operations
- Format: {ts: Timestamp, op: "i"|"u"|"d", ns: "db.collection", o: data}
- Example: {ts: Timestamp(1234567890, 1), op: "i", ns: "test.users", o: {_id: 1, name: "John"}}
# Replication Process
Step 1: Write to Primary
- Client writes to primary
- Primary writes to oplog
- Primary applies to data files
- Primary responds to client
Step 2: Replication to Secondaries
- Secondary queries primary's oplog: "Give me entries after timestamp X"
- Primary sends oplog entries
- Secondary applies oplog entries to its data files
- Secondary updates its timestamp
Step 3: Heartbeat and Election
- Primary sends heartbeat every 2 seconds
- If secondary doesn't receive heartbeat for 10 seconds, starts election
- Election: Secondary with highest priority or most recent oplog becomes primary
MongoDB Read Preferences and Write Concerns
# MongoDB Read Preferences
1. primary (Default):
- Read only from primary
- Strong consistency
- Use case: Critical reads
2. primaryPreferred:
- Read from primary, fallback to secondary if primary unavailable
- Use case: Prefer consistency, but allow availability
3. secondary:
- Read only from secondaries
- Eventual consistency
- Use case: Analytics, reporting
4. secondaryPreferred:
- Read from secondary, fallback to primary
- Use case: Distribute read load
5. nearest:
- Read from nearest node (lowest latency)
- Use case: Geographic distribution
# MongoDB Write Concerns
Write Concern specifies how many nodes must acknowledge write:
w: 1 (Default)
- Primary acknowledges
- Fast, but data might be lost if primary fails
w: "majority"
- Majority of nodes acknowledge
- Slower, but safer
- Use case: Critical writes
w: 2
- At least 2 nodes acknowledge
- Balance between speed and safety
j: true (Journal)
- Write must be written to journal (durable)
- Ensures data survives crash
- Use case: Critical data
# Example:
db.users.insert({name: "John"}, {writeConcern: {w: "majority", j: true}})
- Write to primary
- Replicate to majority of secondaries
- Wait for journal flush
- Acknowledge to client
4.7 Replication Lag: Deep Dive
Understanding replication lag is critical. Interviewers will ask: "What happens if a user writes data and immediately reads from a replica?"
Read-After-Write Consistency Problem
# The Problem
Scenario:
1. User writes: UPDATE balance = 1000 WHERE id = 123 (to primary)
2. User immediately reads: SELECT balance FROM users WHERE id = 123 (from replica)
3. Replica hasn't received update yet (replication lag = 200ms)
4. User sees: balance = 500 (stale data!)
# Solutions
Solution 1: Read from Primary After Write
- Track recent writes (last 1 second) in application
- If user wrote recently, read from primary
- Otherwise, read from replica
- Trade-off: Some reads go to primary (increases load)
Solution 2: Sticky Sessions
- Route user's reads to same replica that received their writes
- Requires session affinity
- Trade-off: Load imbalance, complex routing
Solution 3: Wait for Replication
- After write, wait for replica to acknowledge
- Then allow reads from replica
- Trade-off: Higher write latency
Solution 4: Timeline Consistency
- Track replication lag per replica
- Only read from replicas with lag < threshold (e.g., 100ms)
- Trade-off: Some replicas might be unavailable
Measuring and Monitoring Replication Lag
# Monitoring Replication Lag
MySQL:
SHOW SLAVE STATUS\G
- Seconds_Behind_Master: Lag in seconds
- Read_Master_Log_Pos: Last position read
- Exec_Master_Log_Pos: Last position applied
PostgreSQL:
SELECT * FROM pg_stat_replication;
- write_lag, flush_lag, replay_lag: time-based lag (interval columns, PostgreSQL 10+)
- Byte lag: pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
MongoDB:
rs.printSlaveReplicationInfo()
- Shows lag for each secondary
# Application-Level Monitoring
- Write timestamp to primary
- Read timestamp from replica
- Calculate difference = replication lag
- Alert if lag > threshold (e.g., 1 second)
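A minimal sketch of the heartbeat approach described above, assuming separate primary/replica connections and a small heartbeat table (helper names are hypothetical):
import time

def write_heartbeat(primary_db):
    # A scheduled job periodically records the current time on the primary
    primary_db.execute("UPDATE heartbeat SET ts = ? WHERE id = 1", time.time())

def measure_lag(replica_db, alert, threshold_seconds=1.0):
    # Read the heartbeat back from the replica; the difference is the replication lag
    row = replica_db.query("SELECT ts FROM heartbeat WHERE id = 1")
    lag = time.time() - row["ts"]
    if lag > threshold_seconds:
        alert(f"Replication lag {lag:.2f}s exceeds {threshold_seconds}s")
    return lag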
4.8 Conflict Resolution in Multi-Master Replication
When multiple masters can accept writes, conflicts are inevitable. Understanding conflict resolution strategies is essential.
# Conflict Resolution Strategies
1. Last-Write-Wins (LWW):
- Use timestamp to determine winner
- Simple, but can lose data
- Example: DynamoDB default
2. Vector Clocks:
- Track causality
- Detect conflicts
- Return all versions, client resolves
- Example: DynamoDB, Riak
3. Operational Transformation:
- Transform operations to resolve conflicts
- Used in collaborative editing (Google Docs)
- Complex, but preserves intent
4. CRDTs (Conflict-Free Replicated Data Types):
- Data structures designed for conflict-free merging
- Examples: Counters, Sets, Maps
- Automatically resolve conflicts
- Example: Riak, Redis
5. Application-Level Resolution:
- Database returns all conflicting versions
- Application logic resolves (merge, choose, etc.)
- Most flexible, but most complex
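A minimal sketch of a CRDT grow-only counter (G-Counter), one of the conflict-free types listed above: each node increments only its own slot, and merges take the per-node maximum, so merge order never matters:
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Conflict-free merge: take the max per node
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas increment independently, then converge after merging in any order
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5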
5. CAP Theorem: Deep Dive
The CAP theorem (Consistency, Availability, Partition Tolerance) is fundamental to understanding distributed systems. However, it's often misunderstood. Let's dive deep into what it really means.
5.1 The Three Properties Defined
Consistency (C)
Linearizability - All nodes see the same data at the same time. After a write completes, all subsequent reads (from any node) must return that value or a more recent value.
# Consistency Example
Node 1: Write x = 100 (at time T1)
Node 2: Read x (at time T2, where T2 > T1)
Result: Node 2 MUST see x = 100 (not stale value)
# Violation of Consistency:
Node 1: Write x = 100
Node 2: Read x → returns 50 (stale value) ❌
Node 3: Read x → returns 100 (correct value)
This is inconsistent - nodes see different values
Availability (A)
Every request to a non-failing node receives a response (not an error or timeout). The system continues operating even if some nodes fail.
# Availability Example
System with 3 nodes:
- Node 1 fails
- Node 2 and 3 continue serving requests ✅
# Violation of Availability:
- Node 1 fails
- System stops responding to all requests ❌
- Returns "Service Unavailable" error
Partition Tolerance (P)
The system continues operating despite network partitions (nodes can't communicate). This is unavoidable in distributed systems - network partitions WILL happen.
# Network Partition Example
Original Network:
Node 1 ←→ Node 2 ←→ Node 3
Network Partition:
Cluster A: Node 1, Node 2 (can communicate)
Cluster B: Node 3 (isolated)
Partition-Tolerant System:
- Both clusters continue operating ✅
- May have consistency issues (expected)
Non-Partition-Tolerant System:
- System stops operating ❌
- Not practical for distributed systems
5.2 CAP Theorem: The Real Meaning
During a network partition, you must choose between Consistency and Availability. You cannot have both.
# CAP Theorem Scenarios
Scenario: Network Partition
Cluster A: Node 1, Node 2
Cluster B: Node 3
Client in Cluster A writes: x = 100
Client in Cluster B reads: x = ?
Option 1: Choose Consistency (CP)
- Cluster B cannot read (doesn't have latest value)
- Returns error or blocks until partition heals
- System is unavailable for Cluster B ❌
- But maintains consistency ✅
Option 2: Choose Availability (AP)
- Cluster B returns stale value (x = 50)
- System is available ✅
- But violates consistency ❌
Option 3: Choose CA (Impossible)
- Can't have both during partition
- Only possible in non-distributed system (single node)
- Not practical for distributed systems
5.3 Real-World CAP Implementations
CP Systems: Consistency + Partition Tolerance
# CP Systems Examples
1. PostgreSQL (with synchronous replication):
- During partition: Blocks writes until majority available
- Ensures consistency
- Trade-off: May be unavailable
2. MongoDB (with write concern "majority"):
- Requires majority to acknowledge writes
- During partition: If no majority, blocks writes
- Ensures consistency
- Trade-off: Availability
3. HBase:
- Strong consistency model
- During partition: May block operations
- Ensures consistency
- Trade-off: Availability
# CP System Behavior:
- Prioritizes consistency over availability
- During partition: May return errors or block
- Use case: Financial systems, critical data
AP Systems: Availability + Partition Tolerance
# AP Systems Examples
1. DynamoDB:
- Always accepts reads and writes
- During partition: Returns available data (may be stale)
- Eventual consistency
- Trade-off: May return inconsistent data
2. Cassandra:
- Always accepts reads and writes
- During partition: Returns data from available nodes
- Eventual consistency
- Trade-off: Consistency
3. CouchDB:
- Multi-master replication
- Always available
- Conflict resolution required
- Trade-off: Consistency
# AP System Behavior:
- Prioritizes availability over consistency
- During partition: Continues serving requests
- May return stale or conflicting data
- Use case: Social media, content delivery
5.4 CAP Theorem Nuances and Common Misconceptions
Misconception 1: "You choose CAP once for your system"
Reality: CAP is a per-operation choice. Different operations can have different CAP trade-offs.
# Example: DynamoDB
Read Operation:
- Eventually Consistent Read: AP (available, may be inconsistent)
- Strongly Consistent Read: CP (consistent, may be unavailable)
Write Operation:
- Standard Write: AP (available, eventual consistency)
- Conditional Write: CP (consistent, may fail if condition not met)
# System can be both AP and CP depending on operation!
Misconception 2: "CP means always consistent"
Reality: CP systems are consistent ONLY when there's no partition. During normal operation, they can be both consistent and available.
Misconception 3: "Partitions are rare"
Reality: Network partitions are common in distributed systems. They can be caused by:
Network switches failing
Router misconfigurations
Firewall rules
Network congestion
Data center connectivity issues
5.5 Beyond CAP: ACID vs BASE
ACID (Traditional Databases)
# ACID Properties
Atomicity: All or nothing
- Transaction either completes fully or not at all
- Example: Transfer $100 (debit $100, credit $100) - both or neither
Consistency: Database remains in valid state
- Constraints are always satisfied
- Example: Balance cannot be negative
Isolation: Concurrent transactions don't interfere
- Transactions see consistent view
- Example: Read committed, serializable isolation levels
Durability: Committed changes persist
- Written to disk, survives crashes
- Example: WAL ensures durability
# ACID systems generally make CP-style trade-offs (consistency prioritized over availability)
# Trade-off: Lower availability, higher latency
BASE (NoSQL Databases)
# BASE Properties
Basically Available: System is available most of the time
- Accepts requests even during failures
- Example: DynamoDB always accepts reads/writes
Soft State: System state may change without input
- Replication may update state
- Example: Replica eventually receives updates
Eventual Consistency: System will become consistent
- Given enough time, all nodes see same data
- Example: After replication lag, all replicas have same data
# BASE systems generally make AP-style trade-offs (availability prioritized over consistency)
# Trade-off: Lower consistency, higher availability
5.6 Choosing the Right CAP Trade-off
# Decision Framework
Choose CP (Consistency + Partition Tolerance) when:
- Data correctness is critical
- Financial transactions
- User account balances
- Inventory management
- Can tolerate temporary unavailability
- Examples: PostgreSQL, MongoDB (with strong consistency)
Choose AP (Availability + Partition Tolerance) when:
- High availability is critical
- Slight inconsistency is acceptable
- Social media feeds
- Content delivery
- Recommendations
- Can handle eventual consistency
- Examples: DynamoDB, Cassandra, CouchDB
# Hybrid Approach (Common in Practice):
- Critical operations: CP (e.g., payment processing)
- Non-critical operations: AP (e.g., recommendations)
- Different consistency levels per operation
- Example: E-commerce site
* Orders: CP (must be consistent)
* Product recommendations: AP (can be eventually consistent)
Common Mistakes to Avoid
❌ Sharding too early: Sharding adds complexity - only shard when necessary
❌ Ignoring replication lag: Read replicas have lag - don't read immediately after write
❌ Cross-shard queries: Avoid joins across shards - they're expensive
❌ Wrong sharding key: Choose a key that distributes data evenly
❌ Not planning for growth: Design sharding strategy that can scale
How to Discuss Database Scaling in Interviews
Database scaling is a critical topic in system design interviews. Here's how to approach it effectively.
Step-by-Step Interview Approach
1. Start with Read Replicas
# Always Start Here
"I'd first implement read replicas because:
1. Simple to implement (most databases support it)
2. Solves read scaling without complexity
3. No application code changes needed (just route reads)
4. Provides high availability (failover to replica)
This handles 80% of scaling needs for most applications."
2. Discuss Replication Lag
# Critical Point to Mention
"However, read replicas introduce replication lag:
- Writes go to master, then replicate to replicas
- Replicas may be seconds behind master
- This means eventual consistency
I'd handle this by:
- Reading from master for critical reads (user's own data)
- Reading from replicas for non-critical reads (other users' data)
- Using version numbers or timestamps to detect stale data"
3. When to Consider Sharding
# When Read Replicas Aren't Enough
"Sharding becomes necessary when:
- Write throughput exceeds single master capacity
- Database size exceeds single server limits
- Geographic distribution needed
I'd choose sharding strategy based on:
- Range-based: For time-series, sequential data
- Hash-based: For even distribution, no hotspots
- Directory-based: For complex routing, flexibility"
Common Interview Questions
Question 1: "How would you scale a database for a social media app?"
# Strong Answer Structure
1. Start with read replicas:
"Social media is read-heavy (10:1 read:write ratio).
I'd add read replicas to handle feed generation, profile views."
2. Discuss sharding strategy:
"For writes, I'd shard by user_id using hash-based sharding.
This ensures even distribution and allows parallel writes."
3. Mention caching:
"I'd add Redis cache for hot data (user profiles, trending posts).
This reduces database load significantly."
4. Discuss consistency:
"For user's own posts, read from master (strong consistency).
For feed/timeline, read from replicas (eventual consistency is OK)."
Question 2: "How do you handle replication lag?"
# Comprehensive Answer
1. Acknowledge the problem:
"Replication lag is inevitable with async replication.
Replicas can be seconds behind master."
2. Solutions:
- Read-your-writes: Route user's own reads to master
- Version vectors: Track data versions, detect stale reads
- Quorum reads: Read from multiple replicas, use latest
- Timeline consistency: Route reads based on write timestamp
3. Trade-offs:
"Synchronous replication eliminates lag but reduces write throughput.
For most apps, eventual consistency is acceptable."
Question 3: "When would you use master-slave vs master-master?"
# Decision Framework
Master-Slave (Most Common):
✅ Simpler to implement
✅ No conflict resolution needed
✅ Clear write path (always master)
❌ Single point of failure for writes
❌ Geographic limitations
Master-Master:
✅ Write scaling (writes to both)
✅ Geographic distribution
✅ Higher availability
❌ Complex conflict resolution
❌ Eventual consistency challenges
❌ More complex to implement
"I'd use master-slave for most cases, master-master only if
I need write scaling and can handle conflict resolution."
Key Points to Emphasize
✅ Start simple: Read replicas first, sharding only when needed
Caching: Deep Dive
Caching is one of the most critical performance optimization techniques in system design. Understanding caching strategies, patterns, and trade-offs is essential for building high-performance systems. This topic covers everything from browser caching to distributed caching systems used by major tech companies.
In FAANG interviews, you'll be asked about cache invalidation strategies, cache eviction policies, distributed caching architectures, and how to handle cache consistency. This deep dive covers all of that with real-world examples from AWS, Google Cloud, Azure, and industry leaders.
1. Cache Levels: The Caching Hierarchy
Caching exists at multiple levels in a system. Understanding each level and when to use it is crucial for system design.
1.1 Browser Cache
The browser cache stores resources locally on the user's device. This is the fastest cache but has limited control from the server side.
# Browser Cache Mechanisms
1. HTTP Cache Headers:
Cache-Control: max-age=3600, public
- max-age: Cache for 3600 seconds (1 hour)
- public: Can be cached by any cache
- private: Only browser can cache
- no-cache: Must revalidate before use
- no-store: Don't cache at all
ETag: "abc123"
- Resource version identifier
- Browser sends If-None-Match: "abc123"
- Server responds 304 Not Modified if unchanged
Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT
- Resource modification time
- Browser sends If-Modified-Since
- Server responds 304 if not modified
2. Cache Storage:
- Memory Cache: Fast, limited size (~100MB)
- Disk Cache: Slower, larger size (~1GB)
- Service Worker Cache: Programmatic control
# Real-World Example: Google
- Static assets (JS, CSS): max-age=31536000 (1 year)
- HTML: max-age=3600, must-revalidate
- Images: max-age=86400 (1 day)
- API responses: no-cache (always fresh)
1.2 CDN Cache (Content Delivery Network)
CDN caches content at edge locations worldwide. CDNs are an important part of the caching hierarchy for delivering content globally with low latency.
1.3 Reverse Proxy Cache
A reverse proxy (like Nginx, Varnish) sits in front of application servers and caches responses.
Reverse Proxy Caching Architecture
How It Works: The reverse proxy intercepts requests before they reach application servers.
If the response is cached, it's served immediately. If not, the request is forwarded to the backend,
the response is cached, and returned to the client. Real-World: Varnish Cache (used by Wikipedia, The Guardian)
can handle 100K+ requests/second.
1.4 Application Cache
Application-level caching stores frequently accessed data in memory (in-process) or in a separate cache service (out-of-process).
Application Cache Types: Comparison
Key Trade-off: In-process caches are faster (nanoseconds) but limited to a single server.
Out-of-process caches (like Redis) are shared across servers and survive restarts, but add network latency (microseconds).
Most systems use out-of-process caches for scalability, with in-process caches as an optimization layer for extremely hot data.
1.5 Database Cache
Databases have their own caching mechanisms (query cache, buffer pool).
Database-Level Caching
Databases have built-in caching mechanisms that operate transparently. Understanding these helps you make better caching decisions at the application level.
MySQL Query Cache
Caches SELECT query results
Invalidated on table updates
Note: Disabled in MySQL 8.0 (deprecated)
PostgreSQL Shared Buffers
Caches frequently accessed pages
Reduces disk I/O significantly
Tuned based on available RAM
Best Practice: Let the database handle its own caching. Focus your application-level caching on expensive queries,
frequently accessed data, and computed results. Use read replicas for read-heavy workloads to distribute database load.
2. Cache Patterns: How to Use Caching
Different cache patterns solve different problems. Understanding when to use each pattern is critical for system design interviews.
2.1 Cache-Aside (Lazy Loading)
Cache-Aside is the most common pattern. The application is responsible for loading data into the cache.
Cache-Aside Pattern: Read and Write Flows
✓ Advantages
Simple to implement
Cache failures don't break application
Flexible cache invalidation
✗ Disadvantages
Cache miss penalty (2 round trips)
Possible stale data if invalidation fails
Application must manage cache logic
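A minimal cache-aside sketch in the same style as the patterns below, assuming a Redis client and a db helper (hypothetical names):
import json

CACHE_TTL = 3600  # 1 hour

def get_user(redis, db, user_id):
    # 1. Check the cache first
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)  # Cache hit
    # 2. On a miss, load from the database and populate the cache
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", CACHE_TTL, json.dumps(user))
    return user

def update_user(redis, db, user_id, data):
    # 3. On writes, update the database and invalidate the cached copy
    db.execute("UPDATE users SET ... WHERE id = ?", user_id, data)
    redis.delete(f"user:{user_id}")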
Real-World: Facebook's Cache-Aside
Facebook uses cache-aside extensively. Their Memcached infrastructure handles billions of requests per second using this pattern.
Reference: "Scaling Memcached at Facebook" - NSDI 2013
Cache hit rate: ~99% for hot data
Reduces database load by 10-100x
2.2 Write-Through Cache
In Write-Through, data is written to both cache and database simultaneously.
# Write-Through Pattern
Write Flow:
1. Write to cache
2. Write to database
3. Return success (after both complete)
Read Flow:
1. Check cache
2. If miss: Read from database, update cache
# Implementation:
def update_user(user_id, data):
    # Write to cache
    cache_key = f"user:{user_id}"
    redis.setex(cache_key, 3600, json.dumps(data))
    # Write to database
    db.update("UPDATE users SET ... WHERE id = ?", user_id, data)
    # Both must succeed (or use a transaction / rollback on failure)
# Advantages:
✅ Cache always consistent with database
✅ No stale data
✅ Read performance (data always in cache after write)
# Disadvantages:
❌ Higher write latency (2 writes)
❌ Cache failures affect writes
❌ Writes unnecessary data to cache (if not read soon)
Real-World: AWS ElastiCache Write-Through
AWS ElastiCache (Redis/Memcached) is commonly used with write-through for critical data that must always be consistent.
2.3 Write-Behind (Write-Back)
In Write-Behind, data is written to cache immediately, and database writes happen asynchronously.
# Write-Behind Pattern
Write Flow:
1. Write to cache immediately
2. Return success to client
3. Asynchronously write to database (background job)
Read Flow:
1. Check cache (should always hit after write)
# Implementation:
def update_user(user_id, data):
    # Write to cache immediately
    cache_key = f"user:{user_id}"
    redis.setex(cache_key, 3600, json.dumps(data))
    # Queue database write (async)
    queue.enqueue(write_to_db, user_id, data)
    return "Success"  # Return immediately

def write_to_db(user_id, data):
    # Background job writes to database
    db.update("UPDATE users SET ... WHERE id = ?", user_id, data)
# Advantages:
✅ Very fast writes (cache only)
✅ High write throughput
✅ Database writes can be batched
# Disadvantages:
❌ Risk of data loss (cache failure before DB write)
❌ Complex failure handling
❌ Eventual consistency (cache ahead of DB)
Real-World: Kafka + Write-Behind
Many systems use Kafka as a write-behind buffer. Writes go to cache and Kafka, then consumers write to the database asynchronously.
2.4 Refresh-Ahead
Refresh-Ahead proactively refreshes cache entries before they expire.
# Refresh-Ahead Pattern
Mechanism:
1. Cache entry has TTL (e.g., 1 hour)
2. Background thread checks entries
3. If entry expires in < threshold (e.g., 10 minutes):
a) Asynchronously refresh from database
b) Update cache with new data
c) Reset TTL
# Implementation:
def refresh_cache_entries():
    for key in redis.keys("user:*"):
        ttl = redis.ttl(key)
        if ttl < 600:  # Less than 10 minutes remaining
            user_id = extract_id(key)
            # Async refresh
            async_refresh(user_id)
# Advantages:
✅ Reduces cache miss penalty
✅ Smoother user experience
✅ Predictable performance
# Disadvantages:
❌ Wastes resources refreshing unused entries
❌ Complex to implement
❌ May refresh data that's never accessed
3. Cache Eviction Policies
When cache is full, which entries should be removed? Different eviction policies optimize for different use cases.
3.1 LRU (Least Recently Used)
LRU evicts the least recently accessed entry. This is the most common policy.
# LRU Implementation (Simplified)
Data Structure: Doubly Linked List + Hash Map
Operations:
- Access: Move to head (O(1))
- Evict: Remove from tail (O(1))
- Insert: Add to head (O(1))
Example:
Cache: [A, B, C] (A = most recent)
Access C: [C, A, B]
Access B: [B, C, A]
Insert D (cache full): [D, B, C] (A evicted)
# Real-World Usage:
- Redis default (with maxmemory-policy allkeys-lru)
- Memcached LRU
- CPU cache (hardware)
- Browser cache
# Advantages:
✅ Good for temporal locality (recent = likely to be used)
✅ Simple to implement
✅ Works well for most workloads
# Disadvantages:
❌ Poor for scan patterns (evicts everything)
❌ Doesn't consider access frequency
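A minimal LRU cache sketch using Python's OrderedDict, matching the linked-list-plus-hash-map idea above:
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # Insertion order doubles as recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # Mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # Evict the least recently used entry

cache = LRUCache(3)
for k in "ABC":
    cache.put(k, k)
cache.get("A")       # A is now most recently used
cache.put("D", "D")  # Cache full: evicts B, the least recently used
assert cache.get("B") is None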
3.2 LFU (Least Frequently Used)
LFU evicts the least frequently accessed entry.
# LFU Implementation
Data Structure: Frequency buckets + Hash Map
Operations:
- Access: Increment frequency, move to higher bucket
- Evict: Remove from lowest frequency bucket
Example:
Access counts: A=10, B=5, C=3, D=1
Evict: D (lowest frequency)
Access D: D=2, still lowest
Access C: C=4, now D is lowest
# Real-World Usage:
- Redis (maxmemory-policy allkeys-lfu)
- Some CDN configurations
- Content recommendation systems
# Advantages:
✅ Good for stable access patterns
✅ Preserves frequently accessed data
✅ Better for long-term caching
# Disadvantages:
❌ Poor for changing access patterns
❌ New entries evicted quickly (frequency = 0)
❌ More complex than LRU
3.3 FIFO (First In First Out)
FIFO evicts the oldest entry (by insertion time).
# FIFO Implementation
Data Structure: Queue
Operations:
- Insert: Add to tail
- Evict: Remove from head
Example:
Insert: A, B, C
Evict: A (oldest)
Insert: D
Evict: B (now oldest)
# Real-World Usage:
- Simple caching scenarios
- Network buffers
- Some database buffer pools
# Advantages:
✅ Very simple
✅ Predictable behavior
✅ Low overhead
# Disadvantages:
❌ Doesn't consider access patterns
❌ May evict frequently used data
❌ Poor hit rate for most workloads
3.4 Random Eviction
Random eviction randomly selects an entry to evict. Rarely used in practice.
3.5 TTL-Based Eviction
Entries expire after a Time-To-Live (TTL). This is often combined with other policies.
# TTL-Based Eviction
Mechanism:
- Each entry has expiration time
- Background thread periodically checks
- Removes expired entries
# Implementation (Redis):
redis.setex("user:123", 3600, data) # Expires in 1 hour
redis.expire("user:123", 3600) # Set TTL on existing key
# Real-World Usage:
- Session data (expires after inactivity)
- API rate limiting (reset after window)
- Temporary data (one-time tokens)
# Advantages:
✅ Automatic cleanup
✅ Prevents stale data
✅ Simple to understand
# Disadvantages:
❌ May evict data that's still needed
❌ Requires background cleanup
❌ TTL tuning can be difficult
4. Distributed Caching: Redis and Memcached
For large-scale systems, distributed caching is essential. Understanding Redis and Memcached architectures is critical for interviews.
4.1 Redis Architecture Deep Dive
Redis is an in-memory data structure store. It's more than a cache - it's a data structure server.
Redis Data Structures
# Redis Data Types
1. Strings:
SET user:123 "John Doe"
GET user:123
- Use case: Simple key-value caching
2. Hashes:
HSET user:123 name "John" age 30
HGET user:123 name
- Use case: Object caching (user profiles)
3. Lists:
LPUSH queue:emails "email1"
RPOP queue:emails
- Use case: Queues, recent items
4. Sets:
SADD tags:post:123 "tech" "programming"
SMEMBERS tags:post:123
- Use case: Tags, unique items
5. Sorted Sets:
ZADD leaderboard 100 "player1"
ZREVRANGE leaderboard 0 9
- Use case: Leaderboards, rankings
6. Bitmaps:
SETBIT user:123:online 1
GETBIT user:123:online
- Use case: Feature flags, analytics
7. HyperLogLog:
PFADD visitors:2024 "user1" "user2"
PFCOUNT visitors:2024
- Use case: Unique count estimation
8. Streams:
XADD events * user_id 123 action "login"
XREAD STREAMS events 0
- Use case: Event streaming, logs
Redis Persistence
# Redis Persistence Options
1. RDB (Redis Database Backup):
- Periodic snapshots
- Point-in-time backups
- Fast recovery
- May lose recent data (between snapshots)
2. AOF (Append-Only File):
- Logs every write operation
- More durable (can lose max 1 second)
- Slower recovery (replay log)
- Larger file size
3. RDB + AOF (Hybrid):
- Best of both worlds
- RDB for fast recovery
- AOF for durability
- Redis 4.0+ default
# Configuration:
save 900 1 # Save if 1 key changed in 900 seconds
save 300 10 # Save if 10 keys changed in 300 seconds
appendonly yes
appendfsync everysec
Redis Cluster Architecture
# Redis Cluster (Sharding)
Architecture:
- 16384 hash slots
- Each key hashes to a slot
- Each node handles subset of slots
- Replication: Each master has 1+ replicas
Example:
Node 1: Slots 0-5460 (Master) + Replica
Node 2: Slots 5461-10922 (Master) + Replica
Node 3: Slots 10923-16383 (Master) + Replica
Key Routing:
Key "user:123" → CRC16("user:123") % 16384 = Slot 7000
Slot 7000 → Node 2
Request routed to Node 2
# Failover:
- If Node 2 master fails, replica promoted
- Cluster continues operating
- Automatic failover (Redis Sentinel or Cluster mode)
# Real-World: Twitter
- Uses Redis Cluster extensively
- Handles millions of operations/second
- Reference: "Scaling Redis at Twitter" - RedisConf 2014
AWS ElastiCache for Redis
# AWS ElastiCache Redis Features
1. Managed Service:
- Automatic backups
- Automatic failover
- Monitoring and alerts
- Multi-AZ support
2. Cluster Mode:
- Up to 500 nodes
- Automatic sharding
- Horizontal scaling
3. Performance:
- Sub-millisecond latency
- Millions of operations/second
- In-memory performance
4. Use Cases:
- Session storage
- Real-time analytics
- Leaderboards
- Rate limiting
# Configuration Options:
- Single-AZ: Lower latency, single point of failure
- Multi-AZ: Higher availability, automatic failover
- Cluster mode: Horizontal scaling, automatic sharding
# Reference:
AWS ElastiCache Documentation
https://aws.amazon.com/elasticache/
4.2 Memcached Architecture
Memcached is a simple, high-performance distributed memory caching system.
# Memcached vs Redis
Memcached:
✅ Simpler (just key-value)
✅ Faster for simple operations
✅ Lower memory overhead
✅ Better for horizontal scaling
❌ No persistence
❌ No data structures (just strings)
❌ No replication (client-side sharding)
Redis:
✅ Rich data structures
✅ Persistence options
✅ Built-in replication
✅ More features (pub/sub, Lua scripting)
❌ More complex
❌ Higher memory overhead
❌ Slower for simple operations
# When to Use Each:
Memcached:
- Simple key-value caching
- High throughput needed
- Data can be lost (cache only)
- Example: Facebook, Wikipedia
Redis:
- Need data structures
- Need persistence
- Need advanced features
- Example: Twitter, GitHub, Stack Overflow
Real-World: Facebook's Memcached
Facebook operates one of the largest Memcached deployments in the world:
Scale: Thousands of servers, petabytes of memory
Traffic: Billions of requests per second
Hit Rate: ~99% for hot data
Reference: "Scaling Memcached at Facebook" - NSDI 2013
5. Cache Invalidation Strategies
Cache invalidation is one of the hardest problems in computer science. Understanding different strategies is essential.
5.1 Time-Based Invalidation (TTL)
# TTL-Based Invalidation
Mechanism:
- Set expiration time on cache entry
- Entry automatically expires
- Next access triggers refresh
# Implementation:
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Cache expired or miss
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))  # 1 hour TTL
    return user
# Advantages:
✅ Simple
✅ Automatic cleanup
✅ Predictable staleness
# Disadvantages:
❌ May serve stale data (up to TTL duration)
❌ TTL tuning difficult
❌ Cache stampede if many requests after expiration
5.2 Event-Based Invalidation
# Event-Based Invalidation
Mechanism:
- Database changes trigger cache invalidation
- Real-time or near-real-time
- Ensures cache consistency
# Implementation (Database Triggers - illustrative pseudocode; in practice the trigger writes to an outbox table or you use CDC, since SQL triggers cannot publish to Redis directly):
CREATE TRIGGER invalidate_user_cache
AFTER UPDATE ON users
FOR EACH ROW
BEGIN
-- Publish to message queue
PUBLISH cache:invalidate "user:{NEW.id}"
END;
# Application Listens:
redis.subscribe("cache:invalidate", function(channel, key) {
redis.delete(key)
})
# Implementation (Change Data Capture):
- Use CDC tool (Debezium, AWS DMS)
- Capture database changes
- Publish to Kafka
- Cache service consumes and invalidates
# Advantages:
✅ Immediate invalidation
✅ No stale data
✅ Cache always consistent
# Disadvantages:
❌ Complex to implement
❌ Requires infrastructure (message queue, CDC)
❌ Higher latency (event processing)
Real-World: Netflix Cache Invalidation
Netflix uses event-driven cache invalidation:
Content updates trigger invalidation events
Events flow through Kafka
Cache services (EVCache) invalidate based on events
Reference: "EVCache: Distributed In-Memory Caching" - Netflix Tech Blog
5.3 Version-Based Invalidation
# Version-Based Invalidation
Mechanism:
- Each cache entry has a version
- Database changes increment version
- Cache compares versions
# Implementation:
def get_user(user_id):
    # Get from cache with version
    cached = redis.hgetall(f"user:{user_id}")
    if cached:
        cached_version = cached['version']
        # Check if version changed
        db_version = db.query("SELECT version FROM users WHERE id = ?", user_id)
        if cached_version == db_version:
            return json.loads(cached['data'])
    # Version mismatch or miss - refresh
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.hset(f"user:{user_id}", mapping={
        'data': json.dumps(user),
        'version': user['version']
    })
    return user
# Advantages:
✅ Efficient (only refresh if changed)
✅ No TTL needed
✅ Always fresh
# Disadvantages:
❌ Requires version column in database
❌ Extra database query for version check
❌ More complex logic
5.4 Cache Stampede Prevention
Cache stampede occurs when many requests try to refresh the same expired cache entry simultaneously.
# Cache Stampede Problem
Scenario:
- Cache entry expires
- 1000 requests arrive simultaneously
- All 1000 miss cache
- All 1000 query database
- Database overloaded!
# Solutions:
1. Probabilistic Early Expiration:
- Expire entry slightly before TTL
- Randomize expiration time
- Spread refresh requests over time
2. Lock-Based Refresh:
- First request acquires lock
- Other requests wait or get stale data
- Only one database query
3. Background Refresh:
- Refresh before expiration
- Always serve from cache
- No user-facing latency
# Implementation (Lock-Based):
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Try to acquire lock
    lock_acquired = redis.set(f"user:{user_id}:lock", "1", ex=10, nx=True)
    if lock_acquired:
        # This request refreshes the cache
        user = db.query("SELECT * FROM users WHERE id = ?", user_id)
        redis.setex(f"user:{user_id}", 3600, json.dumps(user))
        redis.delete(f"user:{user_id}:lock")
        return user
    else:
        # Another request is refreshing; wait briefly, then retry the cache
        time.sleep(0.1)
        cached = redis.get(f"user:{user_id}")
        if cached:
            return json.loads(cached)
        # Fallback to database if still not cached
        return db.query("SELECT * FROM users WHERE id = ?", user_id)
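And a minimal sketch of probabilistic early expiration (solution 1 in the list above), assuming the same hypothetical redis and db helpers; the log-of-random weighting follows a commonly used formulation sometimes called "XFetch":
import json
import math
import random

CACHE_TTL = 3600  # seconds
DELTA = 1.0       # Rough cost (seconds) of recomputing the value
BETA = 1.0        # Higher beta = refresh earlier / more aggressively

def get_user(redis, db, user_id):
    key = f"user:{user_id}"
    cached = redis.get(key)
    ttl_remaining = redis.ttl(key)  # Seconds until the entry actually expires
    # Serve the cached value unless this request "wins" the early-refresh lottery;
    # log(random()) is negative, so the closer to expiry, the more likely a refresh.
    if cached and ttl_remaining + DELTA * BETA * math.log(max(random.random(), 1e-12)) > 0:
        return json.loads(cached)
    # Miss, or this request was chosen to refresh early
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(key, CACHE_TTL, json.dumps(user))
    return user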
6. Cache Coherency and Consistency
Maintaining consistency across multiple cache layers and distributed caches is challenging.
6.1 Cache Coherency Models
# Cache Coherency Models
1. Strong Consistency:
- All caches see updates immediately
- Requires synchronization
- High latency, low throughput
- Use case: Financial data
2. Eventual Consistency:
- Caches eventually converge
- Lower latency, higher throughput
- May serve stale data temporarily
- Use case: Most web applications
3. Weak Consistency:
- No guarantees about when updates appear
- Highest performance
- May never converge
- Use case: Analytics, non-critical data
# Real-World: Google's Cache Coherency
- Uses eventual consistency for most caches
- Strong consistency only for critical operations
- Reference: "Large-scale Incremental Processing" - Google Research
6.2 Distributed Cache Consistency
# Handling Consistency in Distributed Caches
Problem:
- Multiple cache nodes
- Updates must propagate
- Network partitions possible
Solutions:
1. Write-Through to All Nodes:
- Write to all cache nodes
- Ensures consistency
- High latency (wait for all)
2. Write-Through to Primary:
- Write to primary node
- Replicate to others asynchronously
- Lower latency, eventual consistency
3. Invalidation Messages:
- Invalidate on all nodes
- Next read refreshes from database
- Ensures eventual consistency
# Real-World: Redis Cluster
- Writes go to master node
- Replicated to replicas asynchronously
- Reads can go to any node (may be stale)
- Strong consistency option: Read from master
Common Mistakes to Avoid
❌ Caching everything: Only cache expensive operations or frequently accessed data
❌ Ignoring cache invalidation: Stale data can cause serious bugs
❌ Not handling cache failures: Application should work without cache
❌ Wrong eviction policy: Choose based on access patterns
❌ Cache stampede: Implement lock-based or probabilistic refresh
❌ Not monitoring cache metrics: Hit rate, latency, memory usage
How to Discuss Caching in Interviews
Caching is one of the most common topics in system design interviews. Here's how to approach it effectively.
Step-by-Step Interview Approach
1. Identify What to Cache
# Questions to Ask Yourself
"What should I cache?"
- Frequently accessed data
- Expensive computations (database queries, API calls)
- Static or semi-static content
- User sessions
"What should I NOT cache?"
- Frequently changing data
- User-specific sensitive data (unless encrypted)
- Data larger than cache capacity
- Rarely accessed data
2. Choose Cache Pattern
# Decision Framework
Cache-Aside (Most Common):
✅ Simple, flexible
✅ Cache failures don't break app
✅ Good for read-heavy workloads
Use when: General purpose caching
Write-Through:
✅ Cache always consistent
✅ Good for write-heavy workloads
❌ Higher write latency
Use when: Strong consistency needed
Write-Behind:
✅ Fastest writes
✅ High throughput
❌ Risk of data loss
Use when: Write performance critical, can tolerate eventual consistency
3. Discuss Cache Invalidation
# Critical Interview Point
"Cache invalidation is hard. I'd use:
1. TTL-based: Simple, automatic, but may serve stale data
2. Event-driven: Real-time, but complex infrastructure needed
3. Version-based: Efficient, but requires version tracking
For most cases, I'd combine TTL with event-driven invalidation
for critical data."
Common Interview Questions
Question 1: "How would you implement caching for a news website?"
# Strong Answer Structure
1. What to cache:
"Articles and images are read-heavy and rarely change after publishing.
I'd cache rendered articles at the CDN and in Redis using cache-aside, with a short TTL on the homepage."
2. Invalidation:
"TTL-based expiration for most content, plus event-driven invalidation when an article is updated or corrected."
Question 2: "What happens if your cache goes down?"
# Critical Interview Question
1. Application should still work:
"Cache is an optimization, not a requirement.
Application should degrade gracefully."
2. Fallback strategies:
- Direct database query (slower but works)
- Circuit breaker pattern (prevent cascade failures)
- Stale cache serving (if acceptable)
3. Monitoring:
"Monitor cache hit rate, latency.
Alert if hit rate drops significantly."
Question 3: "How do you prevent cache stampede?"
# Common Follow-up
1. Problem:
"When cache expires, many requests try to refresh
simultaneously, overwhelming database."
2. Solutions:
- Lock-based: First request acquires lock, others wait
- Probabilistic early expiration: Expire slightly before TTL
- Background refresh: Refresh before expiration
- Stale-while-revalidate: Serve stale, refresh in background
3. Implementation:
"I'd use distributed lock (Redis) to ensure only one
request refreshes cache at a time."
Key Points to Emphasize
✅ Cache-aside is most common: Simple, flexible, handles failures well
✅ Cache invalidation is hard: Always discuss strategies
✅ Handle cache failures: Application must work without cache
✅ Monitor cache metrics: Hit rate, latency, memory usage