Load Balancer — Notes
Functional requirements
- Distribute incoming traffic across N backends.
- Detect and bypass unhealthy backends within seconds.
- Support multiple algorithms (RR, WRR, least-conn, consistent hash).
- Terminate TLS (optional) and route by host/path (L7).
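
The four algorithms named above can be sketched as follows. This is a minimal illustration, not a production data path: the class and function names (`round_robin`, `SmoothWeightedRR`, `least_conn`, `HashRing`) and the `vnodes` parameter are hypothetical. The weighted variant uses the "smooth" weighted round-robin scheme, which is one common choice among several.

```python
import bisect
import hashlib
import itertools


def round_robin(backends):
    """RR: yield backends in a fixed repeating order."""
    return itertools.cycle(backends)


class SmoothWeightedRR:
    """WRR (smooth variant): on each pick, add every backend's weight to its
    running score, choose the highest score, then subtract the total weight
    from the winner so heavy backends are interleaved rather than bursted."""

    def __init__(self, weights):
        self.weights = dict(weights)              # backend -> static weight
        self.current = {b: 0 for b in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        for backend, weight in self.weights.items():
            self.current[backend] += weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= self.total
        return best


def least_conn(active):
    """Least-conn: pick the backend with the fewest in-flight connections.
    `active` maps backend -> current connection count."""
    return min(active, key=active.get)


class HashRing:
    """Consistent hash: each backend owns `vnodes` points on a ring; a key is
    served by the first point at or after its hash, wrapping around."""

    def __init__(self, backends, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key):
        idx = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

With weights 5/1/1, seven consecutive `SmoothWeightedRR.pick()` calls return the heavy backend five times and each light backend once. Removing a backend from `HashRing` remaps only the keys that backend owned, which is why consistent hashing suits caches and sticky L4 flows.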
Non-functional requirements
- Throughput: 1M+ RPS per LB pair (commodity NIC).
- Latency overhead: < 1 ms p99 added by LB.
- Availability: 99.99%+; no single LB node may be a SPOF.
- Horizontal scale: ECMP + multiple LB nodes.
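
To make the availability target concrete, the arithmetic below converts a percentage into a yearly downtime budget and shows why redundant LB nodes are needed. It is a rough sketch that assumes independent node failures and instantaneous failover, neither of which holds exactly in practice. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Yearly downtime budget implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel_availability(node_availability, nodes):
    """Availability of `nodes` redundant LBs where any one can carry traffic.
    Assumes independent failures and instant failover (optimistic)."""
    return 1 - (1 - node_availability) ** nodes


budget = downtime_minutes_per_year(0.9999)   # about 52.6 minutes per year
pair = parallel_availability(0.999, 2)       # about 0.999999
```

Under these assumptions, two nodes that are each only 99.9% available already exceed the 99.99% target as a pair; real deployments fall short of this figure by the failover time and by correlated failures.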
Capacity estimation (example)
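- 100k RPS, 1 KB request, 10 KB response → ~8 Gbps payload egress (plan for ~10 Gbps with header and TCP/TLS overhead), ~0.8 Gbps ingress.
- Connections: 100k new conn/s × 0.5 s avg connection lifetime = 50k concurrent (Little's law, assuming one request per connection).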
- File descriptors per LB box: 200k+ (tune ulimit, ephemeral ports).
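
The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes 1 KB = 1,000 bytes, one request per connection, and a proxying LB that holds two sockets per connection (client side and backend side); those assumptions and the helper name `gbps` are not from the notes.

```python
def gbps(rps, payload_bytes):
    """Payload bandwidth in gigabits per second (no protocol overhead)."""
    return rps * payload_bytes * 8 / 1e9


rps = 100_000
egress_gbps = gbps(rps, 10_000)    # responses: 8.0 Gbps of payload
ingress_gbps = gbps(rps, 1_000)    # requests: 0.8 Gbps of payload

# Little's law: concurrent connections = arrival rate x average lifetime.
avg_lifetime_s = 0.5
concurrent = int(rps * avg_lifetime_s)   # 50,000

# Assumption: a proxying LB uses one client socket and one backend socket
# per connection, so the descriptor count is about twice the connections.
fds_needed = concurrent * 2              # 100,000
```

The payload alone comes to 8 Gbps, which rounds up to roughly 10 Gbps once headers and TCP/TLS overhead are included. The 200k+ descriptor figure leaves about 2× headroom over the 100k sockets computed here.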
API surface
- Control plane: add_backend(host, weight), drain(host), set_health(host).
- Data plane: transparent — clients send to VIP, LB forwards.
- xDS (Envoy) for dynamic config push.
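
A minimal in-memory sketch of the control-plane calls listed above. The notes give only the call names, so the details here are assumptions: `set_health` takes an extra `healthy` flag, draining backends ignore health updates, and the `eligible()` helper and `State` enum are hypothetical additions for illustration.

```python
from enum import Enum


class State(Enum):
    UP = "UP"
    DOWN = "DOWN"
    DRAIN = "DRAIN"


class ControlPlane:
    """Tracks backend membership and state; the data plane would read eligible()."""

    def __init__(self):
        self.backends = {}  # host -> {"weight": int, "state": State}

    def add_backend(self, host, weight=1):
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.backends[host] = {"weight": weight, "state": State.UP}

    def drain(self, host):
        # Stop sending new connections; existing ones are left to finish.
        self.backends[host]["state"] = State.DRAIN

    def set_health(self, host, healthy):
        # Assumption: an operator-initiated drain is not overridden by probes.
        backend = self.backends[host]
        if backend["state"] is not State.DRAIN:
            backend["state"] = State.UP if healthy else State.DOWN

    def eligible(self):
        """Hosts that may receive new connections."""
        return [h for h, b in self.backends.items() if b["state"] is State.UP]
```

For example, after `add_backend("10.0.0.1", 2)`, `add_backend("10.0.0.2", 1)` and `drain("10.0.0.2")`, `eligible()` returns only `10.0.0.1`, and a later `set_health("10.0.0.2", True)` does not bring the drained host back.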
Data model
Pool{ id, algo, hc_config }
Backend{ id, pool_id, addr, weight, state(UP/DOWN/DRAIN) }
Listener{ vip, port, tls_cert, route_rules[] }
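
The three records above map naturally onto dataclasses. The sketch below also shows one way `hc_config` could drive the UP/DOWN transitions, using consecutive-probe thresholds. The fields `interval_s`, `rise`, `fall`, and `streak`, and the function `record_probe`, are hypothetical; the notes do not specify what `hc_config` contains.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    UP = "UP"
    DOWN = "DOWN"
    DRAIN = "DRAIN"


@dataclass
class HealthCheckConfig:
    interval_s: float = 2.0  # seconds between probes (assumed default)
    rise: int = 2            # consecutive successes needed to mark UP
    fall: int = 3            # consecutive failures needed to mark DOWN


@dataclass
class Pool:
    id: str
    algo: str
    hc_config: HealthCheckConfig = field(default_factory=HealthCheckConfig)


@dataclass
class Backend:
    id: str
    pool_id: str
    addr: str
    weight: int = 1
    state: State = State.UP
    streak: int = 0  # consecutive probe results that contradict `state`


@dataclass
class Listener:
    vip: str
    port: int
    tls_cert: Optional[str] = None
    route_rules: list = field(default_factory=list)


def record_probe(backend, ok, hc):
    """Apply one probe result; flip state only after rise/fall consecutive
    contradicting results, so a single lost probe does not cause flapping."""
    if backend.state is State.DRAIN:
        return
    if ok == (backend.state is State.UP):
        backend.streak = 0
        return
    backend.streak += 1
    if backend.streak >= (hc.rise if ok else hc.fall):
        backend.state = State.UP if ok else State.DOWN
        backend.streak = 0
```

With the assumed defaults, a failed backend is marked DOWN after `fall × interval_s` = 6 s at worst, which fits the "within seconds" detection requirement; lowering either value trades faster detection for more false positives.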
Trade-offs
- L4 = cheapest, fastest, opaque to app; L7 = richer features, ~2–5× CPU.
- DNS LB = simple but slow failover (TTL); GSLB needed for multi-region.
- Sticky sessions simplify legacy apps but pin load to specific backends; prefer stateless sessions (e.g., JWT).
- Active-active anycast scales out but requires BGP; VRRP active-passive is easier ops.
- TLS termination at the edge offloads crypto CPU from backends but exposes plaintext inside the DC; mTLS to backends closes that gap at the cost of extra CPU and certificate management.
Real-world refs
- Google Maglev (consistent hash + ECMP), Facebook Katran (XDP/eBPF L4), AWS NLB (L4) / ALB (L7), Cloudflare Unimog, Envoy + Istio.