HLD
Multi-Region & DR — Detailed
Patterns
flowchart TB
subgraph Cold[Cold standby]
C1[Primary serves all]
C2[(Backups only)]
C3[RTO: hours-days, RPO: minutes-hours]
end
subgraph Pilot[Pilot light]
P1[Primary serves all]
P2[Standby idle, DB replicated]
P3[RTO: tens of minutes, RPO: minutes]
end
subgraph Warm[Warm standby]
W1[Primary serves all]
W2[Standby running, scaled down]
W3[RTO: minutes, RPO: seconds]
end
subgraph ActPas[Active-passive]
AP1[Primary serves all]
AP2[Standby at full size]
AP3[RTO: seconds-1m, RPO: seconds]
end
subgraph ActAct[Active-active]
AA1[Both regions serve]
AA2[Conflict resolution]
AA3[RTO ≈ 0, RPO ≈ 0 with synchronous replication]
end
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class C1,C3,P1,P3,W1,W2,W3,AP1,AP2,AP3,AA1,AA2,AA3 service;
class C2,P2 datastore;
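The RTO/RPO bands in the diagram can be read as a lookup table ordered from cheapest to most expensive. A minimal Python sketch, assuming each band is collapsed to a single worst-case number: the numeric bounds and the name `cheapest_pattern` are illustrative readings of the diagram, not figures from any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    name: str
    worst_rto_s: float  # assumed upper bound of the RTO band, in seconds
    worst_rpo_s: float  # assumed upper bound of the RPO band, in seconds


# Ordered cheapest first. All bounds are illustrative assumptions.
PATTERNS = [
    Pattern("cold standby", 2 * 86400, 12 * 3600),  # hours-days / minutes-hours
    Pattern("pilot light", 3600, 15 * 60),          # tens of minutes / minutes
    Pattern("warm standby", 10 * 60, 60),           # minutes / seconds
    Pattern("active-passive", 60, 30),              # seconds-1m / seconds
    Pattern("active-active", 1, 1),                 # approximately zero for both
]


def cheapest_pattern(rto_target_s: float, rpo_target_s: float) -> Optional[Pattern]:
    """Return the first (cheapest) pattern whose worst case meets both targets."""
    for pattern in PATTERNS:
        if pattern.worst_rto_s <= rto_target_s and pattern.worst_rpo_s <= rpo_target_s:
            return pattern
    return None  # no listed pattern satisfies the targets
```

With these assumed bounds, a target of RTO 4 hours and RPO 1 hour selects pilot light, while RTO 5 minutes with RPO 10 seconds is only met by active-active.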
Concepts
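| Term | Meaning |
|---|---|
| RTO | Recovery time objective: the maximum acceptable time from failure until traffic flows again |
| RPO | Recovery point objective: the maximum acceptable window of data loss, measured back from the failure to the last recoverable data |
| MTBF | Mean time between failures |
| MTTR | Mean time to repair |
| Failover | Switch traffic to the standby on failure |
| Failback | Move traffic back once the primary is healthy |
| Failover region | Where traffic goes when the primary dies |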
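MTBF and MTTR combine into the usual steady-state availability estimate, availability = MTBF / (MTBF + MTTR), and RTO/RPO act as pass/fail thresholds for a given incident. A small Python sketch of both follows; the function names are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def meets_objectives(outage_s: float, data_loss_window_s: float,
                     rto_s: float, rpo_s: float) -> bool:
    """True if an incident stayed within both the RTO and the RPO."""
    return outage_s <= rto_s and data_loss_window_s <= rpo_s
```

For example, an MTBF of 999 hours with an MTTR of 1 hour gives 0.999 ("three nines"). An incident with a 4-minute outage and 90 seconds of lost writes passes a 5-minute RTO but fails a 60-second RPO.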
Traffic steering
flowchart LR
DNS[Geo / latency DNS<br/>or anycast]
HC[Health checks per region]
R1[Region A]
R2[Region B]
R3[Region C]
DNS --> HC
HC -.->|A unhealthy| DNS
DNS --> R1
DNS --> R2
DNS --> R3
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class DNS edge;
class HC,R1,R2,R3 service;
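GeoDNS: routes by client geography (in practice the resolver's location, or the client subnet when EDNS Client Subnet is passed). Keep the TTL short (30-60s) so a failover takes effect quickly.
Anycast: the same IP is advertised from multiple POPs; BGP routes each client to the topologically nearest one, which is usually but not always the geographically closest.
Global load balancer: managed offerings such as Cloudflare, AWS Global Accelerator, GCP GLB.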
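The steering decision in the diagram amounts to "lowest latency among healthy regions". A hedged Python sketch follows, with a rough estimate of how long clients may keep hitting a dead region under DNS-based steering. Both function names and the simple additive estimate are assumptions made for illustration; real resolvers may also ignore TTLs.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest measured latency, if any."""
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None  # every region failed its health check
    return min(candidates, key=lambda region: latency_ms[region])


def worst_case_switch_s(check_interval_s: float, failures_to_trip: int,
                        dns_ttl_s: float) -> float:
    """Rough upper bound: time to detect the failure plus one full DNS TTL."""
    return check_interval_s * failures_to_trip + dns_ttl_s
```

With health checks every 10s, three consecutive failures to trip, and a 60s TTL, the estimate is 90s. This is why GeoDNS failover is usually measured in minutes, while anycast or a global load balancer can move traffic without waiting for caches to expire.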
Data layer
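| Layer | Active-active option | Trade-off |
|---|---|---|
| Stateless services | Trivial: deploy in every region | None |
| Cache | Per-region (Redis cluster) or hierarchical | Cross-region invalidation |
| OLTP DB | Spanner, CockroachDB, DynamoDB global tables (multi-region writes); Aurora Global Database (single write region, read replicas elsewhere, optional write forwarding) | Latency on cross-region writes |
| Object store | S3 Cross-Region Replication | Eventually consistent; pay for replication |
| Event bus | MirrorMaker (Kafka), cross-region Pub/Sub | Duplicates possible; dedupe at consumer |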
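"Dedupe at consumer" for a mirrored event bus can be sketched as an idempotent consumer that remembers recently seen message IDs. This Python sketch assumes the producer attaches a stable `message_id` that survives cross-region mirroring. The class name and the bounded in-memory window are illustrative; a production system would more likely keep the seen-set in a shared store.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class DedupingConsumer:
    """Processes each message ID at most once within a bounded window."""

    def __init__(self, handler: Callable[[object], None], max_ids: int = 10_000):
        self._handler = handler
        self._seen: "OrderedDict[Hashable, None]" = OrderedDict()
        self._max_ids = max_ids

    def consume(self, message_id: Hashable, payload: object) -> bool:
        """Return True if the payload was handled, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True
```

The window size is the trade-off: a duplicate that arrives after its ID has been evicted is processed again, so the window should cover the longest replay the mirror can produce.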
Cross-region write strategies
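Region-pinned writes: each user writes to a "home region" and reads from any region. Easiest, because every key has a single writer.
Last-writer-wins: eventually consistent; on concurrent conflicting writes the one with the lower timestamp is silently discarded, so data loss is possible.
CRDTs / per-key conflict resolution: strong eventual consistency for mergeable types such as counters and sets.
Globally consistent (Spanner / CockroachDB): consensus over the WAN (Paxos with TrueTime in Spanner, Raft with hybrid logical clocks in CockroachDB). Every commit pays at least one cross-region quorum round trip, roughly ~150ms when replicas span continents and less for nearby regions.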
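The difference between last-writer-wins and a CRDT is easiest to see side by side. Below is a minimal Python sketch of an LWW register and a grow-only counter (G-Counter). The class names, and the use of a `(timestamp, region)` tuple as the LWW tie-breaker, are illustrative choices rather than any particular database's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LWWRegister:
    """Keeps only the write with the highest (timestamp, region) stamp."""
    value: object = None
    stamp: Tuple[float, str] = (0.0, "")

    def write(self, value: object, ts: float, region: str) -> None:
        if (ts, region) > self.stamp:
            self.value, self.stamp = value, (ts, region)

    def merge(self, other: "LWWRegister") -> None:
        # The losing write is dropped entirely: this is where data loss happens.
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged with element-wise max."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, n: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> None:
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())
```

If two regions write different values to an `LWWRegister` at nearly the same time, both replicas converge on the later stamp after merging and the other write is gone. If the same two regions each increment a `GCounter`, merging in either order yields the sum of both, because merge is commutative, associative, and idempotent.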
Game days
Practice failover quarterly.
Tag every infra resource with failover_role.
Document the runbook; chaos-test it.
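The tagging rule is easy to enforce mechanically before a game day. A small Python sketch, assuming resources have already been exported as dictionaries with an `id` and a `tags` map: the set of allowed role values is hypothetical, since the text names only the `failover_role` tag key.

```python
from typing import Dict, Iterable, List

# Hypothetical role values; only the tag key comes from the checklist above.
ALLOWED_ROLES = {"primary", "standby", "shared"}


def missing_failover_role(resources: Iterable[Dict]) -> List[str]:
    """Return the IDs of resources with no valid failover_role tag."""
    return [
        resource["id"]
        for resource in resources
        if resource.get("tags", {}).get("failover_role") not in ALLOWED_ROLES
    ]
```

Running this in CI against the infrastructure inventory turns "tag every resource" from a convention into a failing check.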
DR levels (AWS Well-Architected)
flowchart LR
L1[Backup & Restore]
L2[Pilot Light]
L3[Warm Standby]
L4[Active-Active / Multi-Site]
L1 --> L2 --> L3 --> L4
L1 -. cheaper, slower .- L1
L4 -. more expensive, faster .- L4
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class L2,L3,L4 service;
class L1 datastore;
Common interview hooks
"What's the latency cost of active-active for writes?" → cross-region RTT; bounded by physics.
"How do you handle stateful workloads in failover?" → drain, replicate, promote, fence (STONITH).
"Active-active how to avoid split-brain?" → leases, fencing tokens, quorum across regions.
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Load balancer / GSLB | L4/L7 traffic distribution and failover | load-balancer |
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | CAP / PACELC | C vs A under partition; L vs C otherwise | cap-pacelc |
| HLD | Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | Logical clocks | Lamport, vector clocks, HLC, TrueTime | logical-clocks |
| HLD | Multi-region & DR | RTO / RPO, active-active, failover | multi-region-dr |