
Multi-Region & DR: Detailed

Patterns

flowchart TB
  subgraph Cold[Cold standby]
    C1[Primary serves all]
    C2[(Backups only)]
    C3[RTO: hours-days, RPO: minutes-hours]
  end
  subgraph Pilot[Pilot light]
    P1[Primary serves all]
    P2[Standby idle, DB replicated]
    P3[RTO: tens of minutes, RPO: minutes]
  end
  subgraph Warm[Warm standby]
    W1[Primary serves all]
    W2[Standby running, scaled down]
    W3[RTO: minutes, RPO: seconds]
  end
  subgraph ActPas[Active-passive]
    AP1[Primary serves all]
    AP2[Standby at full size]
    AP3[RTO: seconds-1m, RPO: seconds]
  end
  subgraph ActAct[Active-active]
    AA1[Both regions serve]
    AA2[Conflict resolution]
    AA3[RTO ≈ 0, RPO ≈ 0]
  end

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class C1,C3,P1,P3,W1,W2,W3,AP1,AP2,AP3,AA1,AA2,AA3 service;
    class C2,P2 datastore;
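
The diagram above orders the patterns from cheapest and slowest to most expensive and fastest. A minimal sketch of how that ordering can drive a choice: pick the first pattern whose worst case still meets the RTO and RPO targets. The numeric bounds below are illustrative assumptions chosen to fall inside the ranges in the diagram, not measurements, and `cheapest_pattern` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    worst_rto_s: float  # assumed upper bound on recovery time, in seconds
    worst_rpo_s: float  # assumed upper bound on the data-loss window, in seconds


# Ordered cheapest first. All numbers are assumptions for illustration only.
PATTERNS = [
    Pattern("cold standby", 2 * 86400, 6 * 3600),
    Pattern("pilot light", 30 * 60, 10 * 60),
    Pattern("warm standby", 5 * 60, 30),
    Pattern("active-passive", 60, 10),
    Pattern("active-active", 1, 1),
]


def cheapest_pattern(rto_target_s, rpo_target_s):
    """Return the first (cheapest) pattern that meets both targets, or None."""
    for pattern in PATTERNS:
        if pattern.worst_rto_s <= rto_target_s and pattern.worst_rpo_s <= rpo_target_s:
            return pattern
    return None


# A service that tolerates 15 minutes of downtime and 1 minute of lost data.
print(cheapest_pattern(15 * 60, 60).name)  # prints "warm standby"
```

Returning `None` when nothing fits is deliberate: targets tighter than active-active can deliver usually mean the requirement needs revisiting rather than a bigger bill.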

Concepts

| Term | Meaning |
| --- | --- |
| RTO | Recovery time objective: the maximum acceptable time until traffic flows again |
| RPO | Recovery point objective: the maximum acceptable window of lost data, i.e. how old the last good data may be |
| MTBF | Mean time between failures |
| MTTR | Mean time to repair |
| Failover | Switch traffic to the standby on failure |
| Failback | Move traffic back once the primary is healthy |
| Failover region | Where traffic goes when the primary dies |
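
MTBF and MTTR combine into steady-state availability as MTBF / (MTBF + MTTR), which shows why a faster failover (lower MTTR) buys more than a rarer failure. A small worked sketch, with made-up input numbers:

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by an availability fraction."""
    return (1 - avail) * MINUTES_PER_YEAR


# Hypothetical service that fails about every 1000 hours.
# Compare a 1 hour manual recovery with a 6 minute automated failover.
for mttr in (1.0, 0.1):
    a = availability(1000, mttr)
    print(f"MTTR={mttr}h availability={a:.5f} "
          f"downtime={downtime_minutes_per_year(a):.0f} min/year")
```

With these inputs, cutting MTTR by 10x cuts expected downtime by roughly 10x while the failure rate stays the same.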

Traffic steering

flowchart LR
  DNS[Geo / latency DNS<br/>or anycast]
  HC[Health checks per region]
  R1[Region A]
  R2[Region B]
  R3[Region C]
  DNS --> HC
  HC -.->|A unhealthy| DNS
  DNS --> R1
  DNS --> R2
  DNS --> R3

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class DNS edge;
    class HC,R1,R2,R3 service;
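
  • GeoDNS: short TTL (30-60s); routes by the location of the client's resolver (or of the client subnet, when the resolver passes EDNS Client Subnet).
  • Anycast: the same IP is advertised from multiple POPs; BGP picks the closest one by network path, which is not always the geographically nearest.
  • Global load balancer: Cloudflare, AWS Global Accelerator, GCP GLB.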
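
Whichever steering mechanism is used, the decision is the same: drop regions that fail health checks, then prefer the closest remaining one. A minimal sketch of that logic follows. The `Region` type, the three-strikes threshold, and the fail-open behaviour are assumptions for illustration, not the behaviour of any particular DNS or load-balancer product.

```python
from dataclasses import dataclass

UNHEALTHY_AFTER = 3  # assumed: consecutive failed probes before a region is pulled


@dataclass
class Region:
    name: str
    latency_ms: float  # assumed input: measured client-to-region latency
    consecutive_failures: int = 0


def record_probe(region, ok):
    """Update a region's failure streak from one health-check result."""
    region.consecutive_failures = 0 if ok else region.consecutive_failures + 1


def is_healthy(region):
    return region.consecutive_failures < UNHEALTHY_AFTER


def pick_region(regions):
    """Return the lowest-latency healthy region.

    If every region looks unhealthy, fail open and consider all of them,
    since a broken health checker is more likely than a global outage.
    """
    candidates = [r for r in regions if is_healthy(r)] or list(regions)
    return min(candidates, key=lambda r: r.latency_ms)


regions = [Region("A", 20), Region("B", 80), Region("C", 150)]
print(pick_region(regions).name)  # prints "A"

for _ in range(UNHEALTHY_AFTER):
    record_probe(regions[0], ok=False)  # Region A starts failing its probes
print(pick_region(regions).name)  # prints "B"
```

Requiring several consecutive failures trades a slightly longer detection time for fewer flaps; with DNS steering, the record TTL adds to that delay before clients actually move.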

Data layer

| Layer | Active-active option | Trade-off |
| --- | --- | --- |
| Stateless services | trivial | none |
| Cache | per-region (Redis cluster) or hierarchical | cross-region invalidation |
| OLTP DB | Spanner, CockroachDB, Aurora Global Database (writes go to one primary region), DynamoDB global tables | latency on cross-region writes, or conflicts when replication is asynchronous |
| Object store | S3 cross-region replication | eventual consistency; pay for replication |
| Event bus | MirrorMaker (Kafka), cross-region Pub/Sub | dedupe at consumer |
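
Mirrored event streams can deliver the same event more than once, for example after a failover replays part of a topic, which is why the event-bus row lists "dedupe at consumer". A minimal in-memory sketch of an idempotent consumer follows. It assumes every event carries a globally unique `event_id` that survives mirroring; the class name and the bounded-memory policy are hypothetical.

```python
from collections import OrderedDict


class DedupingConsumer:
    """Runs the handler at most once per event_id, remembering recent IDs."""

    def __init__(self, handler, max_remembered=10_000):
        self._handler = handler
        self._seen = OrderedDict()  # event_id -> True, oldest first
        self._max = max_remembered

    def consume(self, event_id, payload):
        """Process the event unless it was already seen. Returns True if processed."""
        if event_id in self._seen:
            return False  # duplicate delivery, e.g. replayed after failover
        self._handler(payload)
        self._seen[event_id] = True
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # forget the oldest ID to bound memory
        return True


processed = []
consumer = DedupingConsumer(processed.append)
consumer.consume("order-1", {"total": 40})
consumer.consume("order-1", {"total": 40})  # same event arriving via the mirror
consumer.consume("order-2", {"total": 15})
print(len(processed))  # prints 2
```

In a real system the seen-ID record would need to be stored durably and committed together with the handler's side effect; otherwise a crash between the two reintroduces duplicates.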

Cross-region write strategies

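  • Region-pinned writes: each user writes to a "home region" and reads from any region. Easiest to reason about.
  • Last-writer-wins: eventually consistent; when two regions write the same key concurrently, all but the latest write are silently lost.
  • CRDTs / per-key conflict resolution: strong eventual consistency for counters and sets.
  • Globally consistent (Spanner / CRDB): Spanner uses Paxos with TrueTime, CockroachDB uses Raft with hybrid logical clocks, both over the WAN; a commit costs roughly one cross-region quorum round trip (~150ms).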
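
The difference between last-writer-wins and a CRDT is easiest to see side by side. The sketch below uses two toy types written for this example (not a library API): an LWW register that keeps only the newest write, and a grow-only counter (G-Counter) whose merge takes the per-region maximum so no increment is lost. Timestamps are plain integers supplied by the caller, which glosses over clock skew.

```python
class LWWRegister:
    """Last-writer-wins register: the highest (timestamp, region) stamp wins."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")  # region name breaks ties between equal timestamps

    def set(self, value, timestamp, region):
        if (timestamp, region) > self.stamp:
            self.value, self.stamp = value, (timestamp, region)

    def merge(self, other):
        self.set(other.value, *other.stamp)


class GCounter:
    """Grow-only counter CRDT: one slot per region, merged with max()."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


# Concurrent writes in two regions, then replication in both directions.
us_reg, eu_reg = LWWRegister(), LWWRegister()
us_reg.set("draft-from-us", 100, "us")
eu_reg.set("draft-from-eu", 101, "eu")
us_reg.merge(eu_reg)
eu_reg.merge(us_reg)
print(us_reg.value, eu_reg.value)  # both converge; the "us" write is gone

us_cnt, eu_cnt = GCounter("us"), GCounter("eu")
us_cnt.increment(3)
eu_cnt.increment(2)
us_cnt.merge(eu_cnt)
eu_cnt.merge(us_cnt)
print(us_cnt.value(), eu_cnt.value())  # prints "5 5"
```

Both types converge, but only the counter preserves every update. That is the practical meaning of "data loss possible on conflict" for last-writer-wins.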

Game days

  • Practice failover quarterly.
  • Tag every infra resource with failover_role.
  • Document the runbook; chaos-test it.
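
The tagging rule is only useful if it is enforced, so a game-day checklist can start with an audit. The sketch below assumes the infrastructure inventory has been exported as a list of dicts with `id` and `tags` keys, and that `primary`, `standby`, and `shared` are the allowed roles. Both the export shape and the role names are hypothetical.

```python
ALLOWED_ROLES = {"primary", "standby", "shared"}  # hypothetical role names


def audit_failover_tags(resources):
    """Return (resource_id, problem) pairs for missing or unknown failover_role tags."""
    problems = []
    for res in resources:
        role = res.get("tags", {}).get("failover_role")
        if role is None:
            problems.append((res["id"], "missing failover_role"))
        elif role not in ALLOWED_ROLES:
            problems.append((res["id"], f"unknown failover_role {role!r}"))
    return problems


# Hypothetical inventory export.
inventory = [
    {"id": "db-eu-1", "tags": {"failover_role": "primary"}},
    {"id": "db-us-1", "tags": {"failover_role": "standby"}},
    {"id": "cache-eu-1", "tags": {"team": "platform"}},
    {"id": "queue-us-1", "tags": {"failover_role": "backup"}},
]

for resource_id, problem in audit_failover_tags(inventory):
    print(f"{resource_id}: {problem}")
```

Running a check like this before each exercise catches resources that would otherwise be skipped by the runbook.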

DR levels (AWS Well-Architected)

flowchart LR
  L1[Backup & Restore]
  L2[Pilot Light]
  L3[Warm Standby]
  L4[Active-Active / Multi-Site]
  L1 --> L2 --> L3 --> L4
  L1 -. cheaper, slower .- L1
  L4 -. more expensive, faster .- L4

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class L2,L3,L4 service;
    class L1 datastore;

Common interview hooks

  • "What's the latency cost of active-active for writes?" → at least one cross-region RTT for every synchronously replicated write; bounded by physics.
  • "How do you handle stateful workloads in failover?" → drain, let replication catch up, fence the old primary (STONITH), then promote the standby.
  • "How do you avoid split-brain in active-active?" → leases, fencing tokens, quorum across regions.
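
Fencing tokens are the least intuitive of these answers, so here is a minimal sketch. A lease service hands out a strictly increasing token with every grant, and the storage layer rejects any write carrying a token older than the newest one it has seen. `LeaseService` and `FencedStore` are toy classes for illustration; lease expiry and the quorum that would back the lease service are left out.

```python
class LeaseService:
    """Grants the primary role and a strictly increasing fencing token."""

    def __init__(self):
        self._token = 0
        self.holder = None

    def acquire(self, node):
        self._token += 1
        self.holder = node
        return self._token


class FencedStore:
    """Storage that refuses writes from holders of stale tokens."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


leases, store = LeaseService(), FencedStore()

old_token = leases.acquire("region-a")  # region A is primary
store.write(old_token, "balance", 100)

new_token = leases.acquire("region-b")  # A is partitioned; B is promoted
store.write(new_token, "balance", 120)

try:
    store.write(old_token, "balance", 90)  # A wakes up still believing it is primary
except PermissionError as err:
    print("rejected:", err)

print(store.data["balance"])  # prints 120
```

The check lives in the store rather than in the old primary because a partitioned or paused node cannot be trusted to notice that its lease has expired.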

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

| Tag | Concept | What it is | Page |
| --- | --- | --- | --- |
| HLD | Load balancer / GSLB | L4/L7 traffic distribution and failover | load-balancer |
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | CAP / PACELC | C vs A under partition; L vs C otherwise | cap-pacelc |
| HLD | Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | Logical clocks | Lamport, vector clocks, HLC, TrueTime | logical-clocks |
| HLD | Multi-region & DR | RTO / RPO, active-active, failover | multi-region-dr |