Multi-Region & DR: Notes

Setting RTO / RPO

  • Agree targets with the business: a "5 min RTO" costs far more than a "5 hr RTO".
  • Tier services: tier-0 (payments, login) → active-active; tier-3 (analytics) → backups OK.
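
One way to make the tiering explicit is a small lookup table that every service must appear in. The sketch below is illustrative only: the tier-0 and tier-3 strategies come from the bullet above, while the tier-1 and tier-2 strategies, all numeric RTO/RPO targets, and the names `DrTier`, `TIERS`, `SERVICES`, and `strategy_for` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    strategy: str
    rto_seconds: int  # hypothetical target, agree the real one with the business
    rpo_seconds: int  # hypothetical target


# Tier 0 and tier 3 follow the notes; tiers 1-2 and all numbers are assumptions.
TIERS = {
    0: DrTier("active-active", rto_seconds=5 * 60, rpo_seconds=0),
    1: DrTier("warm standby", rto_seconds=30 * 60, rpo_seconds=5 * 60),
    2: DrTier("pilot light", rto_seconds=4 * 3600, rpo_seconds=3600),
    3: DrTier("backup and restore", rto_seconds=24 * 3600, rpo_seconds=24 * 3600),
}

# Service-to-tier assignment; an unlisted service raises KeyError on lookup,
# which forces an explicit tiering decision.
SERVICES = {"payments": 0, "login": 0, "analytics": 3}


def strategy_for(service: str) -> str:
    """Return the DR strategy implied by the service's tier."""
    return TIERS[SERVICES[service]].strategy
```

Keeping the table in code (or config) means a new service cannot ship without someone choosing its tier.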

Cost

  • Active-active across two regions roughly doubles the compute bill and adds state-replication costs.
  • Cross-region egress is the silent killer. Estimate before you commit.
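
A back-of-the-envelope egress estimate can be as simple as the sketch below. The function name `monthly_egress_cost`, the 500 GB/day write volume, and the 0.02 per-GB price are placeholders, not real quotes; substitute the provider's current inter-region transfer rate and your measured replication volume.

```python
def monthly_egress_cost(
    replicated_gb_per_day: float,
    replica_regions: int,
    price_per_gb: float,
    days: int = 30,
) -> float:
    """Estimate the monthly cross-region replication bill.

    Assumes every replicated byte is sent once to each replica region
    and billed at a flat per-GB rate (real pricing may be tiered).
    """
    return replicated_gb_per_day * replica_regions * price_per_gb * days


# Hypothetical inputs: 500 GB/day replicated to one other region at 0.02 per GB.
estimate = monthly_egress_cost(500, replica_regions=1, price_per_gb=0.02)
print(f"Estimated monthly egress: {estimate:.2f}")
```

The estimate scales linearly with the number of replica regions, so adding a third region for quorum is worth pricing separately.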

Stateful service failover steps

  1. Detect (health checks across regions, plus quorum check).
  2. Fence the old primary (STONITH / lease revocation).
  3. Promote standby (or new region) to primary.
  4. Repoint DNS / GLB to new region (or shift weight gradually).
  5. Drain stale caches, then warm them up in the new region.
  6. Failback later when source region is healthy.
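
The ordering of steps 1 to 5 above can be sketched as a small orchestrator. This is a minimal illustration, not a real runbook: every callable passed in (`quorum_says_primary_down`, `fence_old_primary`, `promote_standby`, `shift_traffic`, `warm_caches`) is a hypothetical hook you would implement against your own infrastructure, and the traffic weights are arbitrary. The one property it tries to show is that promotion never happens unless fencing succeeded, which is what prevents two primaries.

```python
from typing import Callable, List, Sequence


class FailoverAborted(Exception):
    """Raised when a precondition for a safe failover is not met."""


def run_failover(
    quorum_says_primary_down: Callable[[], bool],
    fence_old_primary: Callable[[], bool],
    promote_standby: Callable[[], None],
    shift_traffic: Callable[[int], None],
    warm_caches: Callable[[], None],
    weights: Sequence[int] = (10, 50, 100),
) -> List[str]:
    """Run the failover steps in order and return a log of completed steps."""
    log: List[str] = []

    # Step 1: require agreement from a quorum of observers, not one health check.
    if not quorum_says_primary_down():
        raise FailoverAborted("no quorum that the primary is down")
    log.append("detected")

    # Step 2: fence first; promoting without fencing risks split brain.
    if not fence_old_primary():
        raise FailoverAborted("could not fence old primary; refusing to promote")
    log.append("fenced")

    # Step 3: promote the standby.
    promote_standby()
    log.append("promoted")

    # Step 4: shift traffic gradually (percent of traffic sent to the new region).
    for weight in weights:
        shift_traffic(weight)
        log.append(f"traffic:{weight}")

    # Step 5: handle caches in the new region.
    warm_caches()
    log.append("caches warmed")
    return log
```

Failback (step 6) is deliberately left out: it is usually a separate, planned run of the same sequence in the opposite direction.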

Avoid these antipatterns

  • "Failover plan that's never been tested" — equals no plan.
  • Hidden single-region dependencies (KMS keys, secret store, image registry).
  • Long-TTL DNS — clients keep hammering the dead region.
  • Asymmetric failover (DB fails over but service config doesn't).
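
Hidden single-region dependencies are easier to catch with a mechanical check than by inspection. The sketch below assumes you can build an inventory mapping each dependency to the regions where it is available; the inventory contents, region names, and the function name `missing_regions` are all hypothetical.

```python
from typing import Dict, Set


def missing_regions(
    inventory: Dict[str, Set[str]],
    required_regions: Set[str],
) -> Dict[str, Set[str]]:
    """Return each dependency that is absent from at least one required region."""
    gaps: Dict[str, Set[str]] = {}
    for dependency, regions in inventory.items():
        missing = required_regions - regions
        if missing:
            gaps[dependency] = missing
    return gaps


# Hypothetical inventory: only the secret store exists in both regions.
inventory = {
    "kms-key": {"us-east-1"},
    "secret-store": {"us-east-1", "eu-west-1"},
    "image-registry": {"us-east-1"},
}
gaps = missing_regions(inventory, {"us-east-1", "eu-west-1"})
for dependency, regions in sorted(gaps.items()):
    print(f"{dependency} is missing in: {', '.join(sorted(regions))}")
```

Running a check like this in CI, or before each failover drill, turns the antipattern into a failing build rather than a surprise during an outage.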

Refs

  • AWS Well-Architected Reliability Pillar (DR strategies).
  • "Site Reliability Engineering" book — chapter on failure recovery.
  • Spanner / CockroachDB papers on geo-replication.
  • Netflix Chaos Kong + RegionalEvac runbooks.