Multi-Region & DR — Notes
Setting RTO / RPO
- Talk to the business: a "5 min RTO" costs far more than a "5 hr RTO", so agree on the targets with the price in view.
- Tier services: tier-0 (payments, login) → active-active; tier-3 (analytics) → backups OK.
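A minimal sketch of how the tiering above could be written down as data, so every service has an explicit strategy and target. The class and table names are hypothetical, the RTO values reuse the 5 min / 5 hr examples, and the RPO numbers are placeholder assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    strategy: str      # DR strategy for the tier
    rto_seconds: int   # max tolerated downtime
    rpo_seconds: int   # max tolerated data loss, measured in time


# Assumed targets for illustration only; agree on real ones with the business.
TIER_POLICIES = {
    0: TierPolicy("active-active", rto_seconds=5 * 60, rpo_seconds=0),
    3: TierPolicy("backup-restore", rto_seconds=5 * 3600, rpo_seconds=24 * 3600),
}

# Service-to-tier mapping taken from the examples in these notes.
SERVICE_TIERS = {"payments": 0, "login": 0, "analytics": 3}


def policy_for(service: str) -> TierPolicy:
    """Look up the DR policy for a service via its tier."""
    return TIER_POLICIES[SERVICE_TIERS[service]]
```

Usage: policy_for("payments").strategy gives "active-active". A service missing from SERVICE_TIERS raises KeyError, which is a useful prompt to tier it.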
Cost
- Active-active roughly doubles compute and adds state-replication costs on top.
- Cross-region egress is the silent killer. Estimate before you commit.
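A back-of-the-envelope egress estimate, as a sketch. The function name and parameters are hypothetical, and the per-GB price is an input you must take from your provider's current price list; it assumes every written byte is replicated once to each replica region and uses a 30-day month.

```python
def monthly_egress_cost(write_mb_per_sec: float,
                        price_per_gb: float,
                        replica_regions: int = 1) -> float:
    """Estimate monthly cross-region replication egress cost.

    Assumes all writes are shipped once to each replica region,
    a 30-day month, and 1 GB = 1000 MB.
    """
    seconds_per_month = 30 * 24 * 3600
    gb_per_month = write_mb_per_sec * seconds_per_month / 1000
    return gb_per_month * price_per_gb * replica_regions
```

For example, a sustained 10 MB/s of writes at an assumed 0.02 per GB to one replica region comes to about 518 per month (25,920 GB x 0.02). Read traffic that crosses regions is not included and can dominate.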
Stateful service failover steps
- Detect the failure (health checks from multiple regions, plus a quorum check so one bad vantage point can't trigger a failover).
- Fence the old primary (STONITH / lease revocation).
- Promote standby (or new region) to primary.
- Repoint DNS / GLB to new region (or shift weight gradually).
- Drain stale caches, then warm them up in the new region.
- Fail back later, once the original region is healthy again.
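The ordering of the steps above can be sketched as a small runner that stops at the first failed step, so a standby is never promoted unless fencing succeeded. All names here are hypothetical; the fence, promote, repoint, and warm callables stand in for whatever your platform provides, and the detection rule is assumed to be a strict majority of probes.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def quorum_says_down(votes: List[bool]) -> bool:
    """True only if a strict majority of probes report the primary as down."""
    return sum(votes) * 2 > len(votes)


def build_steps(probe_votes: List[bool],
                fence: Callable[[], bool],
                promote: Callable[[], bool],
                repoint: Callable[[], bool],
                warm: Callable[[], bool]) -> List[Step]:
    """Assemble the failover steps in the order listed in the notes."""
    return [
        ("detect", lambda: quorum_says_down(probe_votes)),
        ("fence", fence),
        ("promote", promote),
        ("repoint", repoint),
        ("warm_caches", warm),
    ]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that returns False.

    Returns the names of the steps that completed.
    """
    completed = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed
```

If fencing fails, run_failover returns ["detect"] and promote is never called, which is the point: two primaries are worse than a longer outage. Failback is deliberately left out because it is a separate, planned operation.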
Avoid these antipatterns
- A failover plan that has never been tested is no plan.
- Hidden single-region dependencies (KMS keys, secret store, image registry).
- Long-TTL DNS — clients keep hammering the dead region.
- Asymmetric failover (the DB fails over but service config still points at the old region).
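One way to surface hidden single-region dependencies is to keep an inventory of where each dependency lives and diff it against the regions you expect to fail over between. A minimal sketch; the function name, the inventory shape, and the region names are all hypothetical, and building the inventory is the hard part this does not cover.

```python
from typing import Dict, Set


def single_region_gaps(deps: Dict[str, Set[str]],
                       required: Set[str]) -> Dict[str, Set[str]]:
    """Map each dependency to the required regions it is missing from.

    Dependencies present in every required region are omitted.
    """
    gaps = {}
    for name, regions in deps.items():
        missing = required - regions
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical inventory: dependency name -> regions where it is available.
inventory = {
    "kms-keys": {"us-east-1"},
    "secret-store": {"us-east-1", "us-west-2"},
    "image-registry": {"us-east-1"},
}
gaps = single_region_gaps(inventory, {"us-east-1", "us-west-2"})
```

Here gaps flags kms-keys and image-registry as missing from us-west-2, so a failover to that region would come up unable to decrypt or pull images.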
Refs
- AWS Well-Architected Reliability Pillar (DR strategies).
- "Site Reliability Engineering" book — chapter on failure recovery.
- Spanner / CockroachDB papers on geo-replication.
- Netflix Chaos Kong + RegionalEvac runbooks.