Resilience Patterns — Notes
Default playbook for any service call
- Timeout (connect + read), short.
- Retry only idempotent ops, with exponential backoff + jitter, max N.
- Circuit breaker around the call.
- Bulkhead the pool.
- Fallback / static response if all else fails.
- Surface SLO impact via metrics.
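A minimal sketch of how three of the playbook items above (retry with backoff and jitter, circuit breaker, fallback) might compose. The names `CircuitBreaker`, `call_with_retry`, and `backoff_delays`, and all default thresholds, are hypothetical and not taken from this repo. Timeouts, bulkheads, and metrics are left out, and the breaker is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open trial."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_after_s = reset_after_s          # assumed default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: let a trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delays(max_attempts, base_s=0.1, cap_s=2.0, rng=random.random):
    """Yield the sleep before each retry: full jitter over a capped exponential."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_retry(fn, max_attempts=3, fallback=None, sleep=time.sleep):
    """Retry an idempotent fn at most max_attempts times, then use fallback()."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delays = backoff_delays(max_attempts)
    last_exc = None
    for _ in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError as exc:
            last_exc = exc
            break  # do not hammer an open circuit
        except Exception as exc:  # a real caller would catch only retryable errors
            last_exc = exc
            delay = next(delays, None)
            if delay is None:
                break  # attempts exhausted
            sleep(delay)
    if fallback is not None:
        return fallback()
    raise last_exc
```

A caller would typically wrap the breaker inside the retry, for example `call_with_retry(lambda: breaker.call(fetch), fallback=lambda: cached_value)`, where `fetch` and `cached_value` are placeholders for the real call and the static response.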
Sizing concurrency
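- Little's Law: concurrency = throughput × latency (strictly, mean latency).
- If p99 = 200 ms and the target is 1000 RPS, you need 1000 × 0.2 s = 200 concurrent slots. Using p99 instead of the mean makes this a conservative estimate.
- A pool much larger than that wastes resources; a smaller one queues requests.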
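The sizing arithmetic above can be checked with a few lines of Python. The helper name `required_concurrency` is made up for illustration, and the 50 ms mean in the second call is an assumed figure, not a measurement from this system.

```python
import math


def required_concurrency(throughput_rps: float, latency_ms: float) -> int:
    """Little's Law (L = lambda * W), rounded up to whole slots."""
    slots = throughput_rps * latency_ms / 1000.0
    # Round first so float noise does not push an exact value up by one slot.
    return math.ceil(round(slots, 9))


print(required_concurrency(1000, 200))  # 200 slots when sizing against p99
print(required_concurrency(1000, 50))   # 50 slots for an assumed 50 ms mean
```

The gap between the two results is the headroom bought by sizing against the tail rather than the mean.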
Failure detection
- Liveness ≠ readiness. Liveness: "process alive?". Readiness: "can serve traffic?".
- Heartbeat with phi-accrual: outputs a continuous suspicion level (phi) computed from heartbeat inter-arrival times, instead of a binary up/down verdict.
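To make the phi-accrual idea concrete, here is a simplified sketch that models heartbeat inter-arrival times as normally distributed and reports phi = -log10(P(next heartbeat arrives later than the time already elapsed)). The class name, window size, and minimum standard deviation are assumptions for illustration. Production detectors differ in details such as the distribution model and bootstrap handling.

```python
import math
import statistics
from collections import deque


class PhiAccrualDetector:
    """Suspicion level from heartbeat inter-arrival times (normal model)."""

    def __init__(self, window=100, min_std_s=0.05):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_std_s = min_std_s             # floor so perfectly regular beats do not give std = 0
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Return the suspicion level at time `now`; higher means more suspect."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge
        mean = statistics.fmean(self.intervals)
        std = max(statistics.pstdev(self.intervals), self.min_std_s)
        elapsed = now - self.last_heartbeat
        # Survival function of the normal distribution, via erfc for tail accuracy.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) far out in the tail
        return -math.log10(p_later)
```

A caller picks a threshold on phi (a value such as 8 is a common starting point, treated here as an assumption) rather than a fixed "missed N heartbeats" rule, so the detector adapts to the observed jitter.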
Anti-patterns
- Infinite retry loops without budget.
- Long timeouts that hold threads (HTTP library defaults, often 30 s or more, are too high for most service calls).
- No backpressure to upstream callers → unbounded queues and memory blow-up.
- Circuit breaker per host instead of per dependency.
- Retrying inside retrying (nested retry storms).
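One way to avoid the first and last anti-patterns above is a shared retry budget: retries are allowed only as a fraction of first attempts. The sketch below is a hypothetical illustration (`RetryBudget` and its parameters are not from this repo) and uses integer credits to avoid float drift.

```python
class RetryBudget:
    """Permit retries up to roughly `retry_percent` % of first attempts."""

    COST = 100  # credits consumed by one retry

    def __init__(self, retry_percent=10, max_retries_banked=10):
        self.deposit = retry_percent           # credits earned per first attempt
        self.cap = max_retries_banked * self.COST
        self.balance = 0

    def record_request(self):
        """Call once per first attempt (not per retry)."""
        self.balance = min(self.cap, self.balance + self.deposit)

    def try_spend(self):
        """Return True if a retry is allowed, consuming budget if so."""
        if self.balance >= self.COST:
            self.balance -= self.COST
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(10):
    budget.record_request()
print(budget.try_spend())  # True: ten requests earned one retry
print(budget.try_spend())  # False: budget exhausted until more requests arrive
```

Keeping a single budget at the outermost caller, and disabling retries in the layers beneath it, is one way to stop nested retries from multiplying load.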
Where this shows up in this repo
- All service-to-service calls.
- API gateway timeouts & circuits.
- Job scheduler retry policies.
- Webhook delivery retries.
- Email/SMS provider failover.
- Message queue consumer DLQ.
Refs
- Michael Nygard: "Release It!" (Bulkhead, Circuit Breaker patterns).
- Marc Brooker (AWS): "What's a 'reasonable' timeout?", "Hedging your bets".
- Netflix Hystrix (now archived) docs; Resilience4j docs.
- Google SRE Book ch.22 (Addressing Cascading Failures).