
Resilience Patterns: Notes

Default playbook for any service call

  1. Timeout (connect + read), short.
  2. Retry only idempotent ops, with exponential backoff + jitter, max N.
  3. Circuit breaker around the call.
  4. Bulkhead the pool.
  5. Fallback / static response if all else fails.
  6. Surface SLO impact via metrics.
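A minimal sketch of how steps 2, 3, and 5 might compose, assuming the wrapped callable `fn` enforces its own connect and read timeouts (step 1). The names `CircuitBreaker`, `backoff_delay`, and `call_with_playbook` are hypothetical, not taken from this repo, and the thresholds are placeholder values. Bulkheading and metrics (steps 4 and 6) are left out to keep it short.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then allows a
    trial call once `reset_after` seconds have passed (half-open)."""

    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open")
            # Half-open: fall through and let a trial call run.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_playbook(fn, breaker, max_attempts=3, fallback=None, sleep=time.sleep):
    """Retry an idempotent `fn` through the breaker; return `fallback` if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            break  # do not hammer a dependency whose circuit is open
        except Exception:
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    return fallback
```

Only pass idempotent operations as `fn`. This sketch is not thread-safe, so a shared breaker would need a lock around its state.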

Sizing concurrency

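  • Little's Law: concurrency = throughput × latency (average in-flight requests = arrival rate × average time in system).
  • If p99 = 200 ms and the target is 1000 RPS, you need 1000 × 0.2 = 200 concurrent slots. Little's Law is stated for mean latency, so sizing from p99 gives a conservative figure with headroom.
  • A pool larger than that wastes resources; a smaller one queues requests.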
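The arithmetic above, plus one way to enforce the resulting limit, can be sketched as follows. `required_concurrency` and `Bulkhead` are hypothetical names. The bulkhead here rejects immediately when full rather than queueing, which is one design choice among several.

```python
import math
import threading


def required_concurrency(throughput_rps, latency_ms):
    """Little's Law: in-flight requests = arrival rate x time in system."""
    return math.ceil(throughput_rps * latency_ms / 1000)


class Bulkhead:
    """Caps in-flight calls; rejects instead of queueing when all slots are taken."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        # Non-blocking acquire: fail fast so callers are not held waiting.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()


# The example from the notes: 1000 RPS at 200 ms.
slots = required_concurrency(1000, 200)
bulkhead = Bulkhead(slots)
```

Failing fast keeps a slow dependency from tying up caller threads. A bounded wait with a short timeout would be the alternative if brief bursts should be absorbed.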

Failure detection

  • Liveness ≠ readiness. Liveness: "process alive?". Readiness: "can serve traffic?".
  • Heartbeat with phi-accrual: outputs a continuous suspicion level (phi) that grows the longer a heartbeat is overdue, instead of a binary up/down verdict.
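A rough sketch of a phi-accrual detector, assuming heartbeat intervals are modeled as normally distributed. The class name, window size, and `min_stdev` floor are illustrative assumptions, not values from this repo. Phi is computed as -log10 of the probability that a heartbeat would arrive later than the time already elapsed.

```python
import math
from collections import deque
from statistics import NormalDist, mean, pstdev


class PhiAccrualDetector:
    def __init__(self, window=100, min_stdev=0.05):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.min_stdev = min_stdev             # floor to avoid a zero-width distribution
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level at time `now`; higher means more likely failed."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge
        sigma = max(pstdev(self.intervals), self.min_stdev)
        dist = NormalDist(mean(self.intervals), sigma)
        # Probability that the next heartbeat arrives later than the elapsed time.
        p_later = 1.0 - dist.cdf(now - self.last_heartbeat)
        # Clamp so log10 never sees zero; this caps phi at 300.
        return -math.log10(max(p_later, 1e-300))
```

A caller would compare `phi(now)` against a threshold chosen for its own tolerance of false positives. The threshold is a tuning knob, not something these notes specify.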

Anti-patterns

  • Infinite retry loops without budget.
  • Long timeouts that hold threads (HTTP client defaults are often 30 s or more, and some libraries set no timeout at all; both are too high).
  • No backpressure toward upstream callers → unbounded queues and memory blow-up.
  • Circuit breaker per host instead of per dependency.
  • Retrying inside retrying (nested retry storms).
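One way to put a budget on retries is a token bucket shared by all calls to a dependency. The sketch below uses hypothetical names (`RetryBudget`, `call_with_budget`) and placeholder numbers. Each first attempt earns a fraction of a token and each retry spends a whole one, so retries stay bounded at roughly `ratio` times the request rate once the initial balance is used up. Backoff between attempts is omitted for brevity.

```python
class RetryBudget:
    """Token bucket: each request earns `ratio` tokens, each retry spends one."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens  # start full so a cold service can still retry

    def record_request(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        """Return True and spend a token if a retry is allowed."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(fn, budget, max_attempts=3):
    """Retry `fn` only while both the attempt cap and the shared budget allow it."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            is_last = attempt + 1 == max_attempts
            if is_last or not budget.try_spend():
                raise  # out of attempts or out of budget: surface the error
```

Applying the budget at a single layer of the call stack, with inner layers not retrying at all, also avoids the nested retry storms listed above.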

Where this shows up in this repo

  • All service-to-service calls.
  • API gateway timeouts & circuits.
  • Job scheduler retry policies.
  • Webhook delivery retries.
  • Email/SMS provider failover.
  • Message queue consumer DLQ.

Refs

  • Michael Nygard: "Release It!" (Bulkhead, Circuit Breaker patterns).
  • Marc Brooker (AWS): "What's a 'reasonable' timeout?", "Hedging your bets."
  • Netflix Hystrix (now archived) docs; Resilience4j docs.
  • Google SRE Book ch.22 (Addressing Cascading Failures).