
Health Check / Heartbeat Service: Detailed

flowchart TB
  subgraph Nodes
    N1[Node 1]
    N2[Node 2]
    NN[Node N]
  end

  subgraph Modes[Probe modes]
    ACTIVE[Active checks - HTTP/TCP/gRPC]
    PASSIVE[Passive - outcome-based]
    HB[Heartbeat push from node]
    PHI[Phi-accrual detector]
  end

  subgraph Service
    COL[Collector cluster]
    STATE[(Health state KV)]
    GOSSIP[Gossip across collectors]
    RULES[Status rules]
    DEPS[Dependency graph]
  end

  subgraph Reactions
    LB[LB pool update]
    ROUTE[Routing change]
    ALERT[Alerts / pages]
    RUNBOOK[Automated remediation]
  end

  Nodes --> Modes --> Service --> Reactions

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class LB edge;
    class N1,N2,NN,ACTIVE,PASSIVE,HB,PHI,COL,GOSSIP,RULES,DEPS,ROUTE,RUNBOOK service;
    class STATE datastore;
    class ALERT obs;

Failure detection

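  • Binary up/down is brittle: a fixed timeout flips a node to "down" on a single late or dropped probe, and no single timeout suits both quiet and jittery networks.
  • Phi-accrual outputs a continuous suspicion value instead. Phi is -log10 of the probability that the next heartbeat is merely late, given the node's recent inter-arrival history, so it grows the longer the silence lasts. It is a suspicion level, not a direct probability that the host is dead; thresholding phi triggers reactions.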
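
The phi calculation described above can be sketched as follows. This is a minimal illustration, not a production detector: it assumes heartbeat inter-arrival times are roughly normally distributed, and the class name `PhiAccrualDetector`, the `window` size, and the `min_std` floor are hypothetical choices made for the example.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Tracks heartbeat inter-arrival times and reports a suspicion level phi."""

    def __init__(self, window=100, min_std=0.1):
        # Sliding window of recent inter-arrival times, in seconds (assumed size).
        self.intervals = deque(maxlen=window)
        self.last_arrival = None
        # Floor on the standard deviation so a perfectly regular sender
        # does not produce sigma = 0 (assumed value).
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Return the suspicion level at time `now`; higher means more suspect."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to suspect anything
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that the next heartbeat arrives later than `elapsed`,
        # i.e. the survival function of a normal distribution.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) after long silences
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in range(11):          # heartbeats at t = 0, 1, ..., 10 seconds
    detector.heartbeat(float(t))

SUSPECT_THRESHOLD = 5.0      # hypothetical threshold; tune per deployment
for now in (10.5, 11.0, 11.5, 13.0):
    level = detector.phi(now)
    print(now, round(level, 2), "suspect" if level >= SUSPECT_THRESHOLD else "ok")
```

Because phi is continuous, different reactions can hang off different thresholds: for example, a low one for removing a node from the load-balancer pool and a higher one for paging.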

Probe vs heartbeat

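  • Active probe = control plane initiates (the collector calls the node over HTTP/TCP/gRPC); it confirms that the network path from the collector to the node works.
  • Heartbeat = node-initiated push to the collector; cheaper at scale because the collector only records arrivals instead of scheduling and opening a connection per node.
  • Both are used together, since each catches failures the other can miss: a live node whose inbound path is broken, or a reachable port on a node whose own reporting loop has stalled.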
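
A minimal sketch of combining the two signals on the collector side. The names `tcp_probe`, `HeartbeatTable`, and `combined_status`, the `ttl` value, and the four status labels are all hypothetical, chosen only to illustrate the idea; a real service would apply its own status rules.

```python
import socket
import time


def tcp_probe(host, port, timeout=1.0):
    """Active check: the collector opens a TCP connection to the node."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class HeartbeatTable:
    """Heartbeat side: nodes push, the collector only records arrival times."""

    def __init__(self, ttl=5.0):
        self.ttl = ttl          # seconds a heartbeat stays valid (assumed value)
        self.last_seen = {}     # node_id -> time of last heartbeat

    def record(self, node_id, now=None):
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def is_fresh(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(node_id)
        return seen is not None and (now - seen) <= self.ttl


def combined_status(heartbeat_fresh, probe_ok):
    """Merge both signals into one illustrative status label."""
    if heartbeat_fresh and probe_ok:
        return "healthy"
    if heartbeat_fresh:
        return "unreachable"  # node is alive, but the collector cannot reach it
    if probe_ok:
        return "degraded"     # port answers, but the node stopped reporting
    return "down"
```

A collector loop would call `record` whenever a heartbeat arrives, run `tcp_probe` on a slower schedule, and feed both results to `combined_status` before updating the health state.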

Glossary & fundamentals

| Concept | What it is | Fundamentals |
|---|---|---|
| Service discovery | feeds healthy set | service-discovery |
| Load balancer | consumes healthy pool | load-balancer |
| Resilience patterns | breakers + ejections | resilience-patterns |
| Observability | uptime SLO signals | observability |