Distributed File System (GFS / HDFS / Colossus): Detailed

flowchart TB
  subgraph Client[Client]
    APP([App / MapReduce / Spark])
    LIB([DFS Client library])
  end

  subgraph Master[Master Tier]
    NN[NameNode / Master<br/>namespace + chunk map]
    SNN[Standby NameNode<br/>HA via Raft / Quorum Journal]
    LEASE[Lease manager]
    BR([Block report aggregator])
    BAL[Balancer]
    QUOTA[Quota / ACL]
  end

  subgraph Storage[Chunk Servers / DataNodes]
    direction LR
    DN1[(DataNode 1<br/>chunks 64 MB-256 MB)]
    DN2[(DataNode 2)]
    DN3[(DataNode 3)]
    DNN[(DataNode N)]
  end

  subgraph Pipeline[Write Pipeline]
    PRIMARY[Primary replica<br/>holds lease]
    SEC1[Secondary 1]
    SEC2[Secondary 2]
    ACK[Daisy-chain ack]
  end

  subgraph EC[Erasure Coding optional]
    RS[Reed-Solomon 10+4 / 6+3]
    SAVE[~50% storage saved vs 3x replication]
    RECONS[Reconstruction reads on degraded]
  end

  subgraph Recovery
    HB[Heartbeats every few seconds]
    DETECT[Failure detector]
    REPL[Re-replication]
    SCRUB[Checksum scrubber]
  end

  subgraph App[Workloads]
    MR([MapReduce / Spark / Flink])
    LOG[Log archive]
    ML[ML training shards]
    BQ[BigQuery / Hive tables]
  end

  APP --> LIB
  LIB -->|metadata RPC| NN
  NN -. log + checkpoint .-> SNN
  LIB -->|read| DN1
  LIB -->|read| DN2
  LIB -->|read| DN3
  LIB -->|write| PRIMARY --> SEC1 --> SEC2 --> ACK --> PRIMARY
  HB --> NN
  BR --> NN
  NN -.replicate cmd.-> REPL
  REPL --> Storage
  EC --- Storage
  App --- LIB

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class LIB client;
    class NN,SNN,LEASE,BAL,QUOTA,PRIMARY,SEC1,SEC2,ACK,RS,SAVE,RECONS,HB,DETECT,REPL,SCRUB,LOG,ML service;
    class DN1,DN2,DN3,DNN,BQ datastore;
    class APP,BR,MR compute;

Design choices

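  • Append-oriented writes (GFS/HDFS): HDFS files are write-once with append only; GFS permits writes at arbitrary offsets but is optimized for record append. Avoiding in-place random writes simplifies replication.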
  • Large chunk size (64–256 MB): fewer metadata entries per file, and friendly to sequential IO.
  • Single master with HA standby: simple to reason about, but the master becomes a metadata bottleneck. Colossus and HDFS Federation address this by spreading metadata across multiple servers or namespaces.
  • Replication 3 by default; EC for cold data.
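
The chunk-size and erasure-coding trade-offs above come down to simple arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the 1 TB file and 4 KB comparison block size are assumptions chosen to make the numbers easy to read.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def chunk_count(file_bytes: int, chunk_bytes: int) -> int:
    """Chunks (and so master metadata entries) needed to hold one file."""
    return -(-file_bytes // chunk_bytes)  # ceiling division


def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte for a data+parity layout."""
    return (data_units + parity_units) / data_units


def savings_vs_replication(data_units: int, parity_units: int,
                           replicas: int = 3) -> float:
    """Fraction of raw storage saved relative to N-way replication."""
    return 1 - storage_overhead(data_units, parity_units) / replicas


# Metadata entries for a 1 TB file at two block sizes.
print(chunk_count(1 * TB, 4 * 1024))  # 268435456 entries with 4 KB blocks
print(chunk_count(1 * TB, 64 * MB))   # 16384 entries with 64 MB chunks

# Raw storage per logical byte: 3.0 for 3x replication, versus:
print(storage_overhead(10, 4))        # 1.4 for Reed-Solomon 10+4
print(storage_overhead(6, 3))         # 1.5 for Reed-Solomon 6+3
print(f"{savings_vs_replication(10, 4):.0%}")  # 53%
print(f"{savings_vs_replication(6, 3):.0%}")   # 50%
```

The price of the lower overhead is that a degraded read has to reconstruct data from several surviving units, which is why EC is usually reserved for cold data.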

Read path

  1. Client sends the master a file name and offset; the master replies with the chunk ID and the DataNodes that hold its replicas.
  2. Client streams directly from a chosen DataNode (often rack-local).
  3. Checksums are verified per block; on a mismatch the read is retried from another replica.
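
The three read steps can be sketched as a toy in-memory model. Everything here is an assumption made for illustration: the `Master`, `DataNode`, and `read` names, the 8-byte chunk size, and the choice to verify a CRC32 on the client side. Real systems differ in where and at what granularity checksums are checked.

```python
import zlib
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # bytes; tiny for the demo (real chunks are 64-256 MB)


@dataclass
class DataNode:
    chunks: dict = field(default_factory=dict)  # chunk_id -> (data, crc32)

    def put(self, chunk_id: str, data: bytes) -> None:
        self.chunks[chunk_id] = (data, zlib.crc32(data))


@dataclass
class Master:
    # (path, chunk index) -> (chunk_id, replica node names)
    chunk_map: dict = field(default_factory=dict)

    def locate(self, path: str, offset: int):
        """Step 1: map a file offset to a chunk and its replica locations."""
        return self.chunk_map.get((path, offset // CHUNK_SIZE))


def read(master: Master, nodes: dict, path: str, offset: int, length: int) -> bytes:
    out = bytearray()
    while length > 0:
        located = master.locate(path, offset)
        if located is None:  # past the last chunk
            break
        chunk_id, replicas = located
        # Steps 2 and 3: read from the first replica whose checksum matches.
        for name in replicas:
            data, crc = nodes[name].chunks[chunk_id]
            if zlib.crc32(data) == crc:
                break
        else:
            raise IOError(f"all replicas of {chunk_id} failed checksum")
        start = offset % CHUNK_SIZE
        piece = data[start:start + length]
        if not piece:  # end of file inside a partial chunk
            break
        out += piece
        offset += len(piece)
        length -= len(piece)
    return bytes(out)


# Demo: a 20-byte file split into chunks of 8, 8, and 4 bytes on two nodes.
nodes = {"dn1": DataNode(), "dn2": DataNode()}
master = Master()
payload = b"hello distributed fs"
for i in range(0, len(payload), CHUNK_SIZE):
    cid = f"c{i // CHUNK_SIZE}"
    for node in nodes.values():
        node.put(cid, payload[i:i + CHUNK_SIZE])
    master.chunk_map[("/logs/a", i // CHUNK_SIZE)] = (cid, ["dn1", "dn2"])

print(read(master, nodes, "/logs/a", 6, 11))  # b'distributed'
```

Note that the master only answers the location lookup; the data bytes never pass through it, which is what keeps a single master viable for large sequential reads.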

Write path

  1. Client requests new chunk → master assigns 3 DataNodes, picks primary (holds lease).
  2. Client pushes data to the closest replica, which forwards it down the chain (daisy-chaining means the client uploads the data once instead of once per replica).
  3. Primary orders mutations; secondaries apply; all ack back.
  4. Master updates metadata only on chunk completion.
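
A minimal sketch of steps 2 and 3, assuming a GFS-style split between the data push and the ordering of mutations. The `Replica`, `Primary`, and `append` names are hypothetical, failure handling and lease expiry are left out, and the chain order is simply passed in by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    buffer: dict = field(default_factory=dict)      # data_id -> staged bytes
    chunk: bytearray = field(default_factory=bytearray)
    applied: list = field(default_factory=list)     # serial numbers, in order

    def push(self, data_id: str, data: bytes, rest_of_chain: list) -> None:
        """Stage the data locally, then forward it to the next replica."""
        self.buffer[data_id] = data
        if rest_of_chain:
            rest_of_chain[0].push(data_id, data, rest_of_chain[1:])

    def apply(self, serial: int, data_id: str) -> bool:
        """Apply a staged mutation in the order the primary chose; ack."""
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)
        return True


@dataclass
class Primary(Replica):
    next_serial: int = 0

    def commit(self, data_id: str, secondaries: list) -> bool:
        """Assign a serial number, apply locally, then on every secondary."""
        serial = self.next_serial
        self.next_serial += 1
        acks = [self.apply(serial, data_id)]
        acks += [s.apply(serial, data_id) for s in secondaries]
        return all(acks)


def append(chain: list, primary: Primary, data_id: str, data: bytes) -> bool:
    """Client side: push data along the chain, then ask the primary to commit."""
    chain[0].push(data_id, data, chain[1:])
    secondaries = [r for r in chain if r is not primary]
    return primary.commit(data_id, secondaries)


# Demo: the client is closest to sec1, so the chain starts there.
primary = Primary("primary")
sec1, sec2 = Replica("sec1"), Replica("sec2")
chain = [sec1, primary, sec2]
append(chain, primary, "w1", b"a1;")
append(chain, primary, "w2", b"b2;")
print(bytes(sec2.chunk), sec2.applied)  # b'a1;b2;' [0, 1]
```

Separating the push from the commit lets data flow along whatever chain is cheapest on the network, while the primary alone decides the order in which mutations are applied.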

Master scaling

  • Federation: namespace split across multiple NameNodes.
  • Colossus: replaces the single GFS master with a distributed metadata service (curators) that stores file metadata in Bigtable.
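
One way to picture a federated namespace is a client-side mount table that routes each path to the NameNode owning the longest matching prefix, similar in spirit to a ViewFs-style mount table in HDFS. The table contents and the `resolve` helper below are hypothetical and only illustrate the routing idea.

```python
# Hypothetical mount table: path prefix -> NameNode responsible for it.
MOUNT_TABLE = {
    "/user": "nn-user",
    "/logs": "nn-logs",
    "/logs/archive": "nn-cold",
}


def resolve(path: str, table: dict = MOUNT_TABLE) -> str:
    """Return the NameNode for the longest mount prefix covering `path`."""
    best = None
    for prefix in table:
        # Match whole path components only, so "/logsx" does not hit "/logs".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return table[best]


print(resolve("/logs/app.log"))            # nn-logs
print(resolve("/logs/archive/2020/x.gz"))  # nn-cold
print(resolve("/user/alice/data.csv"))     # nn-user
```

Each NameNode then only has to hold the metadata for its own subtree, so namespace capacity grows with the number of NameNodes rather than with the memory of one machine.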

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag | Concept | What it is | Page
HLD | Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos
HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower
HLD | Batch & stream processing | Lambda vs Kappa, watermarks, windows | batch-stream-processing