Distributed File System (GFS / HDFS / Colossus) — Detailed#
flowchart TB
subgraph Client[Client]
APP([App / MapReduce / Spark])
LIB([DFS Client library])
end
subgraph Master[Master Tier]
NN[NameNode / Master<br/>namespace + chunk map]
SNN[Standby NameNode<br/>HA via Raft / Quorum Journal]
LEASE[Lease manager]
BR([Block report aggregator])
BAL[Balancer]
QUOTA[Quota / ACL]
end
subgraph Storage[Chunk Servers / DataNodes]
direction LR
DN1[(DataNode 1<br/>chunks 64 MB-256 MB)]
DN2[(DataNode 2)]
DN3[(DataNode 3)]
DNN[(DataNode N)]
end
subgraph Pipeline[Write Pipeline]
PRIMARY[Primary replica<br/>holds lease]
SEC1[Secondary 1]
SEC2[Secondary 2]
ACK[Daisy-chain ack]
end
subgraph EC[Erasure Coding optional]
RS[Reed-Solomon 10+4 / 6+3]
SAVE[~40% storage save vs 3x replication]
RECONS[Reconstruction reads on degraded]
end
subgraph Recovery
HB[Heartbeats every few seconds]
DETECT[Failure detector]
REPL[Re-replication]
SCRUB[Checksum scrubber]
end
subgraph App[Workloads]
MR([MapReduce / Spark / Flink])
LOG[Log archive]
ML[ML training shards]
BQ[BigQuery / Hive tables]
end
APP --> LIB
LIB -->|metadata RPC| NN
NN -. log + checkpoint .-> SNN
LIB -->|read| DN1
LIB -->|read| DN2
LIB -->|read| DN3
LIB -->|write| PRIMARY --> SEC1 --> SEC2 --> ACK --> PRIMARY
HB --> NN
BR --> NN
NN -.replicate cmd.-> REPL
REPL --> Storage
EC --- Storage
App --- LIB
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class LIB client;
class NN,SNN,LEASE,BAL,QUOTA,PRIMARY,SEC1,SEC2,ACK,RS,SAVE,RECONS,HB,DETECT,REPL,SCRUB,LOG,ML service;
class DN1,DN2,DN3,DNN,BQ datastore;
class APP,BR,MR compute;
Design choices#
- Append-only writes (GFS/HDFS) — random writes not supported, simplifies replication.
- Large chunk size (64–256 MB) — fewer metadata entries, sequential IO friendly.
- Single master with HA standby — simple to reason about, but bottleneck → modern Colossus / HDFS Federation use multiple namespaces.
- Replication 3 by default; EC for cold data.
Read path#
- Client asks master for chunk locations of file offset.
- Client streams directly from a chosen DataNode (often rack-local).
- Checksums verified per block.
Write path#
- Client requests new chunk → master assigns 3 DataNodes, picks primary (holds lease).
- Client pushes data to closest replica → it forwards down chain (daisy-chain saves bandwidth).
- Primary orders mutations; secondaries apply; all ack back.
- Master updates metadata only on chunk completion.
Master scaling#
- Federation: namespace split across multiple NameNodes.
- Colossus: separates metadata (Spanner-backed) from old single-master design.
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
HLD |
Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
HLD |
Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
HLD |
Batch & stream processing | Lambda vs Kappa, watermarks, windows | batch-stream-processing |