Distributed File System — Notes#
Goal#
A single namespace usable by data-processing frameworks, with bandwidth-saturating sequential reads/writes and rack-aware placement.
Properties#
- Large files (GB–TB).
- Append + truncate workloads (logs, MR output).
- Tolerate disk/node/rack failures.
- Horizontal capacity scaling by adding DataNodes.
Capacity#
- HDFS at Yahoo / FB scaled to ~100 PB per cluster, billions of files.
- Colossus underpins essentially all Google storage.
Schema (metadata)#
inode(id, type, parent_id, name, mtime, perm)chunks(file_id, offset, chunk_id, locations[])chunk_servers(id, host, capacity, last_heartbeat)
Trade-offs#
- Append-only is great for analytics; bad for OLTP-like workloads → split storage by purpose.
- Single master = simpler but bottleneck → federation / Colossus-style separation.
- Large blocks = great bandwidth, bad for many tiny files (NameNode RAM pressure).
- 3× replication vs EC: replication faster recovery + reads; EC saves cost on cold data.
Refs#
- "The Google File System" (SOSP '03).
- HDFS Architecture guide (Apache).
- Colossus blog posts (Google).
- "Tectonic" Facebook DFS paper.