Skip to content

Distributed File System — Notes#

Goal#

A single namespace usable by data-processing frameworks, with bandwidth-saturating sequential reads/writes and rack-aware placement.

Properties#

  • Large files (GB–TB).
  • Append + truncate workloads (logs, MR output).
  • Tolerate disk/node/rack failures.
  • Horizontal capacity scaling by adding DataNodes.

Capacity#

  • HDFS at Yahoo / FB scaled to ~100 PB per cluster, billions of files.
  • Colossus underpins essentially all Google storage.

Schema (metadata)#

  • inode(id, type, parent_id, name, mtime, perm)
  • chunks(file_id, offset, chunk_id, locations[])
  • chunk_servers(id, host, capacity, last_heartbeat)

Trade-offs#

  • Append-only is great for analytics; bad for OLTP-like workloads → split storage by purpose.
  • Single master = simpler but bottleneck → federation / Colossus-style separation.
  • Large blocks = great bandwidth, bad for many tiny files (NameNode RAM pressure).
  • 3× replication vs EC: replication faster recovery + reads; EC saves cost on cold data.

Refs#

  • "The Google File System" (SOSP '03).
  • HDFS Architecture guide (Apache).
  • Colossus blog posts (Google).
  • "Tectonic" Facebook DFS paper.