Object Storage (S3 / GCS / Azure Blob) - Detailed

flowchart TB
  subgraph Client[Clients]
    SDK([SDK / CLI])
    BR([Browser pre-signed])
    APP[App service]
  end

  subgraph Edge[Edge]
    DNS[Regional endpoint DNS]
    LB[L7 LB]
    GW[REST Frontend<br/>SigV4 auth, range, multipart]
  end

  subgraph Control[Control Plane]
    AUTH[IAM / SigV4 verify]
    POL[Bucket policy / ACL]
    LIFECYCLE[Lifecycle rules<br/>tier transition / expiry]
    VERSION[Versioning manager]
    REPL[Cross-region replication]
    EVENTS[[Event notifications<br/>SNS / Lambda]]
  end

  subgraph Index[Metadata / Index]
    KEYS[(Key index<br/>distributed B-tree)]
    PART[Partitioning by key prefix]
    HOT[Hot-shard splitter]
  end

  subgraph Data[Data Plane - Storage Nodes]
    PLACE[Placement service]
    SHARD[Erasure coding<br/>e.g. 10+4 RS]
    CHUNK[Chunks 4-64 MB]
    NODE1[Storage node 1]
    NODE2[Storage node 2]
    NODE3[Storage node 3]
    NODEN[Storage node N]
    SCRUB[Scrubber / repair]
    BG[Background rebalance]
  end

  subgraph Tiering
    STANDARD[Standard tier - SSD/HDD mix]
    IA[Infrequent access]
    GLACIER[Cold / Glacier / Archive<br/>tape, minutes-to-hours retrieval]
  end

  subgraph Features
    MULTI[Multipart upload]
    PRES[Pre-signed URLs]
    CMK[KMS / SSE / CSE encryption]
    LOCK[Object lock / WORM / Legal hold]
    LOG[Server access logs]
    RR[Requester-pays]
    SELECT[S3 Select / pushdown]
  end

  Client --> DNS --> LB --> GW --> AUTH --> POL
  GW --> KEYS
  GW --> PLACE --> SHARD --> CHUNK
  CHUNK --> NODE1
  CHUNK --> NODE2
  CHUNK --> NODE3
  CHUNK --> NODEN
  SCRUB -.repair.-> Data
  BG -.move.-> Data
  LIFECYCLE -.transition.-> Tiering
  VERSION --- KEYS
  REPL -.async.-> Data
  EVENTS -.fire.-> Client
  Features --- GW

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class SDK,BR client;
    class DNS,LB edge;
    class APP,GW,AUTH,LIFECYCLE,VERSION,REPL,PART,HOT,PLACE,SHARD,CHUNK,NODE1,NODE2,NODE3,NODEN,SCRUB,BG,STANDARD,IA,MULTI,PRES,CMK,LOCK,RR service;
    class KEYS,GLACIER datastore;
    class EVENTS queue;
    class POL,SELECT storage;
    class LOG obs;

Architecture beats

  • Flat keyspace keyed by bucket + object_key; no real directories.
  • Metadata (index) is separate from data (blobs). Index is itself a distributed sorted structure.
  • Erasure coding: each object split into chunks; chunks encoded into N+K shards (e.g. Reed-Solomon 10+4) across nodes/racks/AZs.
  • Durability of 11 nines (99.999999999%) comes from erasure coding, cross-AZ placement, and scrubbing.
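
The trade-off behind the 10+4 Reed-Solomon example can be checked with a little arithmetic. The sketch below is a toy model, not how any provider computes durability: the function names are hypothetical, and it assumes shards fail independently with the same probability and that nothing is repaired in the meantime, which real systems do not rely on.

```python
from math import comb


def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data for an N+K layout."""
    return (data_shards + parity_shards) / data_shards


def loss_probability(total_shards: int, tolerated_losses: int, p_fail: float) -> float:
    """Probability that more shards fail than the layout can tolerate.

    Assumption (illustrative only): each shard fails independently with
    probability p_fail and no repair happens during the window considered.
    """
    return sum(
        comb(total_shards, lost)
        * p_fail**lost
        * (1 - p_fail) ** (total_shards - lost)
        for lost in range(tolerated_losses + 1, total_shards + 1)
    )


# 10+4 erasure coding: 14 shards, any 4 may be lost.
ec_overhead = storage_overhead(10, 4)
ec_loss = loss_probability(14, 4, p_fail=0.01)

# 3-way replication modeled as 1 data copy + 2 extra copies: any 2 may be lost.
rep_overhead = storage_overhead(1, 2)
rep_loss = loss_probability(3, 2, p_fail=0.01)

print(f"EC 10+4:     overhead {ec_overhead:.1f}x, loss probability {ec_loss:.2e}")
print(f"3x replicas: overhead {rep_overhead:.1f}x, loss probability {rep_loss:.2e}")
```

Under these assumptions the erasure-coded layout stores less than half the raw bytes of 3-way replication while tolerating more simultaneous shard losses. Scrubbing and repair matter because they shorten the window in which several failures can accumulate.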

Consistency

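  • Modern S3 (since December 2020): strong read-after-write consistency for new objects, overwrites, deletes, and list operations, within each region. Cross-region replication remains asynchronous.
  • Historically, overwrites and deletes were eventually consistent.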

Multipart upload

  • Initiate → upload parts (parallel) → complete (server assembles).
  • Allows resumable uploads of TB-scale objects.
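
The initiate, upload parts, complete flow can be sketched against an in-memory stand-in. Everything here is hypothetical and for illustration only: `InMemoryMultipartStore`, its method names, and the SHA-256 part tags are assumptions, not the S3 API or any SDK's interface.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class InMemoryMultipartStore:
    """Hypothetical stand-in for an object store's multipart endpoints."""

    def __init__(self):
        self._uploads = {}  # upload_id -> {"key": str, "parts": {number: bytes}}
        self.objects = {}   # key -> assembled bytes
        self._next_id = 0

    def initiate(self, key):
        """Open an upload session and return its id."""
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._uploads[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        """Store one part; re-sending the same part number overwrites it."""
        self._uploads[upload_id]["parts"][part_number] = data
        return hashlib.sha256(data).hexdigest()  # illustrative part tag

    def complete(self, upload_id, part_numbers):
        """Assemble the parts server-side, in the order the client lists them."""
        upload = self._uploads.pop(upload_id)
        self.objects[upload["key"]] = b"".join(
            upload["parts"][number] for number in part_numbers
        )


def multipart_upload(store, key, payload, part_size, workers=4):
    """Split payload into parts, upload them in parallel, then complete."""
    upload_id = store.initiate(key)
    parts = [
        (index + 1, payload[offset:offset + part_size])
        for index, offset in enumerate(range(0, len(payload), part_size))
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tags = list(pool.map(lambda p: store.upload_part(upload_id, p[0], p[1]), parts))
    store.complete(upload_id, [number for number, _ in parts])
    return tags


store = InMemoryMultipartStore()
payload = bytes(range(256)) * 100
tags = multipart_upload(store, "backups/db.dump", payload, part_size=4096)
print(len(tags), store.objects["backups/db.dump"] == payload)
```

Because each part is addressed by its number, a client that is interrupted only needs to re-send the parts that did not finish before calling complete, which is what makes the upload resumable.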

Lifecycle & tiering

  • Example transition rules: objects older than 30 days move to IA, older than 90 days move to Glacier, and older than 365 days are deleted.
  • Restore from cold tiers takes minutes to hours; pay-per-restore.
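
The example thresholds above can be expressed as a small rule evaluator. This is a minimal sketch: the `RULES` table, the action names, and `lifecycle_action` are hypothetical, and real lifecycle engines are configured per bucket or prefix rather than hard-coded.

```python
from datetime import date, timedelta

# (minimum age in days, action), checked from the oldest threshold down
# so that the most advanced applicable rule wins.
RULES = [
    (365, "DELETE"),
    (90, "GLACIER"),
    (30, "IA"),
]


def lifecycle_action(created: date, today: date) -> str:
    """Return the tier or action that applies to an object of this age."""
    age_days = (today - created).days
    for min_age, action in RULES:
        if age_days > min_age:  # strictly older than the threshold
            return action
    return "STANDARD"


today = date(2024, 6, 1)
for age in (10, 31, 91, 366):
    print(age, lifecycle_action(today - timedelta(days=age), today))
```

Checking the thresholds from oldest to newest keeps the rules from shadowing each other: a 400-day-old object matches all three, and only the delete rule should apply.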

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

| Tag | Concept | What it is | Page |
|-----|---------|------------|------|
| HLD | Load balancer / GSLB | L4/L7 traffic distribution and failover | load-balancer |
| HLD | Sharding | horizontal partitioning across nodes | database-sharding |
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
| HLD | Realtime protocols | WS / SSE / polling / gRPC streaming | realtime-protocols |
| HLD | Multi-region & DR | RTO / RPO, active-active, failover | multi-region-dr |
| LLD | REST API design | verbs, statuses, pagination, errors | rest-api-design |