Document Database (MongoDB-style) — Detailed#
flowchart TB
subgraph Client[App tier]
APP[App]
DRV([Smart driver<br/>topology aware])
end
subgraph Routing[Routing - mongos]
MONGOS[Router]
CFG[(Config servers<br/>3-5 Raft / RS)]
BAL[Balancer<br/>chunk migration]
end
subgraph Shard1[Shard 1 - replica set]
P1[Primary]
S1A[Secondary]
S1B[Secondary]
ARB1[Arbiter]
end
subgraph Shard2[Shard 2 - replica set]
P2[Primary]
S2A[Secondary]
S2B[Secondary]
end
subgraph Shard3[Shard 3 - replica set]
P3[Primary]
S3A[Secondary]
S3B[Secondary]
end
subgraph Engine[Storage engine - WiredTiger]
BPTREE[B-Tree default]
LSMOPT[LSM optional]
MMAP[mmap legacy]
CACHE[Page cache]
JOURNAL[WAL / Journal]
SNAP[MVCC snapshots]
end
subgraph Repl[Replication]
OPLOG[Oplog idempotent ops]
HEARTB[Heartbeats every 2s]
ELECT[Election via Raft]
READPREF[Read preferences<br/>primary / nearest / secondary]
WC[Write concern w-majority]
RC[Read concern majority / linearizable]
end
subgraph Index[Indexing]
BIDX[B-tree single field]
COMP[Compound]
TEXT[Text index]
GEO[2dsphere geo]
PART[Partial]
HASHED[Hashed]
TTL[TTL]
VEC[Vector / Atlas Search]
end
subgraph CDC[Change Streams]
OPL2[Oplog tail]
CHG[Change Streams API]
DOWN[Downstream: search, cache, analytics]
end
subgraph Features
AGG[Aggregation framework]
JOIN[$lookup join]
TX[Multi-doc transactions<br/>snapshot isolation across shards]
GRIDFS[GridFS - large files]
end
APP --> DRV --> MONGOS
MONGOS --- CFG
MONGOS --> Shard1
MONGOS --> Shard2
MONGOS --> Shard3
BAL -.chunk move.-> Shard1
P1 -. oplog .-> S1A
P1 -. oplog .-> S1B
P2 -. oplog .-> S2A
P3 -. oplog .-> S3A
Engine --- Shard1
Repl --- Shard1
Index --- Shard1
OPL2 --- P1
OPL2 --> CHG --> DOWN
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class DRV client;
class APP,MONGOS,BAL,P1,S1A,S1B,ARB1,P2,S2A,S2B,P3,S3A,S3B,BPTREE,LSMOPT,MMAP,JOURNAL,SNAP,OPLOG,HEARTB,ELECT,READPREF,WC,RC,BIDX,COMP,TEXT,GEO,PART,HASHED,TTL,VEC,OPL2,CHG,DOWN,AGG,JOIN,TX,GRIDFS service;
class CFG datastore;
class CACHE cache;
Sharding#
- Shard key chosen at collection level (hashed or ranged).
- Data split into chunks (~64–128 MB); balancer migrates chunks for even load.
- Cross-shard queries scatter-gather; targeted queries hit one shard if shard key supplied.
Replication & elections#
- Replica set = 3 to 7 nodes; one primary, rest secondaries.
- Oplog is a capped collection containing idempotent ops.
- Elections via Raft-like protocol; replica with highest oplog wins.
Transactions#
- Multi-document, multi-collection transactions across shards (since 4.2).
- Snapshot isolation; cost is non-trivial compared to single-doc updates.
Indexes#
- Lots of types; choice dominates query latency.
- Indexes are per-shard; a global secondary index requires app-level work.
Change Streams (built-in CDC)#
- Tails the oplog and exposes a typed cursor.
- Use for cache invalidation, search sync, downstream services.
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
HLD |
Sharding | horizontal partitioning across nodes | database-sharding |
HLD |
Cache strategies | cache-aside, read/write-through, eviction | caching-strategies |
HLD |
CAP / PACELC | C vs A under partition; L vs C otherwise | cap-pacelc |
HLD |
Raft / Paxos consensus | replicated state machine via majority quorum | consensus-raft-paxos |
HLD |
Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
HLD |
LSM vs B-Tree engines | WAL, memtable, SSTables, compaction | storage-engines-lsm-btree |
HLD |
MVCC & isolation levels | snapshot isolation, serializability, vacuum | mvcc-isolation-levels |
HLD |
Change Data Capture | WAL/binlog tailing, outbox publishing | change-data-capture |
LLD |
REST API design | verbs, statuses, pagination, errors | rest-api-design |