Gmail-style Email Service: Detailed

flowchart TB
  subgraph Inbound[Inbound path]
    DNS[MX records]
    MTA_IN[Inbound MTA<br/>SMTP port 25]
    TLS[STARTTLS / MTA-STS]
    AUTHV[SPF / DKIM / DMARC / ARC]
    SPAM([Spam classifier<br/>Bayes + ML])
    VIRUS[Malware scan]
    GREY[Greylisting / rate limit]
    ROUTE[Address rewriter / aliases]
  end

  subgraph Storage[Storage Layer]
    BIG[(Bigtable / KV<br/>per-user mailbox)]
    OBJ[(Attachment store)]
    DEDUP[Content dedup<br/>shared blob refs]
    META[(Mailbox metadata)]
    LABEL[Labels / Folders model]
  end

  subgraph Search
    IDX[(Inverted Index<br/>per user)]
    REALTIME[Real-time indexer]
  end

  subgraph User[User access]
    WEB([Web UI])
    IMAP[IMAP / POP / JMAP]
    API[Gmail API]
    PUSH((Push - APNS / FCM))
  end

  subgraph Outbound
    COMPOSE[Compose]
    QUEUE[[Outbound queue]]
    MTA_OUT[Outbound MTA pool]
    DKIM_SIGN[DKIM signing]
    BOUNCE[Bounce / DSN handling]
    REPUTE[IP / domain reputation]
  end

  subgraph Features
    THREAD[Threading - by Subject / Refs]
    PRIO[[Priority Inbox]]
    SMART[Smart Reply / Smart Compose]
    PHISH[Phishing detection]
    LABEL2[Filter / Rule engine]
  end

  subgraph Ops
    QUOTA([Per-user quota])
    RETN[Retention / Deletion]
    AUDIT[Audit log]
  end

  Internet --> DNS --> MTA_IN
  MTA_IN --> TLS --> AUTHV --> GREY --> SPAM --> VIRUS --> ROUTE --> BIG
  BIG --> REALTIME --> IDX
  BIG --> META
  OBJ --- BIG
  DEDUP --- OBJ
  User --> WEB
  WEB --> BIG
  WEB --> IDX
  COMPOSE --> QUEUE --> DKIM_SIGN --> MTA_OUT --> Internet
  MTA_OUT --> BOUNCE
  BOUNCE --> COMPOSE
  REPUTE --- MTA_OUT
  Features --- BIG
  Ops --- BIG

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class WEB,QUOTA client;
    class DNS,MTA_IN,TLS,AUTHV,VIRUS,GREY,ROUTE,DEDUP,LABEL,REALTIME,IMAP,API,COMPOSE,MTA_OUT,DKIM_SIGN,BOUNCE,REPUTE,THREAD,SMART,PHISH,LABEL2,RETN service;
    class BIG,OBJ,META,IDX datastore;
    class QUEUE,PRIO queue;
    class SPAM compute;
    class PUSH external;
    class AUDIT obs;

Mailbox storage

  • Per-user mailbox in a sharded KV (Gmail historically on Bigtable; Yahoo on Cassandra-ish).
  • Message keyed by (user_id, msg_id); immutable body + mutable flags & labels.
  • Attachments stored once in object store, referenced by content hash (dedup).
  • Search index is per-user inverted index.
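
The storage model above can be sketched in a few lines. This is a minimal in-memory illustration, not the layout of any real mail system: the names `BlobStore`, `MessageBody`, `MessageState`, and `Mailbox` are hypothetical, SHA-256 is an assumed choice of content hash, and plain dicts stand in for the sharded KV and the object store.

```python
import hashlib
from dataclasses import dataclass, field


class BlobStore:
    """Content-addressed attachment store: identical bytes are kept once."""

    def __init__(self):
        self._blobs = {}     # sha256 hex digest -> bytes
        self._refcount = {}  # sha256 hex digest -> number of referencing messages

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        return digest

    def blob_count(self) -> int:
        return len(self._blobs)

    def refcount(self, digest: str) -> int:
        return self._refcount.get(digest, 0)


@dataclass(frozen=True)
class MessageBody:
    """Immutable part of a message, written once at delivery."""
    raw: bytes
    attachment_refs: tuple  # content hashes pointing into the BlobStore


@dataclass
class MessageState:
    """Mutable part: flags and labels change as the user reads and files mail."""
    flags: set = field(default_factory=set)
    labels: set = field(default_factory=set)


class Mailbox:
    def __init__(self, blobs: BlobStore):
        self._blobs = blobs
        self._bodies = {}  # (user_id, msg_id) -> MessageBody
        self._states = {}  # (user_id, msg_id) -> MessageState

    def deliver(self, user_id, msg_id, raw, attachments=()):
        refs = tuple(self._blobs.put(a) for a in attachments)
        key = (user_id, msg_id)
        self._bodies[key] = MessageBody(raw=raw, attachment_refs=refs)
        self._states[key] = MessageState(labels={"INBOX"})
        return key

    def body(self, user_id, msg_id) -> MessageBody:
        return self._bodies[(user_id, msg_id)]

    def state(self, user_id, msg_id) -> MessageState:
        return self._states[(user_id, msg_id)]
```

Delivering the same attachment bytes to two users leaves one blob with a reference count of two, while each user keeps an independent set of flags and labels. A real system would also need to decrement the count on deletion and garbage-collect unreferenced blobs, which this sketch omits.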

Threading

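  • The In-Reply-To / References headers (RFC 5322, which obsoletes RFC 2822) form the thread graph; fall back to the normalized Subject when they are missing.
  • Gmail-style model: conversations carry labels, and a message can have several. Outlook-style model: each message lives in exactly one folder.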
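
A minimal sketch of header-based threading with the Subject fallback follows. The function names and the dict-shaped messages are hypothetical, messages are assumed to arrive in chronological order, and the prefix pattern only covers `Re:`, `Fw:`, and `Fwd:`.

```python
import re

# Leading reply/forward prefixes such as "Re: RE: Fwd:".
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes, collapse whitespace, lowercase."""
    return " ".join(_PREFIX.sub("", subject).split()).lower()


def assign_threads(messages):
    """Map each message_id to a thread id.

    `messages` is an iterable of dicts with keys `message_id`, `subject`,
    and optionally `in_reply_to` (str) and `references` (list of str).
    The thread id is the message_id of the first message in the thread.
    """
    thread_of = {}   # message_id -> thread id
    by_subject = {}  # normalized subject -> thread id
    for msg in messages:
        parents = list(msg.get("references") or [])
        if msg.get("in_reply_to"):
            parents.append(msg["in_reply_to"])
        # Prefer the nearest ancestor that has already been seen.
        thread_id = next(
            (thread_of[p] for p in reversed(parents) if p in thread_of), None
        )
        subject_key = normalize_subject(msg.get("subject") or "")
        if thread_id is None and subject_key:
            # Fallback: no usable headers, so match on normalized Subject.
            thread_id = by_subject.get(subject_key)
        if thread_id is None:
            thread_id = msg["message_id"]  # starts a new thread
        thread_of[msg["message_id"]] = thread_id
        if subject_key:
            by_subject.setdefault(subject_key, thread_id)
    return thread_of
```

The Subject fallback can merge unrelated messages that happen to share a subject, so a production system would likely bound it, for example by a time window or by participant overlap.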

Anti-spam stack

  1. Connection-time checks (RBL, rate, greylisting).
  2. Auth: SPF (sender IP allowed), DKIM (signature match), DMARC (policy alignment), ARC (forwarded chain).
  3. Content classifier (Bayes + ML + reputation).
  4. User feedback ("Report spam") fed back into models.
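
The first three stages can be sketched as a short-circuiting pipeline. Everything here is illustrative: `InboundMessage`, `classify`, the verdict strings, and the `0.8` threshold are hypothetical, the SPF and DKIM results are assumed to have been computed upstream, and the organizational domain is approximated as the last two labels (real implementations consult the Public Suffix List). Treating a DMARC failure as "quarantine" is also an assumption, since the actual disposition depends on the sender's published policy.

```python
from dataclasses import dataclass, field


def org_domain(domain: str) -> str:
    """ASSUMPTION: organizational domain = last two labels.
    Production code should use the Public Suffix List instead."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


@dataclass
class InboundMessage:
    client_ip: str
    envelope_from_domain: str   # domain checked by SPF (MAIL FROM)
    header_from_domain: str     # domain in the visible From: header
    spf_pass: bool              # SPF result, computed upstream
    dkim_pass_domains: list = field(default_factory=list)  # d= of valid signatures
    body: str = ""


def dmarc_pass(msg: InboundMessage) -> bool:
    """DMARC passes if SPF or DKIM passes AND aligns with the From: domain
    (relaxed alignment: organizational domains match)."""
    want = org_domain(msg.header_from_domain)
    spf_ok = msg.spf_pass and org_domain(msg.envelope_from_domain) == want
    dkim_ok = any(org_domain(d) == want for d in msg.dkim_pass_domains)
    return spf_ok or dkim_ok


def classify(msg, blocklist, spam_score, threshold=0.8):
    """Run the stages in order and stop at the first one that objects."""
    # 1. Connection-time check (an RBL lookup, reduced to a set here).
    if msg.client_ip in blocklist:
        return "reject"
    # 2. Authentication.
    if not dmarc_pass(msg):
        return "quarantine"
    # 3. Content classifier, passed in as a callable returning 0.0..1.0.
    if spam_score(msg.body) >= threshold:
        return "spam"
    return "inbox"
```

Ordering the cheap checks first means most abusive traffic is dropped before the comparatively expensive classifier runs. Step 4, user feedback, is not modeled here; it would update whatever sits behind `spam_score`.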

Outbound reputation

  • Warm up new sending IPs gradually; apply SPF + DKIM on every send.
  • Process bounces and remove hard-bouncing addresses (list hygiene).

Search

  • Real-time indexing on receive (within seconds).
  • Per-user inverted index avoids cross-tenant leaks.
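
A per-user inverted index can be sketched as a two-level map. The class name `PerUserIndex`, the simple alphanumeric tokenizer, and the AND-only query semantics are assumptions made for illustration; ranking (for example BM25) is left out.

```python
import re
from collections import defaultdict

_TOKEN = re.compile(r"[a-z0-9]+")


class PerUserIndex:
    """Inverted index partitioned by user: user_id -> term -> {msg_id}."""

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(set))

    def index(self, user_id, msg_id, text):
        """Called on receive so new mail becomes searchable right away."""
        for term in set(_TOKEN.findall(text.lower())):
            self._postings[user_id][term].add(msg_id)

    def search(self, user_id, query):
        """Return msg_ids containing every query term, for this user only."""
        terms = _TOKEN.findall(query.lower())
        if not terms:
            return set()
        # Only this user's postings are ever consulted, so a query cannot
        # surface another user's messages.
        user_postings = self._postings.get(user_id, {})
        posting_sets = [user_postings.get(t, set()) for t in terms]
        return set.intersection(*posting_sets)
```

Because the user id is the outer key, isolation is structural rather than a filter applied after the lookup, and the same key is a natural shard boundary.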

Scale notes

  • Billions of mailboxes; some > 10 GB.
  • Attachment dedup saves significant storage (e.g. the same file forwarded along a chain of recipients is stored once).
  • IMAP IDLE keeps connections open for push-style updates → many long-lived sockets per server.

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

| Tag | Concept | What it is | Page |
| --- | --- | --- | --- |
| HLD | Search internals | inverted index, BM25, embeddings, ANN | search-internals |
| LLD | Immutability | immutable types, persistent collections | immutability |