MVCC in MongoDB with WiredTiger

I wrote about MVCC in Postgres, MySQL (InnoDB), and CockroachDB, and now I want to understand how it works in WiredTiger, the storage engine behind MongoDB.

MVCC Recap

MVCC, or Multi-Version Concurrency Control, keeps multiple versions of a row and gives each transaction a snapshot, so readers and writers can proceed without blocking each other.

xmin/xmax (Postgres)

Postgres uses xmin and xmax fields to track the visibility of rows. Each row version (tuple) has an xmin, set to the ID of the transaction that created it, and an xmax, set to the ID of the transaction that deleted or superseded it (zero while the version is still live).

CockroachDB - timestamps + HLC

CockroachDB uses hybrid logical clock (HLC) timestamps to track the visibility of rows. Each version of a row is stored under its key plus the HLC timestamp of the transaction that wrote it, and a delete is written as a tombstone version at the deleting transaction's timestamp.

InnoDB (MySQL)

InnoDB uses undo logs to keep track of row versions. Each transaction is assigned a transaction ID. The clustered index always holds the latest version of a row, stamped with the ID of the transaction that last modified it (the hidden DB_TRX_ID column) and a roll pointer (DB_ROLL_PTR) into the undo log. Older versions are not stored as full rows. A reader whose snapshot cannot see the latest version follows the roll pointer and applies undo records until it reaches a version it is allowed to see.

MongoDB - WiredTiger, why MVCC came late, what it enables

MongoDB delegates storage to WiredTiger, a pluggable storage engine it acquired in 2014 and made the default in MongoDB 3.2. WiredTiger brings its own MVCC implementation, its own concurrency model, and its own cleanup story. MongoDB’s transaction and isolation semantics sit on top of it.

WiredTiger is a key-value store backed by B-trees. MongoDB maps each collection to a WiredTiger B-tree whose key is an internal RecordId and whose value is the BSON-encoded document. The _id index is a separate B-tree that maps _id to that RecordId (clustered collections, which key the table by _id directly, are the exception). Every value in WiredTiger is stored with a timestamp. When a document is updated, WiredTiger writes the new version at the transaction’s commit timestamp and retains the old version at its original timestamp. Readers see the version that existed at their read timestamp. This is structurally similar to CockroachDB’s approach: versions are keyed by (key, timestamp) rather than (key, transaction ID), and for the same reason. Timestamps compose more naturally than integer IDs when multiple components need to agree on ordering.

Snapshots and Read Timestamps

When a read operation starts in MongoDB, WiredTiger assigns it a read timestamp. The read sees all committed versions at or before that timestamp and ignores everything newer.

For single-document operations, this happens implicitly. For multi-document transactions, you control it explicitly through read concern:

| Read Concern | Meaning |
| --- | --- |
| local | Most recent committed data on this node, no timestamp coordination |
| majority | Data acknowledged by a majority of replica set members |
| snapshot | Consistent snapshot across all documents in the transaction, at the same timestamp |

snapshot read concern is what gives MongoDB consistent point-in-time reads across all documents in a multi-document transaction. Without it, a transaction reading multiple documents could see them at different points in time. Note that this is snapshot isolation, not serializability - write skew anomalies are still possible.
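To make the write skew caveat concrete, here is a minimal sketch of a toy snapshot-isolation model in plain JavaScript. Everything in it (`docs`, `begin`, `commit`, `goOffCall`, the on-call scenario) is made up for illustration and is not how MongoDB or WiredTiger is implemented. In particular, the toy checks conflicts at commit time, whereas WiredTiger raises them at write time. The point is only that checking write-write conflicts is not enough to protect an invariant that spans two documents.

```javascript
// Toy snapshot-isolation model (hypothetical names; not MongoDB's implementation).
const docs = { alice: { onCall: true }, bob: { onCall: true } };
let commitCounter = 0;
const lastCommit = { alice: 0, bob: 0 }; // commit number of the last write per key

function begin() {
  // Each transaction reads from a private copy taken when it starts.
  return {
    snapshot: JSON.parse(JSON.stringify(docs)),
    startedAt: commitCounter,
    writes: {},
  };
}

function commit(txn) {
  // Only write-write conflicts are checked: abort if a key we wrote
  // was committed by another transaction after our snapshot was taken.
  for (const key of Object.keys(txn.writes)) {
    if (lastCommit[key] > txn.startedAt) return false;
  }
  commitCounter += 1;
  for (const [key, value] of Object.entries(txn.writes)) {
    docs[key] = value;
    lastCommit[key] = commitCounter;
  }
  return true;
}

// Rule each transaction tries to enforce: someone must stay on call.
function goOffCall(txn, who) {
  const onCallCount = Object.values(txn.snapshot).filter((d) => d.onCall).length;
  if (onCallCount >= 2) txn.writes[who] = { onCall: false };
}

const t1 = begin();
const t2 = begin();
goOffCall(t1, "alice"); // sees two people on call in its snapshot
goOffCall(t2, "bob");   // also sees two people on call in its snapshot
const t1Committed = commit(t1);
const t2Committed = commit(t2); // different key, so no write-write conflict
console.log(t1Committed, t2Committed, docs);
```

Both transactions commit, and nobody is left on call. Each one made a decision that was valid against its own snapshot, and because they wrote different documents, no conflict was raised.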

sequenceDiagram
    participant T1 as Txn 1 (Writer)
    participant WiredTiger
    participant T2 as Txn 2 (Reader, snapshot)

    T2->>WiredTiger: BEGIN (snapshot at ts=100)
    T1->>WiredTiger: UPDATE { balance: 500 } at ts=105
    T1->>WiredTiger: COMMIT
    T2->>WiredTiger: find({ _id: "acct1" })
    Note over WiredTiger: T2 reads ts=100 - returns old version { balance: 400 }
    T2->>WiredTiger: COMMIT

T2 sees the document as it existed at ts=100, even though T1 committed a newer version at ts=105. Same guarantee as PostgreSQL’s repeatable read, achieved through timestamps rather than transaction ID snapshots.
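The scenario above can be sketched with a toy versioned store. This is an illustrative model only: `VersionedStore`, `commit`, and `readAt` are hypothetical names, not WiredTiger APIs, and the timestamp of the original version (ts=90) is an assumption, since the diagram does not say when it was written.

```javascript
// Toy multi-version store: each key maps to a list of { ts, value } versions.
// VersionedStore and its methods are hypothetical names, not WiredTiger APIs.
class VersionedStore {
  constructor() {
    this.versions = new Map(); // key -> array of { ts, value }, sorted by ts
  }

  // Record a committed version of `key` at commit timestamp `ts`.
  commit(key, ts, value) {
    const list = this.versions.get(key) ?? [];
    list.push({ ts, value });
    list.sort((a, b) => a.ts - b.ts);
    this.versions.set(key, list);
  }

  // Return the newest version committed at or before `readTs`, or undefined.
  readAt(key, readTs) {
    const list = this.versions.get(key) ?? [];
    let visible;
    for (const v of list) {
      if (v.ts <= readTs) visible = v.value;
      else break;
    }
    return visible;
  }
}

const store = new VersionedStore();
store.commit("acct1", 90, { balance: 400 });  // pre-existing version (assumed ts)
store.commit("acct1", 105, { balance: 500 }); // T1's update

const t2View = store.readAt("acct1", 100); // T2's snapshot at ts=100
const latest = store.readAt("acct1", 110); // a reader that starts later
console.log(t2View.balance, latest.balance); // 400 500
```

The reader never blocks and never sees a half-applied write: it simply picks the newest version whose timestamp is not after its own.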

Write Conflicts

Unlike PostgreSQL, which detects write-write conflicts when a second transaction attempts to modify an already-locked row (blocking until the first commits or aborts), WiredTiger detects them eagerly. If two transactions attempt to modify the same document, the second one to try gets a WriteConflict error immediately. It does not wait.

MongoDB retries WriteConflict errors automatically for implicit (single-operation) transactions. For explicit multi-statement transactions, the error is returned to the application, which must retry the entire transaction.
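For the explicit-transaction case, a retry loop might look like the sketch below. `runWithRetry` and `fakeWriteConflict` are hypothetical helpers written for this post. The only driver behavior assumed is that errors from the official drivers expose `hasErrorLabel()` and that a write conflict inside a transaction carries the `TransientTransactionError` label. The Node.js driver's `session.withTransaction()` already performs this kind of retry for you, so treat this as an illustration of the idea rather than something you need to write yourself.

```javascript
// Hypothetical helper: rerun `txnBody` while it fails with a transient label.
async function runWithRetry(txnBody, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await txnBody(attempt);
    } catch (err) {
      const transient =
        typeof err.hasErrorLabel === "function" &&
        err.hasErrorLabel("TransientTransactionError");
      if (!transient || attempt >= maxAttempts) throw err;
      // Otherwise loop and run the whole transaction body again.
    }
  }
}

// Stand-in for a driver error, used only to exercise the helper.
function fakeWriteConflict() {
  const err = new Error("WriteConflict");
  err.hasErrorLabel = (label) => label === "TransientTransactionError";
  return err;
}

// Demo: the body fails twice, then succeeds on the third attempt.
const demo = runWithRetry(async (attempt) => {
  if (attempt < 3) throw fakeWriteConflict();
  return `committed on attempt ${attempt}`;
});
demo.then((result) => console.log(result));
```

In a real application the body would start a transaction on a session, do its reads and writes, and commit. The important detail is that the whole body is rerun, not just the statement that failed.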

This is worth knowing for schema design. Documents that aggregate high-contention data (counters, running totals, queues implemented as arrays, etc.) can produce high write conflict rates under concurrent load. The usual fix is either sharding the counter across multiple documents and summing them at read time, or using the $inc operator on isolated fields where atomic increment semantics are sufficient.
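A sharded counter can be sketched as follows. The document layout (`pageviews:0`, `pageviews:1`, ...), the `count` field, the shard count of 8, and the helper names are all assumptions made for illustration. The helpers only build the filter and update documents, so they run without a database.

```javascript
// Hypothetical layout: one logical counter split across N documents,
// e.g. { _id: "pageviews:0", count: 12 }, { _id: "pageviews:1", count: 7 }, ...
const SHARD_COUNT = 8; // assumed value; tune it to the write concurrency you see

// Build the filter and update for one increment, aimed at a random shard.
function incrementOp(counterName, amount = 1, random = Math.random) {
  const shard = Math.floor(random() * SHARD_COUNT);
  return {
    filter: { _id: `${counterName}:${shard}` },
    update: { $inc: { count: amount } },
    options: { upsert: true },
  };
}

// Sum the shard documents at read time to recover the logical value.
function totalFrom(shardDocs) {
  return shardDocs.reduce((sum, doc) => sum + doc.count, 0);
}

const op = incrementOp("pageviews", 1, () => 0.5); // fixed "random" for the demo
console.log(op.filter._id); // pageviews:4
console.log(
  totalFrom([
    { _id: "pageviews:0", count: 12 },
    { _id: "pageviews:4", count: 7 },
  ])
); // 19
```

With a driver or mongosh, the result would be passed along as `collection.updateOne(op.filter, op.update, op.options)`. Concurrent writers then land on different documents most of the time, at the cost of a slightly more expensive read.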

Multi-Document Transactions

MongoDB added multi-document ACID transactions in version 4.0 (2018), and extended them to sharded clusters in 4.2. Before that, atomicity was only guaranteed for single-document operations. This shaped MongoDB’s data modeling conventions. Because you couldn’t rely on cross-document transactions, MongoDB’s documentation for years recommended embedding related data in a single document rather than normalizing it across multiple collections. A query that would require joining three tables in Postgres could be a single document read in MongoDB.

That recommendation is still often valid, but the reasoning behind it matters. Embedding is not just about performance, it was originally about atomicity. Now that multi-document transactions exist, embedding vs. referencing is a genuine trade-off rather than a safety constraint.

| Feature | Embedding (Denormalization) | Referencing (Normalization) |
| --- | --- | --- |
| Read Performance | Fast: Single disk I/O, no joins needed. | Slower: Requires $lookup or multiple queries. |
| Atomicity | Native: Single-document updates are always atomic. | Transactional: Requires explicit multi-doc transactions. |
| Data Integrity | Risk: Data duplication can lead to inconsistencies. | Clean: Single source of truth for every entity. |
| Growth Potential | Limited: Risk of hitting the 16MB BSON limit. | Unlimited: Relationships can scale indefinitely. |
| Best For | One-to-few, “part-of” relationships. | One-to-many, many-to-many, or shared entities. |

The Oplog and MVCC

Every write in MongoDB (insert, update, delete) is recorded in the oplog, a capped collection in the local database. The oplog is how replica set members replicate writes from the primary to secondaries.

WiredTiger’s MVCC and the oplog interact in a specific way: a write is not visible to other operations until it is both committed in WiredTiger and written to the oplog. This ensures that replication and visibility are consistent - a secondary will never be asked to apply an operation the primary hasn’t fully committed.

This is also why MongoDB’s write concern matters for MVCC semantics:

| Write Concern | Meaning |
| --- | --- |
| { w: 1 } | Acknowledged once the primary commits |
| { w: "majority" } | Acknowledged once a majority of replica set members have applied it |

A reader using majority read concern will only see writes that have crossed that threshold.
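The interaction between the two settings can be sketched with a toy three-member replica set. This is a deliberately simplified model with hypothetical names (`appliedTs`, `majorityCommitPoint`, `isVisible`). Real replication also involves terms, elections, and durability, none of which appear here.

```javascript
// Toy replica set: each member tracks the highest oplog timestamp it has applied.
// Names are hypothetical; real replication also involves terms and elections.
const appliedTs = { primary: 0, secondaryA: 0, secondaryB: 0 };

// The majority commit point is the newest timestamp that a majority
// of members have reached.
function majorityCommitPoint(members) {
  const sorted = Object.values(members).sort((a, b) => b - a); // descending
  const majority = Math.floor(sorted.length / 2) + 1;
  return sorted[majority - 1];
}

// "local" reads see whatever the queried node has; "majority" reads
// see only writes at or before the majority commit point.
function isVisible(writeTs, readConcern, node = "primary") {
  const horizon =
    readConcern === "majority" ? majorityCommitPoint(appliedTs) : appliedTs[node];
  return writeTs <= horizon;
}

appliedTs.primary = 10; // a { w: 1 } write at ts=10 is acknowledged at this point
const localSeesIt = isVisible(10, "local");
const majoritySeesItEarly = isVisible(10, "majority");

appliedTs.secondaryA = 10; // replication catches up on one secondary
const majoritySeesItLater = isVisible(10, "majority");
console.log(localSeesIt, majoritySeesItEarly, majoritySeesItLater); // true false true
```

The write is acknowledged and locally visible as soon as the primary has it, but a majority reader only sees it once a second member has caught up.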

Garbage Collection - History Store

WiredTiger keeps old document versions in a structure called the history store (called the lookaside table before MongoDB 4.4 / WiredTiger 10.0). When no active transaction needs an old version anymore, WiredTiger discards it.

Unlike PostgreSQL’s autovacuum - which has to scan heap files on disk to find and reclaim dead tuples - WiredTiger’s cleanup is tighter because old versions are managed in a purpose-built structure rather than scattered across the main storage file.

The failure mode is still the same. A long-running transaction holds its read timestamp open. WiredTiger cannot discard versions at or after that timestamp. The history store grows. On a heavily written collection under a long-running snapshot, this can become significant.
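A toy version of that cleanup rule, under the assumption that a version becomes garbage once a newer version exists that every active reader can already see. `collectGarbage` and the sample history are hypothetical. WiredTiger's real logic tracks more state than a list of read timestamps.

```javascript
// Toy cleanup rule (hypothetical names): a version can be dropped once a newer
// version exists that every active reader can already see.
function collectGarbage(versions, activeReadTimestamps, latestTs) {
  // With no active readers, only the newest state must stay readable.
  const oldestReader = activeReadTimestamps.length
    ? Math.min(...activeReadTimestamps)
    : latestTs;
  const sorted = [...versions].sort((a, b) => a.ts - b.ts);
  // Keep the newest version at or before the oldest reader, plus everything after it.
  let firstNeeded = 0;
  sorted.forEach((v, i) => {
    if (v.ts <= oldestReader) firstNeeded = i;
  });
  return sorted.slice(firstNeeded);
}

const history = [
  { ts: 90, balance: 400 },
  { ts: 105, balance: 500 },
  { ts: 120, balance: 650 },
  { ts: 130, balance: 700 },
];

const pinned = collectGarbage(history, [100], 130); // a reader still at ts=100
const unpinned = collectGarbage(history, [], 130);  // that reader has finished
console.log(pinned.length, unpinned.length); // 4 1
```

While the reader at ts=100 is open, every version survives, including the ones written long after it started. Once it finishes, only the latest version needs to stay.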

MongoDB exposes the relevant cache statistics in the wiredTiger section of the serverStatus output:

db.serverStatus().wiredTiger.cache

Watch these fields:

  • "pages requested from the cache"
  • "tracked dirty bytes in the cache"
  • "bytes currently in the cache"

A history store that’s growing faster than it’s being cleaned up will show up as cache pressure before it shows up as latency.
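As a rough way to turn those fields into a signal, the sketch below computes fill and dirty ratios from a cache statistics document. `summarizeCache` is a hypothetical helper, the 0.8 and 0.2 thresholds are illustrative assumptions rather than official limits, and the sample numbers are made up. It also reads "maximum bytes configured", a field from the same section that is not in the list above.

```javascript
// Summarize a `db.serverStatus().wiredTiger.cache` document.
// The 0.8 / 0.2 thresholds are illustrative assumptions, not official limits.
function summarizeCache(cache, fillWarn = 0.8, dirtyWarn = 0.2) {
  const max = Number(cache["maximum bytes configured"]);
  const used = Number(cache["bytes currently in the cache"]);
  const dirty = Number(cache["tracked dirty bytes in the cache"]);
  const fillRatio = used / max;
  const dirtyRatio = dirty / max;
  return {
    fillRatio,
    dirtyRatio,
    underPressure: fillRatio >= fillWarn || dirtyRatio >= dirtyWarn,
  };
}

// Made-up sample values, shaped like the real output.
const sample = {
  "maximum bytes configured": 1000,
  "bytes currently in the cache": 900,
  "tracked dirty bytes in the cache": 100,
};
const summary = summarizeCache(sample);
console.log(summary);
```

In mongosh this could be called as `summarizeCache(db.serverStatus().wiredTiger.cache)`. Sampling it periodically and watching the trend is more useful than any single reading.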

So What?

Schema design affects concurrency. Documents that mix high-read and high-write fields create unnecessary contention. WiredTiger locks at the document level, not the field level. Splitting a hot counter into its own document reduces the blast radius of write conflicts.

Snapshot read concern has a cost. Taking a snapshot across a transaction pins WiredTiger’s read timestamp for the duration of the transaction. Long-running snapshot transactions on busy collections accumulate history store pressure. Keep multi-document transactions short.

Write concern and read concern are two sides of the same question. Write concern controls when a write is considered durable. Read concern controls which writes are visible to a reader. Mismatching them - writing with { w: 1 } but reading with majority - means readers won’t see recent writes until they’re replicated, which can look like stale reads to the application.

The oplog is the replication source of truth. Change streams, MongoDB’s real-time change notification mechanism, are built on top of the oplog. Understanding that they reflect oplog ordering (not WiredTiger commit ordering) explains some of the latency and ordering guarantees (and limitations) change streams provide.

Compared to PostgreSQL

| Feature | PostgreSQL | MongoDB (WiredTiger) |
| --- | --- | --- |
| Version storage | Dead tuples in heap file | History store (on-disk, purpose-built) |
| Cleanup mechanism | autovacuum | Automatic history store eviction |
| Conflict detection | At lock acquisition (row-level locks) | Eagerly, at point of conflict |
| Write conflict retry | Database waits (lock) | Application retries WriteConflict |
| Long-running txn risk | Table bloat, blocked vacuum | History store growth, cache pressure |

The gap that matters most in practice is the cleanup story. PostgreSQL dead tuples live in the heap file until autovacuum reclaims them. On a busy table with infrequent vacuuming, this means real storage bloat and slower sequential scans. WiredTiger’s history store is purpose-built and discards old versions more aggressively, but has the same fundamental constraint - a long-running transaction blocks cleanup.

Knowledge Check

1. How does WiredTiger key its document versions?
   A. By transaction ID, similar to PostgreSQL's xmin/xmax.
   B. By a monotonically increasing sequence number per collection.
   C. By timestamp, similar to CockroachDB's approach.
   D. By the oplog entry position.
   Answer: C. WiredTiger stores each version at its commit timestamp, and readers see the version that existed at their read timestamp. Keying versions by (key, timestamp) composes naturally across distributed components.

2. What happens when two concurrent transactions try to modify the same document in MongoDB?
   A. The second writer gets a WriteConflict error immediately without waiting.
   B. The second writer blocks until the first transaction commits or aborts.
   C. Both writes succeed and the last one wins.
   D. The transaction with the lower timestamp is automatically aborted.
   Answer: A. WiredTiger uses optimistic concurrency control and detects conflicts eagerly. It does not block the second writer. It fails fast with a WriteConflict, unlike PostgreSQL's row-lock-and-wait approach.

3. What is the WiredTiger history store?
   A. An in-memory cache that stores the most recently accessed documents.
   B. An on-disk B-tree (WiredTigerHS.wt) that holds old document versions.
   C. A section of the oplog dedicated to tracking version history.
   D. A write-ahead log used for crash recovery.
   Answer: B. The history store is a purpose-built, on-disk B-tree (WiredTigerHS.wt) that replaced the lookaside table in MongoDB 4.4. It holds old versions until no active transaction needs them.

4. Why does schema design matter for concurrency in MongoDB?
   A. Because MongoDB locks the entire collection during writes.
   B. Because fields within a document can be locked independently.
   C. Because embedded documents are stored in separate B-trees.
   D. Because WiredTiger's concurrency control operates at the document level, so mixing hot and cold fields in one document creates unnecessary contention.
   Answer: D. Concurrency control is per-document, and WiredTiger cannot distinguish between fields within a document. A write to a hot counter field conflicts with a concurrent update to any other field in the same document.

5. What risk does a long-running snapshot transaction create?
   A. It causes the oplog to stop accepting new entries.
   B. It forces all other transactions to use a lower isolation level.
   C. It pins the read timestamp, preventing WiredTiger from discarding old versions, which leads to history store growth and cache pressure.
   D. It automatically downgrades from snapshot to local read concern after 60 seconds.
   Answer: C. A long-running snapshot holds its read timestamp open, blocking WiredTiger from discarding any versions at or after that timestamp. The history store accumulates old versions, which shows up as cache pressure before it shows up as latency.

6. Which read concern gives MongoDB consistent point-in-time reads across all documents in a multi-document transaction?
   A. local - it reads the most recent data on this node.
   B. majority - it reads data acknowledged by a majority of nodes.
   C. snapshot - it reads all documents at the same timestamp.
   D. linearizable - it provides real-time ordering guarantees.
   Answer: C. snapshot read concern assigns a single read timestamp to the entire transaction, so every document is read at the same point in time. local and majority do not provide cross-document consistency within a transaction.

7. When is a write in MongoDB visible to other operations?
   A. As soon as WiredTiger commits the write to its in-memory cache.
   B. Only after it is both committed in WiredTiger and written to the oplog.
   C. Once the write-ahead log is fsynced to disk.
   D. After the next checkpoint flushes dirty pages.
   Answer: B. Visibility requires both a WiredTiger commit and an oplog entry. This coupling keeps replication and visibility consistent: a secondary will never be asked to apply an operation the primary hasn't fully committed.

8. What happens if you write with { w: 1 } but read with majority read concern?
   A. Readers may not see recent writes until they've replicated to a majority of nodes, which can look like stale reads.
   B. The read will fail with a consistency error.
   C. MongoDB automatically upgrades the write concern to majority.
   D. The read returns the write immediately since it was acknowledged by the primary.
   Answer: A. Write concern and read concern are independent. { w: 1 } only waits for the primary, while majority read concern only returns writes replicated to a majority. The gap between the two creates a window of apparent staleness.

© 2025 Threads of Thought. Built with Astro.