Understanding B-Trees: The Backbone of Database Indexing
Why B-Trees dominate database storage — from theory to disk I/O optimization.
Every time you run a SQL query with a WHERE clause, there's a good chance a B-Tree is doing the heavy lifting behind the scenes. B-Trees (and their variant B+Trees) are the dominant data structure for database indexing, and understanding them is fundamental to writing performant queries.
Rudolf Bayer and Edward McCreight invented the B-Tree in 1970 at Boeing Research Labs. The "B" likely stands for "balanced" (though this is debated — some say it stands for Bayer, Boeing, or "broad"). Unlike binary search trees, B-Trees are designed for systems that read and write large blocks of data — specifically, disk-based storage systems.
A B-Tree of order m has the following properties: every node has at most m children, every non-root internal node has at least ⌈m/2⌉ children, and all leaves appear at the same depth. This balanced structure guarantees that search, insertion, and deletion all run in O(log n) time, where n is the number of keys.
The crucial insight is the branching factor. A binary search tree with 1 million keys has a height of about 20 — meaning 20 disk seeks per lookup. A B-Tree with a branching factor of 1000 (typical for a 4KB page size with small keys) has a height of about 2 for the same dataset. Two disk seeks instead of twenty. On spinning disks with 10ms seek times, that's the difference between 200ms and 20ms per query.
B+Trees, used by PostgreSQL, MySQL/InnoDB, and SQLite, are a variant where all values are stored in leaf nodes, and leaf nodes are linked in a doubly-linked list. This makes range queries extremely efficient — once you find the starting key, you simply traverse the leaf chain without touching any internal nodes.
When you create an index in PostgreSQL with CREATE INDEX idx ON users(email), you're building a B+Tree where the keys are email values and the leaf nodes contain pointers (ctid in PostgreSQL terms) to the heap tuples. A query like SELECT * FROM users WHERE email = 'berke@example.com' traverses the tree from root to leaf in 2-3 page reads, then follows the pointer to fetch the actual row.
Composite indexes extend this naturally. An index on (last_name, first_name) sorts by last_name first, then by first_name within each last_name group. This means the index supports queries on last_name alone or on (last_name, first_name), but not on first_name alone — the leftmost prefix rule.
Write amplification is the trade-off. Every INSERT requires updating the B-Tree, potentially splitting nodes if they're full. Page splits are expensive: they allocate a new page, redistribute keys, and update parent pointers. This is why write-heavy workloads sometimes prefer LSM-Trees (used by RocksDB, Cassandra, LevelDB), which batch writes into sorted runs and merge them periodically.
Understanding B-Trees isn't just academic — it directly impacts how you design schemas, choose indexes, and reason about query performance. Every EXPLAIN ANALYZE output makes more sense when you can visualize the tree traversal happening beneath it.