Hash Tables Demystified: Collisions, Load Factors, and Real Implementations
The most-used data structure in programming — how it actually works under the hood.
Hash tables are arguably the most important data structure in practical software engineering. Every language has them — dict in Python, map in Go, HashMap in Java/Rust, Object/Map in JavaScript. They provide O(1) average-case lookup, insertion, and deletion. But the "average-case" qualifier hides a wealth of implementation detail.
A hash table is conceptually simple: given a key, compute a hash (an integer), reduce it to an index in an array (typically via modulo), and store or retrieve the value at that index. The challenge is collisions — when two different keys map to the same index.
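The whole pipeline fits in a few lines of Python (a deliberately naive sketch: no collision handling at all, and the table size `SLOTS` is an arbitrary choice for illustration):

```python
SLOTS = 8                         # arbitrary table size for illustration
table = [None] * SLOTS

def put(key, value):
    index = hash(key) % SLOTS     # hash the key, then reduce to an array index
    table[index] = value          # naive: a colliding key silently overwrites

def get(key):
    return table[hash(key) % SLOTS]

put("language", "Python")
```

Everything that follows in this article is, one way or another, about what to do when that `% SLOTS` sends two different keys to the same index.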
Chaining is the classic collision resolution strategy. Each bucket in the array points to a linked list (or in modern implementations, a balanced tree for buckets that exceed a threshold). Java's HashMap switches from a linked list to a red-black tree when a bucket exceeds 8 entries (and back to a list when it shrinks below 6). This prevents pathological O(n) behavior when many keys hash to the same bucket.
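A chained table is short enough to sketch in full (a toy using Python lists as the chains; real implementations use linked nodes or, as above, trees):

```python
class ChainedMap:
    """Toy chained hash table: each bucket is a list of (key, value) pairs."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise chain a new entry

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)
```

With only 2 buckets, ten inserts guarantee collisions, yet every key remains findable; the cost is that each lookup scans its chain, which is exactly the O(n) degradation the red-black-tree fallback guards against.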
Open addressing is the alternative. When a collision occurs, the algorithm probes other positions in the array using a probing sequence — linear probing (check the next slot), quadratic probing (check slot + 1², slot + 2², ...), or double hashing (use a second hash function to determine the stride). Robin Hood hashing improves upon linear probing by keeping track of how far each element is from its ideal position and swapping elements to reduce the maximum probe length.
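Linear probing, the simplest of these, can be sketched directly (a toy with no deletion and no resizing, so it would loop forever if the table filled up):

```python
EMPTY = object()                  # sentinel marking an unused slot

class ProbingMap:
    """Toy open-addressing table with linear probing."""

    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        # Probe forward until we find an empty slot or this key's slot.
        while self.slots[i] is not EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)    # linear probe: try the next slot
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)
```

Deletion is the subtle part omitted here: simply emptying a slot would break probe chains that pass through it, which is why real open-addressing tables use tombstones or backward-shift deletion.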
Go's map implementation is particularly interesting. It uses an array of buckets, each holding 8 key-value pairs. The hash determines the bucket, and within the bucket, the top 8 bits of the hash (called "tophash") are stored separately for fast comparison. This cache-friendly layout means that in the common case, finding a key requires one pointer dereference and a comparison of 8 bytes — very efficient on modern CPUs where cache misses dominate performance.
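The tophash idea can be mimicked in Python (a loose conceptual sketch, not Go's runtime code; Go additionally reserves the smallest tophash values as slot-status flags, which is omitted here):

```python
def split_hash(h, nbuckets):
    """Split a 64-bit hash the way Go's map does, conceptually:
    the low bits choose the bucket, the top 8 bits become a
    one-byte pre-filter stored alongside the bucket's 8 slots."""
    bucket = h & (nbuckets - 1)    # nbuckets is kept a power of two
    tophash = h >> 56              # top 8 bits of a 64-bit hash
    return bucket, tophash

# During lookup, the 8 stored tophash bytes are scanned first; the full
# (possibly expensive) key comparison only runs on a tophash match, so
# most non-matching slots are rejected with a single byte compare.
```

The payoff is locality: the 8 tophash bytes sit contiguously at the front of the bucket, so the common-case lookup touches one cache line before ever comparing keys.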
The load factor (number of entries / number of buckets) determines when to resize. Go resizes at a load factor of 6.5 (because each bucket holds 8 slots, so 6.5/8 ≈ 81% occupancy). Python resizes at 2/3. Java resizes at 0.75. Resizing is expensive — every entry must be rehashed and moved — so implementations typically double the number of buckets, amortizing the cost over future insertions.
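The amortization claim is easy to check numerically (a simulation sketch; the 0.75 threshold and initial size of 8 are borrowed from Java's defaults, not any one implementation's exact logic):

```python
def resize_cost(n_inserts, initial=8, max_load=0.75):
    """Simulate a doubling table: return the final bucket count and the
    total number of entries rehashed across all resizes."""
    buckets, moved = initial, 0
    for count in range(1, n_inserts + 1):
        if count / buckets > max_load:
            moved += count        # every existing entry is rehashed and moved
            buckets *= 2
    return buckets, moved

buckets, moved = resize_cost(100_000)
# Because each resize doubles capacity, the rehash counts form a roughly
# geometric series, and total work stays below 2 * n_inserts: the expensive
# resizes amortize to O(1) extra work per insert.
```

Growing by a smaller factor (say 1.1x) would resize far more often and lose this bound, which is why doubling (or Python's similar growth-by-multiple strategy) is the norm.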
Swiss Table (developed by Google, used in Abseil's flat_hash_map and Rust's HashMap since 1.36) represents the state of the art in open-addressing hash tables. It uses SIMD instructions to check 16 slots in parallel during probing, achieving remarkable performance for both hits and misses. The control byte array (one byte per slot, storing the top 7 bits of the hash plus metadata) fits neatly into a 128-bit SSE register.
A common pitfall: using mutable objects as hash keys. If you modify an object after inserting it into a hash table, its hash (typically computed from its contents) changes, and the table can no longer find it: it is looking in the wrong bucket. This is why Java's contract requires that if you override equals() you must also override hashCode(), so that equal keys always hash equally, and why Python makes lists unhashable.
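The failure mode is easy to reproduce in Python (`Point` is a contrived example class with a deliberately contents-based hash):

```python
class Point:
    """A mutable object with a contents-based hash: a key-use hazard."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash((self.x, self.y))
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
d = {p: "here"}
p.x = 99                  # mutate the key after insertion: its hash changes

assert p not in d         # lookup now probes the wrong bucket
assert Point(1, 2) not in d   # the old value can't find it either (eq fails)
assert len(d) == 1        # the entry still exists, but is unreachable
```

The entry is not deleted, merely stranded: it still occupies memory and still appears in iteration, but no key lookup will ever reach it again.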
For performance-critical code, the choice of hash function matters immensely. Cryptographic hashes (SHA-256) are far too slow for hash tables. Fast non-cryptographic hashes like xxHash, FNV-1a, or wyhash provide excellent distribution with minimal computation. Go uses a variant of AES-based hashing on hardware that supports AES-NI instructions, combining speed and good distribution.
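FNV-1a is the easiest of these to write down: a 64-bit sketch using the standard offset basis and prime (production code would reach for a library binding of xxHash or similar rather than a pure-Python loop):

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: xor each byte in, then multiply by the FNV prime."""
    h = 0xcbf29ce484222325                    # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001b3) % (1 << 64)   # FNV prime, wrapped to 64 bits
    return h
```

The xor-then-multiply order (versus FNV-1's multiply-then-xor) is what gives FNV-1a its better avalanche behavior on short keys, which is the common case for hash table use.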
Hash tables are one of those data structures where the theory is simple but the engineering is deep. Understanding the implementation details of your language's hash map can help you avoid subtle performance traps and make better design decisions.