How a Hash Table Works
Learn how hash tables provide near-constant time O(1) data lookup by mapping keys to array indices using hash functions, and how collisions are managed.
In depth
Hash tables are a fundamental data structure that efficiently store and retrieve data by mapping keys to values. They achieve an average time complexity of O(1) for lookups, insertions, and deletions, making them significantly faster than O(n) linear searches for large datasets.
The Problem with Linear Search
Searching for an item in an unsorted list requires examining, on average, half of the elements. In the worst case, you might have to check every single item. This linear search approach means that as the number of items (n) grows, the search time grows proportionally, resulting in O(n) time complexity.
Direct Access with Hashing
The goal of a hash table is to achieve direct access to data, bypassing the need for sequential searching. It does this by using a special function to transform a data item's key directly into an array index where the item is stored.
The Hash Function
A hash function is a deterministic mathematical process that takes an input key (e.g., a string like 'apple') and converts it into a fixed-size integer value. This integer is then typically used with the modulo operator (`%`) and the size of the underlying array to calculate the precise index where the key-value pair should reside. This calculation is a fixed set of operations, ensuring the index is found instantly, without any scanning or looping.
Handling Collisions
It's possible for two different keys to produce the same hash integer, or for their hash integers to map to the same array index after the modulo operation. This situation is called a collision. A common strategy to resolve collisions is chaining, where each array index (or "bucket") holds a reference to a linked list. When a collision occurs, the new key-value pair is simply added to the linked list at that specific bucket.
The Importance of a Good Hash Function
The performance of a hash table heavily depends on the quality of its hash function. A good hash function distributes keys uniformly across the array, minimizing collisions and keeping the linked lists (chains) at each bucket short. If a hash function is poor, it can lead to clustering, where many keys map to the same few buckets, resulting in long linked lists. In such cases, lookup performance can degrade towards O(n) as the system essentially reverts to searching a linked list.
Load Factor and Resizing
The load factor of a hash table is the ratio of the number of items stored to the number of available buckets (capacity). As the number of items approaches and exceeds the capacity, collisions become more frequent, and the average length of collision chains increases. To maintain O(1) average performance, hash tables are typically resized (rehashed) into a larger array when the load factor exceeds a certain threshold. While resizing itself can be an expensive O(n) operation, it happens infrequently enough that the average cost of operations over time remains amortized O(1).
Pseudocode for Hash Table Lookup
function lookup(key, hashTableArray, arraySize):
hashValue = calculateHash(key) // Step 1: Apply hash function to key
index = hashValue % arraySize // Step 2: Modulo to get array index
bucket = hashTableArray[index] // Step 3: Access the bucket
if bucket is a linked list:
for each item in bucket:
if item.key == key:
return item.value // Step 4a: Search linked list if chaining is used
return null
else if bucket contains a single item:
if bucket.key == key:
return bucket.value // Step 4b: Direct comparison if no collision or open addressing
return null
else:
return nullKey takeaways
- Hash tables use a hash function to map keys directly to array indices.
- This direct access enables average O(1) time complexity for lookups, insertions, and deletions.
- Collisions, where multiple keys map to the same index, are typically handled by storing items in linked lists (chaining) at each bucket.
- A well-designed hash function is crucial for uniform distribution and optimal performance.
- Hash tables are dynamically resized (rehashed) to maintain efficiency as data grows, leading to amortized O(1) performance.
Got a different question? SeaThru generates a fresh video for any topic where systems talk or data structures move.
Ask your own question →