How a Hash Table Works

Learn how hash tables provide near-constant time O(1) data lookup by mapping keys to array indices using hash functions, and how collisions are managed.

Ask your own question

In depth

Hash tables are a fundamental data structure that efficiently store and retrieve data by mapping keys to values. They achieve an average time complexity of O(1) for lookups, insertions, and deletions, making them significantly faster than O(n) linear searches for large datasets.

The Problem with Linear Search

Searching for an item in an unsorted list requires examining, on average, half of the elements. In the worst case, you might have to check every single item. This linear search approach means that as the number of items (n) grows, the search time grows proportionally, resulting in O(n) time complexity.

Direct Access with Hashing

The goal of a hash table is to achieve direct access to data, bypassing the need for sequential searching. It does this by using a special function to transform a data item's key directly into an array index where the item is stored.

The Hash Function

A hash function is a deterministic mathematical process that takes an input key (e.g., a string like 'apple') and converts it into a fixed-size integer value. This integer is then typically used with the modulo operator (`%`) and the size of the underlying array to calculate the precise index where the key-value pair should reside. This calculation is a fixed set of operations, ensuring the index is found instantly, without any scanning or looping.

Handling Collisions

It's possible for two different keys to produce the same hash integer, or for their hash integers to map to the same array index after the modulo operation. This situation is called a collision. A common strategy to resolve collisions is chaining, where each array index (or "bucket") holds a reference to a linked list. When a collision occurs, the new key-value pair is simply added to the linked list at that specific bucket.

The Importance of a Good Hash Function

The performance of a hash table heavily depends on the quality of its hash function. A good hash function distributes keys uniformly across the array, minimizing collisions and keeping the linked lists (chains) at each bucket short. If a hash function is poor, it can lead to clustering, where many keys map to the same few buckets, resulting in long linked lists. In such cases, lookup performance can degrade towards O(n) as the system essentially reverts to searching a linked list.

Load Factor and Resizing

The load factor of a hash table is the ratio of the number of items stored to the number of available buckets (capacity). As the number of items approaches and exceeds the capacity, collisions become more frequent, and the average length of collision chains increases. To maintain O(1) average performance, hash tables are typically resized (rehashed) into a larger array when the load factor exceeds a certain threshold. While resizing itself can be an expensive O(n) operation, it happens infrequently enough that the average cost of operations over time remains amortized O(1).

Pseudocode for Hash Table Lookup

function lookup(key, hashTableArray, arraySize):
    hashValue = calculateHash(key)         // Step 1: Apply hash function to key
    index = hashValue % arraySize          // Step 2: Modulo to get array index
    
    bucket = hashTableArray[index]         // Step 3: Access the bucket
    
    if bucket is a linked list:
        for each item in bucket:
            if item.key == key:
                return item.value          // Step 4a: Search linked list if chaining is used
        return null
    else if bucket contains a single item:
        if bucket.key == key:
            return bucket.value            // Step 4b: Direct comparison if no collision or open addressing
        return null
    else:
        return null

Key takeaways

Hash tables use a hash function to map keys directly to array indices.
This direct access enables average O(1) time complexity for lookups, insertions, and deletions.
Collisions, where multiple keys map to the same index, are typically handled by storing items in linked lists (chaining) at each bucket.
A well-designed hash function is crucial for uniform distribution and optimal performance.
Hash tables are dynamically resized (rehashed) to maintain efficiency as data grows, leading to amortized O(1) performance.

Got a different question? SeaThru generates a fresh video for any topic where systems talk or data structures move.

Ask your own question →