Consistent Hashing Explained
Consistent Hashing Explained
Consistent hashing is a distributed hashing scheme that minimizes key remapping when the number of servers changes, making it ideal for scalable caching systems
Consistent hashing is a specialized distributed hashing algorithm designed to minimize the number of keys remapped when nodes are added or removed from a distributed system. This approach is crucial for maintaining high cache hit rates and system stability in large-scale, distributed environments.
The Problem with Traditional Hashing
Traditional modulo hashing, where a key is mapped to a server using `hash(key) % N` (where N is the number of servers), works well for a fixed number of servers. However, if a server is added or removed, N changes, leading to a complete remapping of almost all keys. This causes a "cache miss storm" as clients try to fetch data from incorrect servers, effectively invalidating the entire cache and putting immense load on the backend database.
How Consistent Hashing Works
Consistent hashing maps both servers and keys onto a large, circular hash space, typically a ring representing the range of a hash function (e.g., 0 to 2^32 or 0 to 2^128). Each server is hashed to a specific point on this ring. Similarly, each data key is also hashed to a point on the same ring.
To determine which server is responsible for a given key, you move clockwise around the ring from the key's position until you encounter the first server. That server is then responsible for storing or retrieving the data associated with that key.
function assign_key_to_server(key, servers_on_ring):
key_hash = hash(key)
# Find the first server clockwise from the key's hash
for server_position in sorted(servers_on_ring.keys()):
if server_position >= key_hash:
return servers_on_ring[server_position]
# If no server found, wrap around to the first server on the ring
return servers_on_ring[min(servers_on_ring.keys())]Minimizing Data Movement
The primary advantage of consistent hashing is its ability to minimize data movement when the server count changes. If a server is added, only the keys that were previously mapped to the next server clockwise (and now fall into the new server's segment) need to be remapped. Similarly, if a server is removed, its keys are simply reassigned to the next server clockwise, affecting only a small portion of the total keys. This localized remapping prevents system-wide cache invalidations.
Virtual Nodes for Even Distribution
To prevent hotspots and ensure a more even distribution of keys, consistent hashing employs the concept of "virtual nodes" (also called "vnodes" or "replicas"). Instead of mapping each physical server to a single point on the ring, each physical server is mapped to multiple points (e.g., 100 or 200 virtual nodes). These virtual nodes are distributed randomly around the ring. This approach helps to smooth out the distribution of keys across physical servers, even with a small number of physical servers, and ensures that adding or removing a physical server results in a more balanced redistribution of load.
Key Takeaways
- Consistent hashing minimizes key remapping when servers are added or removed.
- It maps both keys and servers onto a circular hash space.
- Keys are assigned to the first server encountered moving clockwise on the ring.
- Virtual nodes are used to ensure balanced data distribution and prevent hotspots.
- This technique is fundamental for building scalable and resilient distributed caching systems.
Got a different question? SeaThru generates a fresh video for any topic where systems talk or data structures move.
Ask your own question →