Graphical explanation of the consistent hashing algorithm, just read this article

Author | LemonCoder

Source | Backend Technology Academy

Most readers already know what a hash function is, and "consistent hashing" comes up constantly in backend interviews and development. So what exactly is consistent hashing? The name sounds impressive, but the principle is not complicated. This article will help you understand consistent hashing thoroughly!

Before getting into the topic, let's have an exciting mock interview.

Mock Interview

Interviewer: I see on your resume that you participated in a large project that used a distributed cache cluster. So can you tell me how you did cache load balancing?

Newbie: I know this. We use a round-robin approach. The first key is given to the first storage node, the second key is given to the second, and so on.

Interviewer: Are there any other solutions?

Newbie: You can use a hash function to break up the requests and randomly distribute them to the machines in the cache cluster.

Interviewer: Have you considered the scalability and fault tolerance of this hash-based load balancing?

Newbie:...

Interviewer: Please go home and wait for our notification.

If the above sounds like an interview you have given, consider it copied from me.

What is a hash

In data structures class we learned about the hash table (also called a hash map). Let's review its definition.

A hash table is a data structure that accesses data directly based on its key. By computing a function of the key, the record is mapped to a location in a table, which speeds up lookups. This mapping function is called a "hash function", and the array that stores the records is called a hash table.

A hash function makes access to a collection of data faster and more efficient; it trades space for time. Through the hash function, data elements can be located much more quickly.
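As a tiny illustration (not from the original article), here is a minimal Python sketch of the idea: a made-up hash function maps a string key to an index in a small array, so a lookup jumps straight to the slot instead of scanning the whole table. Collision handling (chaining or open addressing) is omitted for brevity, and the function and names are purely illustrative.

```python
TABLE_SIZE = 8

def simple_hash(key: str) -> int:
    """Map a string key to a slot index in [0, TABLE_SIZE)."""
    return sum(ord(ch) for ch in key) % TABLE_SIZE

# The "hash table" is just an array indexed by the hash of the key.
table = [None] * TABLE_SIZE
table[simple_hash("asdfgh")] = "cached value"

print(simple_hash("asdfgh"), table[simple_hash("asdfgh")])
```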

The following figure shows the process of mapping a string from a hash function to a hash table. Yes, the input string is typed by rolling your face on the keyboard :)

Hash mapping diagram

Common hash algorithms include MD5, CRC, and MurmurHash. Here is a brief introduction to each.

MD5

The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. It converts data (such as a piece of text) into another value of fixed length, which is the basic principle of any hash algorithm. MD5 was designed by the American cryptographer Ronald Linn Rivest, made public in 1992 and standardized in RFC 1321.

CRC algorithm

A Cyclic Redundancy Check (CRC) is a hash function that produces a short, fixed-length checksum from data such as network packets or computer files; it was published by W. Wesley Peterson in 1961. The checksum is computed and appended to the data before transmission or storage, and the receiver recomputes it to determine whether the data has changed. CRC is widely used because it is easy to implement in binary computer hardware, easy to analyze mathematically, and particularly good at detecting errors caused by noise in transmission channels.

MurmurHash

MurmurHash is a non-cryptographic hash function suitable for general hash retrieval operations. It was invented by Austin Appleby in 2008 and has multiple variants. Compared with other popular hash functions, MurmurHash's random distribution characteristics perform better for keys with strong regularity.

The algorithm has been adopted by many open source projects, such as libstdc++ (version 4.6), Perl, nginx (version 1.0.1 and later), Rubinius, libmemcached, maatkit, Hadoop, etc.
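For reference, here is a small Python sketch showing how these algorithms can be invoked. MD5 and CRC32 are available in the standard library; MurmurHash needs a third-party binding such as mmh3, which is an assumption and therefore shown commented out.

```python
import hashlib  # MD5 lives in the standard library
import zlib     # CRC32 lives in the standard library

data = b"hello consistent hashing"

md5_digest = hashlib.md5(data).hexdigest()  # 128-bit digest as 32 hex chars
crc_checksum = zlib.crc32(data)             # unsigned 32-bit checksum

print("MD5:  ", md5_digest)
print("CRC32:", crc_checksum)

# MurmurHash is not in the standard library; one common binding is the
# third-party "mmh3" package (assumption: installed via `pip install mmh3`).
# import mmh3
# print("MurmurHash3:", mmh3.hash(data))  # signed 32-bit hash value
```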

Common hashing methods

Direct addressing method: use the keyword itself, or some linear function of the keyword, as the hash address. There is no single standard for this linear function.

Digital analysis method: assuming the keyword is a number in base r and the keywords that may appear in the table are known in advance, select several digits of the keyword to form the hash address.

Square middle method: use the middle digits of the squared keyword as the hash address. When choosing a hash function we usually do not know the full distribution of the keywords and cannot be sure which digits to pick directly; the middle digits of a square, however, depend on every digit of the original number, so the resulting addresses are well distributed even for irregular keywords. The number of digits taken is determined by the table length.

Folding method: split the keyword into several parts with the same number of digits (the last part may have fewer), then add these parts together (discarding any carry) and use the sum as the hash address.

Modulo method: divide the keyword by a number p not greater than the hash table length m and use the remainder as the hash address, i.e. hash(key) = key % p (p <= m). The modulo can be applied to the keyword directly, or to the result of the folding or square middle methods. The choice of p is important, usually a prime number not greater than m; a poor choice of p makes collisions much more likely. Two of these methods are sketched in code below.
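Here is an illustrative Python sketch of the modulo method and the square middle method; the function names and parameter choices are my own, not from the article.

```python
def modulo_hash(key: int, p: int) -> int:
    """Modulo method: hash(key) = key % p, with p usually a prime <= table length m."""
    return key % p

def square_middle_hash(key: int, digits: int = 3) -> int:
    """Square middle method: square the key and take its middle digits."""
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(0, mid - digits // 2)
    return int(squared[start:start + digits])

print(modulo_hash(4567, 97))      # remainder of 4567 / 97
print(square_middle_hash(4567))   # middle digits of 4567^2 = 20857489
```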

Cache system load balancing

To implement load balancing in a distributed cache cluster, such as a memcached cluster, the key of each piece of cached data is passed through a hash function so that the cached data is spread evenly across the distributed storage nodes. A hash algorithm is generally used to achieve this kind of load balancing. The following figure illustrates the distributed storage process:

Distributed cache hash storage diagram

Ordinary hash algorithm load balancing

We introduced several hashing methods above. Whichever one is chosen, in this scenario the cached data must be mapped evenly onto the server cluster by a hash function. We will use the simple modulo method to illustrate the process.

Assume there are 3 server nodes numbered [0 - 2] and 6 cache key-value pairs numbered [1 - 6]. After hashing, the cached data is mapped as follows:

Hash formula: key % total number of nodes = node index

1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
5 % 3 = 2
6 % 3 = 0

Cache hash example

The six keys are distributed evenly across the three server nodes. It looks perfect!

However, this model has two problems when used for load balancing in a distributed cluster:

1. Poor scalability

To adjust service capacity dynamically, nodes often need to be added or removed. For an e-commerce service, for example, far more machines are needed during the Double Eleven shopping festival than usual. Newly added machines invalidate the previously computed hash mappings, so to keep the load balanced the hash values must be recalculated, and any cached data whose mapping has changed must be migrated to its new node.

Assume a new server node is added, so the original three service nodes become four, numbered [0 - 3]. The hash mapping becomes:

Hash formula: key % total number of nodes = node index

1 % 4 = 1
2 % 4 = 2
3 % 4 = 3
4 % 4 = 0
5 % 4 = 1
6 % 4 = 2

Comparing the two mappings, the storage nodes for cache keys 3, 4, 5 and 6 all change: key 3 moves from node 0 to node 3, and keys 4, 5 and 6 move from nodes [1, 2, 0] to nodes [0, 1, 2]. The cached data for these keys therefore has to be migrated to the new nodes, which is time-consuming and labor-intensive. The storage layout after migration is shown below:

Cache hash scalability diagram
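A quick sanity check of the remapping above, as a small illustrative Python snippet (the keys and node counts follow the example):

```python
keys = [1, 2, 3, 4, 5, 6]

before = {k: k % 3 for k in keys}   # 3 nodes, numbered 0-2
after = {k: k % 4 for k in keys}    # 4 nodes, numbered 0-3

moved = [k for k in keys if before[k] != after[k]]
print(before)  # {1: 1, 2: 2, 3: 0, 4: 1, 5: 2, 6: 0}
print(after)   # {1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2}
print(moved)   # [3, 4, 5, 6] -> 4 of the 6 keys must be migrated
```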

2. Poor fault tolerance

Although production service nodes have various high-availability guarantees, they can still crash, and even without crashes a cluster sometimes needs to scale in. Both crashes and scale-in amount to removing a service node. Let's analyze the impact of removing a node on the hash-based load balancing.

Assume one server node is removed, so the original three service nodes become two, numbered [0 - 1]. The hash mapping becomes:

Hash formula: key % total number of nodes = node index

1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
4 % 2 = 0
5 % 2 = 1
6 % 2 = 0

The following figure shows the cache data migration distribution caused by the common hash load balancing algorithm when a node goes down:

Cache hash fault tolerance diagram

As the figure shows, simply removing one service node again changes a large share of the hash mappings, and changed mappings again mean migrating cached data (the cached data is exhausted).
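The same kind of check for node removal shows that shrinking from 3 to 2 nodes also remaps most keys (again an illustrative snippet):

```python
keys = [1, 2, 3, 4, 5, 6]
moved = [k for k in keys if k % 3 != k % 2]  # node count drops from 3 to 2
print(moved)  # [2, 3, 4, 5] -> again 4 of the 6 keys must be migrated
```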

Consistent hashing algorithm load balancing

It is precisely because cache load balancing based on an ordinary hash algorithm has poor scalability and fault tolerance that consistent hashing was introduced. So what is consistent hashing? Let's first look at the definition on Wikipedia.

Consistent hashing was proposed by David Karger and his collaborators at MIT, and the idea has since been extended to other fields. Their academic paper, published in 1997, describes how consistent hashing can be applied to distributed web services with a changing population of users. Consistent hashing can also be used to build robust caches that reduce the impact of partial system failures in large web applications.

The paper describing consistent hashing was published in 1997; readers comfortable with academic papers can read it directly for a deeper understanding. Download link: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.147.1879

Consistent hashing paper

To summarize consistent hashing in one sentence: it is an improved version of the ordinary modulo hashing algorithm. The hash function itself stays the same; what changes is that the ordinary linear hash space is replaced by a circular hash space. The specific steps are as follows:

First, select a large enough hash space (usually 0 ~ 2^32 - 1) and join its ends to form a hash ring.

Consistent hashing ring

Then, compute a hash value for each storage server node in the cache cluster, typically from the server's IP address or hostname. The resulting hash value is the node's position on the hash ring.

Node Hash

Finally, compute a hash value for each data key that needs to be stored and map it onto the same ring. Each key is stored on the first node found when walking clockwise from the key's position. The following figure shows an example of data stored this way; the rest of the discussion is based on this layout.

Consistent hashing - data placement example
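To make the three steps concrete, here is a minimal Python sketch of such a hash ring. It assumes MD5 for computing positions, a 0 ~ 2^32 - 1 ring, and illustrative node and key names; a production implementation would differ. It also includes a remove_node method, which the fault-tolerance discussion below relies on.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring: nodes and keys share one hash space,
    and a key is served by the first node found clockwise from its position."""

    RING_SIZE = 2 ** 32

    def __init__(self, nodes=None):
        self._positions = []    # sorted node positions on the ring
        self._nodes = {}        # position -> node name
        for node in nodes or []:
            self.add_node(node)

    @classmethod
    def _hash(cls, value: str) -> int:
        # Map a string (node IP/hostname or data key) into 0 ~ 2^32 - 1.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % cls.RING_SIZE

    def add_node(self, node: str) -> None:
        pos = self._hash(node)
        bisect.insort(self._positions, pos)
        self._nodes[pos] = node

    def remove_node(self, node: str) -> None:
        pos = self._hash(node)
        self._positions.remove(pos)
        del self._nodes[pos]

    def get_node(self, key: str) -> str:
        # Walk clockwise from the key's position to the first node,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._positions, self._hash(key))
        if idx == len(self._positions):
            idx = 0
        return self._nodes[self._positions[idx]]

ring = ConsistentHashRing(["node0", "node1", "node2"])
keys = ["key1", "key2", "key3", "key4", "key5", "key6"]
print({k: ring.get_node(k) for k in keys})

ring.remove_node("node2")   # only node2's keys are reassigned, clockwise
print({k: ring.get_node(k) for k in keys})
```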

Now that we have explained the principle, let’s see why this design can solve the two problems of ordinary hashing mentioned above.

As analyzed earlier, when the ordinary hash algorithm needs to add service nodes during scale-out, a large share of the original hash mappings become invalid. Now let's see how consistent hashing solves this problem.

As shown in the following figure, when a new node node3 is added to the cache cluster, only the data value3 corresponding to key3 is affected. You only need to migrate value3 from its original node node0 to the new node node3; the data stored on the other nodes is unchanged.

Consistent hashing - extending nodes

Fault tolerance improvement

When a service node of a common hash algorithm goes offline, it will also cause a large area of ​​original hash mapping to fail. The failed mapping triggers data migration, affecting cache service performance and insufficient fault tolerance. Let's take a look at how consistent hashing improves fault tolerance.

As shown in the figure below, if node2 goes offline, the data value2 and value5 originally stored on node2 are simply reassigned, clockwise, to node0, without affecting the data on the other nodes. Consistent hashing confines the impact of a node failure to the clockwise-adjacent node and avoids affecting the entire cluster.

Consistent Hashing - Deleting Nodes

Consistent Hashing Optimization

Problems

The above shows how consistent hashing solves the scalability and fault-tolerance problems of ordinary hashing. The principle is fairly simple and works well under ideal conditions, but some practical issues still need to be considered in real deployments. Let's analyze them in detail.

Imagine that the cache cluster has only a few service nodes, like the three in our example, while the hash ring space is enormous (usually 0 ~ 2^32 - 1). What problems can this cause?

One possible situation is that the hash values of the few service nodes end up clustered together, as in the figure below where node0, node1 and node2 sit close to one another. The hashes of the cached keys then mostly fall on the arc clockwise of node2, and since each key searches clockwise for its storage node, almost all of the data ends up on node0, putting enormous pressure on that single node! This situation is called data skew.

Consistent hashing - data skew

Node avalanche

Both data skew and node crashes can cause cache avalanche.

Take the data skew example above: the skew pushes nearly all of the cached data onto node0, which may become overwhelmed and crash. With node0 down, its data is pushed onto node1, which crashes in turn, and the failure then propagates to node2. The failure grows like a snowball in an avalanche, getting bigger and bigger.

Another situation is that a node goes offline for some reason. For example, when node2 in the figure below goes offline, the data originally on node2 is pushed onto node0; if the amount of data is large enough, this too can trigger a node avalanche, following the same process as above.

In short, the chain reaction that causes the entire cache cluster to become unavailable is called node avalanche.

Consistent hashing - node avalanche

Virtual nodes

How to solve the above two thorny problems? They can be solved by "virtual nodes".

A virtual node is simply an additional position on the hash ring created for a single physical node: each physical node is mapped to several positions on the ring, and these extra positions are its virtual nodes. Data that lands on a virtual node is actually stored on the corresponding physical node. In this way a physical node is spread evenly around the ring through its virtual nodes, which solves the data skew problem.

Because the virtual nodes are scattered around the hash ring, when a node goes offline the data it stored is redistributed evenly among the remaining nodes rather than piling onto a single neighbor, avoiding the avalanche caused by sudden pressure on one node.
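Continuing the earlier ring sketch, here is an illustrative Python variant with virtual nodes; the replica count (100) and the "node#i" naming scheme are assumptions, not from the article. The final count shows the keys spreading roughly evenly across the physical nodes.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualNodeRing:
    """Consistent hash ring in which each physical node owns several virtual
    positions, so keys spread evenly and a failed node's load is shared."""

    def __init__(self, nodes, replicas=100):
        self._positions = []
        self._virtual_to_physical = {}
        for node in nodes:
            for i in range(replicas):
                pos = ring_hash(f"{node}#{i}")      # one virtual node per replica
                bisect.insort(self._positions, pos)
                self._virtual_to_physical[pos] = node

    def get_node(self, key: str) -> str:
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        if idx == len(self._positions):
            idx = 0
        return self._virtual_to_physical[self._positions[idx]]

ring = VirtualNodeRing(["node0", "node1", "node2"])
counts = {}
for i in range(10000):
    node = ring.get_node(f"key{i}")
    counts[node] = counts.get(node, 0) + 1
print(counts)   # with 100 virtual nodes per physical node the split is roughly even
```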

The following figure shows the hash ring with virtual nodes. On the left is the distribution without virtual nodes; on the right, the two node0 entries with a green background are virtual nodes of node0, and the node1 entry with a red background is a virtual node of node1.

Consistent Hashing - Virtual Nodes

To sum up

This article first introduced hash algorithms, common hash functions and common hashing methods, then explained how cache load balancing is implemented with an ordinary hash algorithm, and used examples to illustrate the ordinary algorithm's scalability and fault-tolerance problems.

To solve those scalability and fault-tolerance problems, the consistent hashing algorithm was introduced, and diagrams and examples were used to analyze how it improves both. Finally, the basic consistent hashing algorithm still suffers from data skew and node avalanche, and we explained how virtual nodes optimize the algorithm and solve both problems. So, have you mastered consistent hashing?

Consistent hashing is not a difficult topic, but it is frequently tested, much like the Bloom filter: people who have never heard of it think it is very advanced, when in fact it only takes a little study. So build a broad knowledge base and go beat the interviewer, students!
