【Repost】Why is the default load factor of HashMap 0.75?

Hashtable has an initial capacity of 11 and grows to 2N+1 when it expands;

HashMap has an initial capacity of 16 and grows to 2N when it expands;

An interviewer from Alibaba suddenly asked me why the load factor is 0.75, so I came back and summarized it: the value is a compromise between improving space utilization and keeping query costs low, and the justification is mainly the Poisson distribution, under which 0.75 keeps collisions very rare.

HashMap has two parameters that affect its performance: the initial capacity and the load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries exceeds the product of the load factor and the current capacity, the hash table is expanded and rehashed (that is, the internal data structure is rebuilt), and the expanded table has approximately twice the original capacity.

Typically, the load factor requires a compromise between time and space costs.

If the load factor is too high, say 1, space overhead is reduced and space utilization improves, but collisions become more frequent and the query time cost goes up;

If the load factor is too low, say 0.5, the query time cost can be reduced, but space utilization is poor and the number of rehash operations increases.

When setting the initial capacity, the expected number of entries in the map and its load factor should both be taken into account so as to minimize the number of rehash operations. It is therefore generally recommended to set the initial capacity from your estimate of the data size when creating a HashMap, to reduce expansion operations, as in the sketch below.
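For illustration, here is a minimal sketch of that advice, assuming a hypothetical estimate of about 1,000 entries (the class and variable names are made up for this example):

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedMapExample {
    public static void main(String[] args) {
        int expectedEntries = 1_000;              // hypothetical estimate of the data size
        float loadFactor = 0.75f;                 // HashMap's default load factor

        // Pick a capacity large enough that expectedEntries never exceeds capacity * loadFactor,
        // so no resize/rehash happens while the estimated data is inserted.
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;

        Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put("key-" + i, i);               // none of these puts triggers a rehash
        }
        System.out.println("entries = " + map.size());
    }
}
```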

Choosing 0.75 as the default load factor is entirely a compromise between time and space costs.


A few days ago, I saw someone in a group discussing why the load factor in the hashmap is 0.75 by default.

Load factor in HashMap source code

static final float DEFAULT_LOAD_FACTOR = 0.75f;

What came to mind at the time was that it is a compromise between the competing concerns of “hash collisions” and “space utilization”.

Just as data structures tend to be either fast to query or fast to insert, HashMap leans toward fast queries at the cost of slower inserts.

The load factor describes how full the hash table is, i.e. the number of stored elements relative to the number of buckets.
The larger the load factor, the more elements are packed in and the higher the space utilization, but the chance of collision grows.
Conversely, the smaller the load factor, the fewer elements are packed in and the lower the chance of collision, but more space is wasted.

The greater the chance of collision, the higher the cost of a lookup; the smaller the chance, the lower the cost.

Therefore, a balance and compromise must be found between the “chance of collision” and “space utilization”.

But why exactly 0.75, rather than 0.8 or 0.6? #

Let’s dig a little deeper. Before that, a quick review of the background knowledge this article needs:

  1. Definition of a collision: assuming the address space of the hash table is [0, n), a collision occurs when the hash address j (0 <= j <= n-1) computed for a key already holds a record.

  2. Handling collisions: find another “empty” hash address for the colliding key’s record. That is, when resolving a hash-address conflict, if the alternative hash address H1 still conflicts, try the next address H2; if H2 still conflicts, try H3, and so on, until some Hk does not conflict; Hk is then the address at which the record is stored in the table.

Several ways to handle collisions: #

1. Open addressing method #

Hi = (H(key) + di) MOD m, i = 1, 2, …, k (k <= m-1), where H(key) is the hash function, m is the length of the hash table, and di is the increment sequence.

Open addressing methods can be divided into three types according to different step sizes:

1) Linear Probing: di=1, 2, 3,…, m-1
  Simply put: start from the position where the collision occurred, step forward one slot at a time, and keep searching (wrapping around) until an empty slot is found, then insert the element there; if the loop completes without finding one, the container is full. It’s like going to a street of restaurants: you ask at the first one, are told it’s full, and then ask next door whether there is a seat. (A minimal sketch follows this list.)

2) Linear compensation probing: di = Q, and the next position satisfies Hi = (H(key) + Q) mod m, i = 1, 2, …, k (k <= m-1), where Q and m must be coprime so that every cell in the hash table can be probed.
Continuing the example above: instead of asking door to door, you take out a calculator, work out a fixed stride, and only ask every Q-th restaurant whether there is a seat.

3) Pseudo-random probing: di is a pseudo-random number sequence. In the same example, this is like picking the next restaurant on a whim.
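Here is a minimal sketch of open addressing with linear probing (di = 1). The toy class below is purely illustrative, not how java.util.HashMap is implemented:

```java
/** Toy hash table using open addressing with linear probing (di = 1). */
public class LinearProbingTable {
    private final Integer[] slots;

    public LinearProbingTable(int m) {
        this.slots = new Integer[m];          // m = length of the hash table
    }

    /** Inserts a key; returns false if the whole table has been probed and is full. */
    public boolean insert(int key) {
        int m = slots.length;
        int home = Math.floorMod(key, m);     // H(key): the home bucket
        for (int i = 0; i < m; i++) {
            int probe = (home + i) % m;       // Hi = (H(key) + i) mod m
            if (slots[probe] == null) {       // the first empty slot ends the probe sequence
                slots[probe] = key;
                return true;
            }
        }
        return false;                         // walked the entire table: it is full
    }
}
```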

Shortcomings:

  • A hash table built this way tends to pile colliding records together (clustering) when there are many collisions, which hurts lookups;
  • A node cannot be deleted simply by clearing its slot, because that would cut off the probe path of the synonym nodes placed in the table after it. Therefore, in a hash table that resolves collisions by open addressing, a deletion can only mark the node as deleted; the node itself cannot be removed.
  • When the table is full, an overflow table also has to be created to store the extra elements.

2. Re-hashing #

Hi = RHi(key), i = 1, 2, …, k
The RHi are different hash functions: when synonyms produce an address conflict, another hash function is applied to compute a new address, and this repeats until no conflict occurs. This method is less prone to clustering, but it increases computation time (a small sketch follows).

Disadvantage: increased computation time.
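A minimal sketch of the re-hashing idea, assuming an arbitrary family of hash functions chosen just for this example:

```java
import java.util.function.IntUnaryOperator;

public class RehashingExample {
    public static void main(String[] args) {
        int m = 11;                                    // table length (illustrative)
        Integer[] table = new Integer[m];

        // RH1, RH2, RH3: different hash functions tried in order until no conflict occurs.
        IntUnaryOperator[] hashes = {
                key -> Math.floorMod(key, m),
                key -> Math.floorMod(key * 31 + 7, m),
                key -> Math.floorMod(key * 131 + 3, m)
        };

        int key = 42;
        for (IntUnaryOperator rh : hashes) {
            int h = rh.applyAsInt(key);                // Hi = RHi(key)
            if (table[h] == null) {                    // stop at the first non-conflicting address
                table[h] = key;
                break;
            }
        }
    }
}
```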

3. Create a public overflow area #

Assume the range of the hash function is [0, m-1]. A vector HashTable[0…m-1] serves as the basic table, each slot storing one record, and a vector OverTable[0…v] serves as the overflow table. Every record whose key is a synonym of a key already in the basic table is placed in the overflow table as soon as a conflict occurs, regardless of the hash address computed for it.

Simply put, a separate table is created to store the conflicting elements (see the sketch below).
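A minimal sketch of a public overflow area, with made-up names mirroring the HashTable and OverTable vectors from the description above:

```java
import java.util.ArrayList;
import java.util.List;

public class OverflowAreaTable {
    private final Integer[] baseTable;                               // HashTable[0..m-1]
    private final List<Integer> overflowTable = new ArrayList<>();   // OverTable: shared by all conflicts

    public OverflowAreaTable(int m) {
        this.baseTable = new Integer[m];
    }

    public void insert(int key) {
        int h = Math.floorMod(key, baseTable.length);
        if (baseTable[h] == null) {
            baseTable[h] = key;            // no conflict: store in the basic table
        } else {
            overflowTable.add(key);        // any conflict goes to the common overflow table
        }
    }

    public boolean contains(int key) {
        int h = Math.floorMod(key, baseTable.length);
        return Integer.valueOf(key).equals(baseTable[h]) || overflowTable.contains(key);
    }
}
```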

4. Chain address method (zipper method) #

All records whose keys are synonyms are stored in the same linked list; in other words, the elements that collide at a position are chained into a linked list (a minimal sketch follows).
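A minimal chaining sketch (again a toy class, not HashMap's actual implementation): every key that hashes to a bucket is simply appended to that bucket's linked list:

```java
import java.util.LinkedList;

public class ChainedTable {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedTable(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    public void insert(int key) {
        int h = Math.floorMod(key, buckets.length);
        buckets[h].add(key);                              // colliding keys chain onto the same list
    }

    public boolean remove(int key) {
        int h = Math.floorMod(key, buckets.length);
        return buckets[h].remove(Integer.valueOf(key));   // deletion only touches one list node
    }
}
```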

Advantages of the zipper method:

  • Chaining handles collisions simply and causes no clustering, since keys that are not synonyms never conflict, so the average search length is shorter;
  • Since the node space of each linked list is allocated dynamically, chaining is better suited to cases where the table length cannot be determined before the table is built;
  • In a hash table built with chaining, deleting a node is easy to implement: simply remove the corresponding node from its linked list.

Disadvantages of the zipper method:

  • The pointers require extra space, so when nodes are small the open addressing method saves space; and if the space saved on pointers is used instead to enlarge the hash table, the fill factor drops, which reduces collisions in open addressing and speeds up the average lookup.


Data structure of HashMap in Java #

HashMap is essentially a “linked-list hash” data structure, that is, an array combined with linked lists.

(Figure: HashMap data structure; image from the Internet)

As the figure shows, HashMap in Java uses chaining (the zipper method) to handle collisions.
HashMap has an initial capacity; the default is 16:

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

To reduce the probability of collisions, once the number of entries in the HashMap reaches a critical value, an expansion is triggered and all elements are rehashed into the enlarged container, which is a very time-consuming operation.

This critical value is determined by the load factor and the current capacity: DEFAULT_INITIAL_CAPACITY × DEFAULT_LOAD_FACTOR. That is, with the defaults, once more than 16 × 0.75 = 12 entries are stored, an expansion is triggered (the sketch below walks through the doubling).
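The doubling rule can be sketched in a few lines (just the arithmetic, not HashMap's actual resize code): starting from the default capacity 16 and load factor 0.75, the thresholds are 12, 24, 48, …

```java
public class ResizeThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;                                // DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;                         // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);    // 16 * 0.75 = 12

        for (int entries = 1; entries <= 100; entries++) {
            if (entries > threshold) {                    // exceeding the threshold triggers a resize
                capacity <<= 1;                           // the table doubles in size
                threshold = (int) (capacity * loadFactor);
                System.out.println("resize to capacity " + capacity + " at entry " + entries);
            }
        }
    }
}
```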

Therefore, when using a hash container, try to estimate the size of your data and set the initial capacity accordingly. For the concrete implementation, study the HashMap source code yourself.

That covers the background. Back to the topic: why does the load factor default to 0.75?
The following paragraph can be found in the HashMap source comments:

Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
Note the keyword in the wiki link: Poisson_distribution, i.e. the Poisson distribution.

In plain terms: ideally, with random hash codes, the number of nodes in a hash bucket follows a Poisson distribution, and the comment gives the probability for each bucket-occupancy count.

As the table shows, by the time a bucket holds 8 elements the probability is already vanishingly small; in other words, with 0.75 as the load factor it is almost impossible for the linked list at any collision position to grow beyond length 8 (the snippet below reproduces these numbers).
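The values in that table can be reproduced with a few lines of Java evaluating exp(-0.5) * pow(0.5, k) / factorial(k):

```java
public class PoissonBucketSizes {
    public static void main(String[] args) {
        double lambda = 0.5;                      // the parameter cited in the HashMap comment
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) {
                factorial *= k;                   // running k!
            }
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);    // matches the values quoted above
        }
    }
}
```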

Digging any deeper would take us into statistics, so let’s stop here. To reiterate: when using a hash container, try to specify the initial capacity, and make it a power of 2.

For background on the Poisson distribution, see Poisson and Exponential Distributions: 10-minute tutorial.

Reference: Why the default load factor of HashMap in Java is 0.75
