Why the load factor in Java HashMap defaults to 0.75


A few days ago, I saw people in a group chat discussing why HashMap's load factor defaults to 0.75.

The load factor in the HashMap source code:

static final float DEFAULT_LOAD_FACTOR = 0.75f;

What came to mind at the time was that it is a compromise between two competing concerns: hash collisions and space utilization.
Just as data structures are typically either fast to query or fast to insert, HashMap is a data structure that trades slower insertion for fast queries.

The load factor is the degree to which the hash table is filled with elements.
The larger the load factor, the fuller the table: space utilization is higher, but the chance of collision increases.
Conversely, the smaller the load factor, the emptier the table: the chance of collision drops, but more space is wasted.

The greater the chance of collision, the higher the lookup cost; the smaller the chance, the lower the cost.

Therefore, a balance must be struck between collision probability and space utilization.

But why exactly 0.75, rather than, say, 0.8 or 0.6?

In the spirit of getting to the bottom of things, let's dig deeper. Before that, a quick review of the background knowledge this article needs:

  1. Definition of a collision: assuming the address space of the hash table is [0, n), a collision occurs when the hash address j (0 <= j <= n-1) computed for a key already holds a record.

  2. Handling collisions: find another "empty" hash address for the colliding key's record. That is, if the alternative hash address H1 still collides, try the next address H2; if H2 still collides, try H3; and so on until some Hk does not collide. Hk is then the address at which the record is stored.

Several ways to handle collisions

1. Open addressing method

Hi = (H(key) + di) MOD m, i = 1, 2, …, k (k <= m-1), where H(key) is the hash function, m is the length of the hash table, and di is an increment sequence.

Open addressing methods can be divided into three types according to different step sizes:

1) Linear probing: di = 1, 2, 3, …, m-1
  Simply put, the collision position is the starting point, the step size is 1, and probing repeats until an empty slot is found; the element is inserted there, and if the probe loops all the way around without finding one, the container is full. It's like walking down a street of restaurants: the first one tells you it's full, so you ask next door whether there's a seat, and so on.
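As a rough sketch of this (illustrative only: HashMap itself does not use open addressing, and the bare Object[] table and method name here are made up):

// Minimal linear-probing insert (illustrative sketch, not JDK code).
// `keys` is a hypothetical hash table of length m.
static int linearProbeInsert(Object[] keys, Object key) {
    int m = keys.length;
    int home = (key.hashCode() & 0x7fffffff) % m;  // H(key)
    for (int d = 0; d < m; d++) {                  // d = 0 checks H(key) itself; d = 1, 2, ... are the probes di
        int i = (home + d) % m;                    // Hi = (H(key) + di) MOD m
        if (keys[i] == null) {                     // empty slot found
            keys[i] = key;
            return i;
        }
    }
    return -1;                                     // looped all the way around: the table is full
}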

2) Linear compensation probing: di = Q, so the next position satisfies Hi = (H(key) + i*Q) MOD m, i = 1, 2, …, k (k <= m-1), and Q and m must be relatively prime so that every cell in the hash table can be probed.
Continuing the restaurant example: instead of asking house by house, you take out a calculator, do some math, and then ask only every Q-th restaurant whether there is a seat.

3) Pseudo-random probing: di = a pseudo-random number sequence. In the restaurant example, this is picking the next restaurant to ask purely on a whim.

Disadvantages of open addressing:

  • A hash table built this way tends to accumulate clusters of records when collisions are frequent, which hurts lookups;
  • Deleting a node cannot simply clear its slot, because that would break the probe path to any synonym stored after it. So when deleting from a hash table that resolves collisions by open addressing, only a deletion mark can be placed on the node; the node itself cannot be removed (see the sketch after this list).
  • When the table fills up, an overflow table must still be created to store the extra elements.
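The deletion-mark idea from the second bullet, continuing the hypothetical table above (TOMBSTONE is an assumed sentinel for this illustration, not anything from the JDK):

// Tombstone deletion for open addressing (illustrative sketch, not JDK code).
static final Object TOMBSTONE = new Object();

static void tombstoneDelete(Object[] keys, Object key) {
    int m = keys.length;
    int home = (key.hashCode() & 0x7fffffff) % m;
    for (int d = 0; d < m; d++) {
        int i = (home + d) % m;
        if (keys[i] == null) return;   // reached an empty slot: key is absent
        if (key.equals(keys[i])) {
            keys[i] = TOMBSTONE;       // mark instead of clearing, so probe
            return;                    // chains passing slot i stay intact
        }
    }
}

Lookups must treat a tombstone as occupied and keep probing past it, while later inserts are free to reuse the slot.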

2. Rehashing

Hi = RHi(key), i = 1, 2, …, k
where RH1…RHk are different hash functions: when synonyms produce an address collision, the address is recomputed with another hash function until no collision occurs. This method is less prone to clustering.

Disadvantage: increased computation time.
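A minimal sketch of the idea, with made-up multiplier-based hash functions standing in for RH1…RHk:

// Rehashing (illustrative sketch): on a collision, recompute the address
// with a different hash function RHi. The multipliers are arbitrary.
static int rehashInsert(Object[] keys, Object key) {
    int m = keys.length;
    int[] multipliers = {1, 31, 37, 41};  // one per hash function RHi
    for (int mult : multipliers) {
        int i = ((key.hashCode() * mult) & 0x7fffffff) % m;  // Hi = RHi(key)
        if (keys[i] == null) {
            keys[i] = key;
            return i;
        }
    }
    return -1;  // all k hash functions collided; the caller must handle overflow
}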

3. Create a public overflow area

Suppose the range of the hash function is [0, m-1]. Set up a vector HashTable[0…m-1] as the basic table, each slot storing one record, and a vector OverTable[0…v] as the overflow table. Any record whose key is a synonym of a key already in the basic table, whatever hash address the hash function assigns it, is placed into the overflow table as soon as the collision occurs.

Simply put, a separate table is created to store the colliding elements.
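A minimal sketch under those assumptions (hashTable and overflow are illustrative names, not JDK internals):

// Public overflow area (illustrative sketch): the basic table holds at most
// one record per slot; every colliding record goes to a shared overflow list.
static void overflowInsert(Object[] hashTable, java.util.List<Object> overflow, Object key) {
    int i = (key.hashCode() & 0x7fffffff) % hashTable.length;
    if (hashTable[i] == null) {
        hashTable[i] = key;   // the basic-table slot was free
    } else {
        overflow.add(key);    // any collision lands in the overflow table
    }
}

A lookup then checks the computed slot first and, if the keys do not match, scans the overflow table sequentially.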

4. Chaining (the zipper method)

All records whose keys are synonyms are stored in the same linear linked list; that is, the elements that collide at a given position are chained into a linked list.
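A bare-bones sketch of the structure (this mirrors the idea only; HashMap's real implementation uses its own Node class):

import java.util.LinkedList;

// Separate chaining (zipper method), illustrative sketch:
// each bucket is a linked list of the keys that hash to it.
class ChainedTable {
    private final LinkedList<Object>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        buckets = new LinkedList[m];
    }

    void insert(Object key) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        buckets[i].add(key);  // synonyms simply append to the same bucket's list
    }
}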

Advantages of the zipper method:

  • Handling collisions with the zipper method is simple and there is no accumulation: non-synonyms never collide, so the average search length is shorter;
  • Since the node space of each linked list is allocated dynamically, the zipper method better suits cases where the table length cannot be determined before the table is built;
  • In a hash table built with the zipper method, deleting a node is easy to implement: simply remove the corresponding node from its linked list.

Disadvantages of the zipper method:

  • Pointers take extra space, so when records are small, open addressing saves space; and if the space saved on pointers is used to enlarge the hash table instead, the fill factor drops, which reduces collisions in open addressing and thus speeds up the average lookup.

Data structure of HashMap in Java

HashMap is actually a "linked-list hash" data structure, that is, a combination of an array and linked lists.

[Figure: the HashMap data structure, an array of buckets each holding a linked list; image from the web]

As the figure shows, Java's HashMap uses the zipper method to handle collisions.
HashMap also has an initial capacity, which defaults to 16:

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

To reduce the probability of collision, when the number of elements in a HashMap reaches a critical value, an expansion is triggered and all elements are rehashed into the enlarged container, which is a very time-consuming operation.

This critical value is determined by the load factor and the current capacity of the container: DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR. That is, with the defaults, the expansion is triggered at 16 × 0.75 = 12 elements.

Therefore, when using a hash container, try to estimate your data volume and set an initial capacity accordingly. For the concrete implementation, study the HashMap source code yourself.
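For example (the expected / 0.75 + 1 presizing formula below is a common rule of thumb, not an official API):

import java.util.HashMap;
import java.util.Map;

public class Presize {
    public static void main(String[] args) {
        // threshold = capacity * loadFactor: with the defaults,
        // 16 * 0.75 = 12, so the 13th put() triggers a resize.
        int threshold = (int) (16 * 0.75f);
        System.out.println(threshold);  // 12

        // Presizing rule of thumb: for ~100 expected entries, request
        // enough capacity that no resize happens while filling the map.
        int expected = 100;
        Map<String, String> map = new HashMap<>((int) (expected / 0.75f) + 1);
    }
}

Note that HashMap rounds any requested capacity up to the next power of two internally.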

With the basics covered, back to the topic: why does the load factor default to 0.75?
I found this paragraph in the HashMap source comment:

Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:

  0: 0.60653066
  1: 0.30326533
  2: 0.07581633
  3: 0.01263606
  4: 0.00157952
  5: 0.00015795
  6: 0.00001316
  7: 0.00000094
  8: 0.00000006
  more: less than 1 in ten million

Note the key term in the wiki link: the Poisson distribution.

Roughly translated: in the ideal case, with random hash codes, the number of nodes per bucket follows a Poisson distribution, and the comment gives a table mapping bucket size to probability.

As the table shows, by the time a bucket holds 8 elements the probability is already tiny (about 6 in 100 million). In other words, with 0.75 as the load factor, it is almost impossible for the linked list at any collision position to grow longer than 8.
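If you want to verify those numbers, the comment's formula can be evaluated directly:

// Reproduces the table from the HashMap source comment:
// P(k) = exp(-0.5) * pow(0.5, k) / k!  (a Poisson distribution with lambda = 0.5)
public class PoissonTable {
    public static void main(String[] args) {
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) factorial *= k;
            double p = Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);
        }
    }
}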

Digging any deeper heads into statistics territory, so let's stop here. To reiterate: when using a hash container, specify the initial capacity whenever you can, and make it a power of 2.

For more on the Poisson distribution, see:

[http://www.ruanyifeng.com/blog/2015/06/poisson-distribution.html#comment-356111]

Comments welcome. This weekend's task is done, so I can go off and have fun with a clear conscience ─=≡Σ((((つ •̀ω•́) つ

Author: Eric Shinnosuke
Link:
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
