Description:
In the case where most searches on a HASH key of HEAP table will result in non-match, the implementation is currently unoptimized and inferior to traditional sparsely populated hashing algorithms. An optimal solution is provided below, which maximizes speed and minimizes memory usage.
These cases can be quite important. For example, in our application we want to have a table of tokens which indicate a message is a spam. Typically these tokens will be very few in number, yet the total number of tokens to be tested will be very large. Perhaps several orders of magnitude between the two, so perhaps 99.99% of searches will result in a non-match.
(1) Please refer to following prequisite bug report, where I studied the "mysys/hash.c" source code (granted the "heap/hash.c" may differ slightly but I think all points herein remain valid and applicable):
http://bugs.mysql.com/bug.php?id=7817
Currently, the HASH key implementation _never_ has an empty hash bucket, the # of buckets is always equal the # of records, and the overhead for duplicates is almost the same as for non-matches.
The algorithm for searching is:
* calculate hash index of search key into hash table array
* In a loop, test for key match in each linked node in hash table array
* If record of 1st node in loop has a key whose hash index does not
match the calculated hash index, then search key does not exist
and exit loop
* loop ends with last linked node
Additionally (but not crucial to my reason for optimizing this) there are even rarer more degenerate (even slower) cases:
* a non-match may in some cases require walking entire linked list in the
loop, when record of 1st node in loop has a key whose hash index _does_
match the calculated hash index
* without "MTF" optimization, duplicates may in some cases require walking
entire linked list in the loop
(2) Thus for both non-matches and duplicates, minimally the hash index function must be called twice and at least one key compared. For long char() keys this can be quite expensive.
(3) In the prequisite bug report above, I already proposed a faster hash function. This will help, especially duplicates, but is lacking for non-matches. Also I proposed an "MTF" algorithm so that searching for MRU duplicates results in less slowdown due to duplicates.
(4) However, #3 still does not make non-matches as fast as for a sparsely populated hash table. Thus I propose that a new keyword be added, e.g. "HASH_QUICK (multiple)", for which in addition to the hash table buckets for hash.h::HASH_LINK, there is also a duplicate hash table buckets array where each bucket is 1 bit and the # of buckets (the TSIZE per source code of the optimal hash function I suggested) is much greater than ("multiple" multiplied by) the number of records. For example, if the TSIZE for the bit bucket array was 32 times larger than for the HASH_LINK bucket array, then the bit bucket array would only consume 50% less memory than the HASH_LINK bucket array (because HASH_LINK is 64 bits on 32 bit cpu). The bit bucket array and HASH_LINK bucket array could each use different hash functions as well, if desired.
Thus for most non-matches (roughly 31/32 if use 32 as "multiple") it would only involve masking the hash function return value to the appropriate bit and if bit is not set then immediately return non-match. If bit is set, then continue with HASH key code.
How to repeat:
Please read Description.
Suggested fix:
Please read Description.