Description:
In the case where most searches on a HASH key of HEAP table will result in non-match, the
implementation is currently unoptimized and inferior to traditional sparsely populated
hashing algorithms. An optimal solution is provided below, which maximizes speed and
minimizes memory usage.
These cases can be quite important. For example, in our application we want to have a
table of tokens which indicate a message is a spam. Typically these tokens will be very
few in number, yet the total number of tokens to be tested will be very large. Perhaps
several orders of magnitude between the two, so perhaps 99.99% of searches will result in
a non-match.
(1) Please refer to following prequisite bug report, where I studied the "mysys/hash.c"
source code (granted the "heap/hash.c" may differ slightly but I think all points herein
remain valid and applicable):
http://bugs.mysql.com/bug.php?id=7817
Currently, the HASH key implementation _never_ has an empty hash bucket, the # of buckets
is always equal the # of records, and the overhead for duplicates is almost the same as
for non-matches.
The algorithm for searching is:
* calculate hash index of search key into hash table array
* In a loop, test for key match in each linked node in hash table array
* If record of 1st node in loop has a key whose hash index does not
match the calculated hash index, then search key does not exist
and exit loop
* loop ends with last linked node
Additionally (but not crucial to my reason for optimizing this) there are even rarer more
degenerate (even slower) cases:
* a non-match may in some cases require walking entire linked list in the
loop, when record of 1st node in loop has a key whose hash index _does_
match the calculated hash index
* without "MTF" optimization, duplicates may in some cases require walking
entire linked list in the loop
(2) Thus for both non-matches and duplicates, minimally the hash index function must be
called twice and at least one key compared. For long char() keys this can be quite
expensive.
(3) In the prequisite bug report above, I already proposed a faster hash function. This
will help, especially duplicates, but is lacking for non-matches. Also I proposed an
"MTF" algorithm so that searching for MRU duplicates results in less slowdown due to
duplicates.
(4) However, #3 still does not make non-matches as fast as for a sparsely populated hash
table. Thus I propose that a new keyword be added, e.g. "HASH_QUICK (multiple)", for
which in addition to the hash table buckets for hash.h::HASH_LINK, there is also a
duplicate hash table buckets array where each bucket is 1 bit and the # of buckets (the
TSIZE per source code of the optimal hash function I suggested) is much greater than
("multiple" multiplied by) the number of records. For example, if the TSIZE for the bit
bucket array was 32 times larger than for the HASH_LINK bucket array, then the bit bucket
array would only consume 50% less memory than the HASH_LINK bucket array (because
HASH_LINK is 64 bits on 32 bit cpu). The bit bucket array and HASH_LINK bucket array
could each use different hash functions as well, if desired.
Thus for most non-matches (roughly 31/32 if use 32 as "multiple") it would only involve
masking the hash function return value to the appropriate bit and if bit is not set then
immediately return non-match. If bit is set, then continue with HASH key code.
How to repeat:
Please read Description.
Suggested fix:
Please read Description.