Bug #7817 HEAP table faster for duplicate keys with "MTF" & bitwise hash function
Submitted: 11 Jan 2005 19:33 Modified: 26 Jun 2005 8:16
Reporter: Shelby Moore Email Updates:
Status: Open Impact on me:
Category:MySQL Server: Memory storage engine Severity:S4 (Feature request)
Version:4.0.17 OS:FreeBSD (FreeBSD)
Assigned to: CPU Architecture:Any
Tags: Contribution
Triage: D5 (Feature request)

[11 Jan 2005 19:33] Shelby Moore
Please see my comment:

"11 Jan 8:20pm] Shelby Moore"

in following bug report:


I believe that storage requirements of the HEAP table can be reduced by often orders of magnitude, and thus performance increased by orders of magnitude!

The optimizations needed are quite simple to implement.

Putting a copy of KEY in index because it takes one less memory read, doesn't make sense if it causes the L2 cache to page to main memory, and causes the table data to page to disk.

And no need to have huge dynamic hash tables with the very simple "move to first" algorithm.

How to repeat:
See the description please.

Suggested fix:
See the description please.
[12 Jan 2005 10:00] Shelby Moore
Let me make the proposed optimizations more clear and give examples why they are necessary:

(1) "move to first" hashing:


First, I should ask if MySQL's "HEAP" (aka "MEMORY") storage engine already implements "move to first" hashing??

My guess is it is not optimized, based on comments in the MySQL docs, "degree of slowdown is proportional to the degree of duplication (or, inversely proportional to the index cardinality)".

This optimization is very simple to implement.  When accessing the linked list of records in the hash buck, if the found record is not first in the linked list, then move it to first.  This requires only 3 pointer writes.  The research paper in above link is an easy read.  Check it out.

As a result, hashing performance does not degrade significantly, even if the size of the hash buckets table is not increased.  Without "move to first" hashing optimization, the hash bucket tables have to be dynamically doubled in size as load (average linked list length) increases.  An example given in the research paper in above link, shows that for a very large database, a 512KB hash bucket table with "move to first" hashing is only a few percent slower than a 32MB hash bucket table without "move to first" hashing!!

I think implementing the optimization would be very simple.  Just add a small snippet of code that tests if the found record is not first in the linked list, then move it to first.  This requires only 3 pointer writes.  Point me to the correct file of source code and I can probably make this change in 30 minutes or less.

The other required change to the source code would be to stop increasing the size of the hash bucket table when it reaches a limit.  I would suggest the default limit would be based on the size of the KEY times number of records (per the benchmarks in the research above).  This limit could be optionally configurable from the CREATE TABLE syntax in future.

Consider the following example usage to see why this is a critically needed optimization.

Example #1: Consider example #2 in the case where you have 127 HEAP tables, not necessarily related to each other.  Using hash bucket table that are excessively large (very sparsely populated) is a huge waste of memory when many HEAP tables are used.  Which I think is quite common in MySQL's general use of memory temporary tables.

Example #2: a MyISAM table with varchar(127) and 6 bytes of other field data, is currently populated with 4 million records, and consumes less than 200MB of disk for index + datafile.  It runs at least 100x too slow to be of practical use.  It must be moved into memory.  It will grow to at least 12 million records.  The only practical options are to host the MyISAM in Ramdisk or to use HEAP tables.  The Ramdisk option is less than optimal hack.  To use HEAP, we observe that the current average size of the varchar(127) is 13 bytes, and the char() string is not changed after record is inserted, so we can break the MyISAM table into 127 tables, each with a constant char() field ranging from 1 - 127 length.  The has advantage of making table locking more granular as well.

However this example use of HEAP becomes impossible if the hash bucket tables are allowed to expand.  Imagine 127 * 32MB > 4GB (can not fit into Intel 32 bit memory pointer).  And even if hash bucket tables are kept small, then without "move to first" optimization, then performance will suffer significantly with increasing KEY duplication (i.e. large datasets).
[12 Jan 2005 10:40] Shelby Moore
(2) "Do always not store 2 copies of KEY data" optimization:

For CREATE TABLE scheme where the max length of the KEY field is large compared to the length of all fields in the table row, then duplicating the a copy of the KEY data in the hash index linked list is not optimal.  Instead, just follow the pointer to record in the linked list to get the value of the KEY.

I am assuming a data structure for MySQL "HEAP" indexes, where the hash buckets are pointers.  These hash bucket pointers point (or null if no entries) to a linked list structure that contains 2 pointers, one to the record and the other to the next entry in the linked list (or null if none).

The linked list structures currently ALWAYS contain a copy of the KEY field.  I assume this is to speed up searching the linked list, because you do not have to read the KEY field from the record (which could case a memory from disk page fault).  This is reasonable strategy for small KEY field sizes.  Note that "move to first" algorithm makes any gain from copy of KEY data neglible.  And more importantly, this tradeoff swings to side of not putting a copy of KEY in the linked list, when the size of the KEY field becomes large (because then linked lists become large and cause L2 cache from memory page fault), and especially when the size of KEY field is large compared to record size, as overall memory consumption can be double (which of course can lead to memory from disk page fault generally in the system).

Consider the following example usage to see why this is a critically needed optimization.

Example #3:  Imagine a table schema with 1 of char(64) fields.  The current memory storage requirement would be (64 + 8) + 64.  The linked lists consume (64 + 8) per row.  Without the copy of the KEY field data in the index linked lists, then memory storage requirement would be 8 + 640.  The linked lists would then consume 8 per row.

That means a memory consumption reduction from 136 to 70 per row, or 49% reduction.  Increasing memory from 70 to 136, is 94% (almost double).  And for the linked list index it is 8 to 70, is 775% (almost an order of magnitude!!!!!).    So for example a linked list that would fit into 512KB L2 cache would consume 4MB and not fit without this optimization.

Example #4: for my usage in Example #2, this optimization means 33% reduction in overall memory consumption, e.g. estimating > 256MB (when we reach 12 million records).  And more importantly the linked list memory (and thus hashing) will be much more likely to fit in L2 cache and be orders of magnitude faster in performance (and even more so if "move to first" is also implemented).
[12 Jan 2005 11:06] Shelby Moore
(3) "Bitwise hashing" optimization

The research paper claims this is the most optimal hash function.  It is much faster than others (no multiplies or divisions) and is equivalently optimal in distribution.  And that prime number sizes are not necessary for hash bucket tables.  Powers of 2 work as well.

I assume MySQL HEAD already does this for char KEYs:

/* Author J. Zobel, April 2001.
   Permission to use this code is freely granted, provided that this
   statement is retained. */

/* Bitwise hash function.  Note that tsize does not have to be prime. */
#define TSIZE	1048576
#define SEED	1159241

unsigned int
bitwisehash(char *word, int tsize, unsigned int seed)
    char	c;
    unsigned int h;

    h = seed;
    for( ; ( c=*word )!='\0' ; word++ )
	h ^= ( (h << 5) + c + (h >> 2) );
    return((unsigned int)((h&0x7fffffff) % tsize));
[12 Jan 2005 23:35] Sergei Golubchik
Thanks! It was an interesting and thought-provoking reading :)

We don't do curently "move-to-first" optimization. I expect it won't be that efficient in the hashing code that MySQL has. You see, we don't use standard two-level "hash table - linked list for collisions" scheme. If you're curious - check mysys/hash.c function - but it's no easy reading, (it took me quite a while few years ago to understand how it works - it was something I never saw before).
In mysys/hash.c approach there are no explicit hash buckets, no explicit linked lists. Size of hash is always equal to the number of elements in it, no waste of memory (for sparsely populated buckets) whatsoever.

About "do always not store 2 copies of KEY data" optimization.
Yes, this could be done. We'll think about it (if it's easy and won't require big code changes it could be done relatively fast)

"Bitwise hashing" - again, it should be tested how well it performs in our hashing scheme.
[13 Jan 2005 12:04] Shelby Moore
This is getting more interesting!

I have studied mysys/hash.c and mysys/hash.h.  In 8 minutes, I think I understand how it works.

(1) The important function to understand first is hash.c::hash_search() and the functions it calls.

The data structure is dynamic array of hash table buckets (hash.h::HASH_LINK), which are also the linked list.  There is a linked list.  The # of buckets equals the number of records.  The buckets point to other buckets in the chain of a hash match.

The algorithm for searching is:

* calculate hash index of search key into hash table array
* In a loop, test for key match in each linked node in hash table array
* If record of 1st node in loop has a key whose hash index does not match the calculated hash index, then search key does not exist and exit loop
* loop ends with last linked node

(2) Realize the above algorithm (in #1) is no different in terms of searching performance than a standard hashing algorithm in that it maps keys to hash table buckets, then walks a linked list (where the links are offsets into array instead of pointers) of record pointers whose key matches hash value of search key.  The difference in performance is the size (in buckets) of the hash table is ALWAYS equal to the number of records.  The other difference is that is never has an unused bucket.

(3) In most (if not all) cases the algorithm (in #1 above) uses the same memory as compared to "move to first" ("MTF") algorithm.  I was wrong about the order of magnitude difference.  The only difference is up to 100% overuse from the duplication of they KEY data (see #7 below), which is independent of the hashing algorithm used (e.g. "MTF" or not).

The only time the current algorithm (in #1 above) would use less memory than "MTF", is if the hash table buckets were sparsely populated (there were unused buckets) for the "MTF" algorithm, but this is never the case.

To prove this, look at "Table 1" and "Table 3" in the research paper I gave a link to.  First realize from #2 above that the algorithm in #1 above is in terms of search performance, equivalent to standard non-"MTF" hash algorithm, with the only difference being the size of the hash table used.  Study the "MTF" hash tables sizes needed to match the performance of the non-"MTF" where hash table size is dynamically increased.  Note that "Small AP" has 74,439 records (from "Table 1"), and an 8KB "MTF" hash table (1K buckets hash table size) can match the performance of a 0.26MB (33K buckets hash table size) dynamically increased final size.  Given that algorithm above in #1 will max out at 74,439 buckets hash table size, and safely assume given that the 1K buckets of "MTF" algorithm will not be sparsely populated with 74,439 records.  Taking into consideration that dynamically growing incrementally (as each record added) vs. "doubling" might be faster and taking into account 74,439 final size vs. 33K buckets, a more reasonable comparison would be no more than 64KB for the MTF hash table size (8K buckets).  Still very unlikely for the 8K buckets to be sparsely populated with 74,439 records.

For "Large Web" a matching performance 1MB "MTF" hash table (128K buckets) will not likely be sparsely populated with 9 million records.

(4) Thus the difference in performance between the "MTF" and algorithm currently used (in #1 above) is that "MTF" does not have the overhead of dynamic sizing of the hash table array, "MTF" can use a faster hash function that has no multiplications, and "MTF" will not slow down as much with (inversely to) index cardinality.  I quote from the research paper, "...while a table without move-to-front is perhaps 40% slower again...".   However, the # of hash buckets (hash table size) is larger for algorithm currently used (in #1 above), yet I doubt that the distribution of it's hash function is perfect.

(5) So they appear to mostly equal, except I think "MTF" will be significantly superior when there are many duplicate keys (because we can assume records are not accessed with equal frequency).  Thus implementing "MTF" could remove a caveat from the docs that advises use of BTREE:


(6) Implementing "MTF" in the algorithm current used (in #1 above) is easy.  In hash.c::hash_search(), when we find a match, we copy the first node to the found node position in the array, vice versa, and then we still need to do the 3 data pointer writes.  This would add the best of "MTF" to the best of current algorithm.  It would slow the algorithm slightly overall in return for less slowdown inversely to index cardinality.

Actually there is an optimization.  We only need to copy the data pointers as the indexes to next do not get swapped.  So actually the current algorithm is able to implement "MTF" with only 2 writes instead of 3 for normal hash table linked list!

Do you want me to do it?

(7) I have not yet figured out where the duplicate copy of the KEY data is stored. I think you need to point me to source that calls into hash.c.  Looks like it may be stored in the record itself?  At least it requires reading the data pointer in the linked list of HASH_INDEX.  Why store two copies of the key, when you are doing an extra indirection any way (might as well just get the value from the field in the record)??

(8) I think the hash function quoted from the research paper could be used for current algorithm (in #1 above).  The only requirement appears to be that the hash distributes equally across a non-prime (because = # of records) # of buckets.  The hash function in research paper does that and does it without using multiplications.  I might experiment with that hash function within next 3 weeks and report my results back here.
[13 Jan 2005 12:12] Shelby Moore
I suggest that some explanations from this page would be very useful if added to docs or technical article for MySQL, to explain hashing data structures, hashing algorithms, and the tradeoffs, to further convince people that MySQL is (will be after we finish this) optimal.
[13 Jan 2005 12:18] Shelby Moore
For the curious, off topic and FYI only, the examples I have discussed and our motivation for optimal HEAP tables is for a new statistical algorithm (superior to Bayesian) for anti-spam.  The current website (http://AccuSpam.com) does not yet reflect the new algorithm we are implementing/testing.
[13 Jan 2005 13:31] Shelby Moore
As promised, I have done the necessary coding.  Here follows the hash.c::hash_search() function which implements "MTF"!  Just plug this in.

Contrary to what I wrote earlier, "MTF" in hash_search() requires 5 writes instead of the 3 with normal linked-list "MTF", but I suspect this is insignificant performance overhead.

Note in the "MTF" case, the meaning (semantics) of HASH::current_record is changed (it does not point to last record found), in order to give same performance from the hash.c::hash_next() function.  Otherwise a new field could be created which would hold the index of next instead of current.  Within hash.c, HASH::current_record is only used to find the next record currently.  If you need more explanation to resolve this issue, just ask.  The following code will work optimally as is.

An alternative "MTF if not 2nd" algorithm, would just need to change the line:
	if (prev)
	if (prev && prev!=first)

gptr hash_search(HASH *hash,const byte *key,uint length)
  HASH_LINK *pos,*first,*prev;
  uint first_idx,idx,prev_idx;
  byte *data;

  if (hash->records)
    idx=hash_mask((*hash->calc_hashnr)(key,length ? length :
      pos= dynamic_element(&hash->array,idx,HASH_LINK*);
      if (!hashcmp(hash,pos,key,length))
	/* Do "move to first" algorithm of Zobel, Heinz, Williams, suggested and added by Shelby Moore */
	if (prev)
	  hash->current_record= prev_idx;
	  hash->current_record= idx;
	DBUG_PRINT("exit",("found key at %d",idx));
	DBUG_RETURN (pos->data);
      if (!prev)
	if (hash_rec_mask(hash,pos,hash->blength,hash->records) != idx)
	  break;				/* Wrong link */
    while ((idx=pos->next) != NO_RECORD);
  hash->current_record= NO_RECORD;
[13 Jan 2005 13:46] Shelby Moore
Add one correction or additional datum, where I stated, "(2) Realize the above algorithm (in #1) is no different in terms of searching performance than a standard hashing algorithm...", add that the current algorithm calculates the hash twice if first record is not a KEY match.

This can be another reasons that duplicates are noteably slower in current algorithm (as mentioned in MySQL docs).

So using a faster hash function (with no multiplications) and using "MTF" so the MRU entry is usually first, will probably be worthwhile.
[13 Jan 2005 13:55] Shelby Moore
For quick comparison, here is the current hash.c::hash_search() function without "MTF" optimization:

gptr hash_search(HASH *hash,const byte *key,uint length)
  HASH_LINK *pos;
  uint flag,idx;

  if (hash->records)
    idx=hash_mask((*hash->calc_hashnr)(key,length ? length :
      pos= dynamic_element(&hash->array,idx,HASH_LINK*);
      if (!hashcmp(hash,pos,key,length))
	DBUG_PRINT("exit",("found key at %d",idx));
	hash->current_record= idx;
	DBUG_RETURN (pos->data);
      if (flag)
	flag=0;					/* Reset flag */
	if (hash_rec_mask(hash,pos,hash->blength,hash->records) != idx)
	  break;				/* Wrong link */
    while ((idx=pos->next) != NO_RECORD);
  hash->current_record= NO_RECORD;
[13 Jan 2005 13:58] Shelby Moore
If possible, I kindly request that if you integrate my source code improvements for hash_search() into main MySQL distribution, that you kindly retain the comment I inserted that credits the authors of the "MTF" algorithm and mentions my name.  Giving proper credit is always a good motivation for future contributions.
[13 Jan 2005 19:27] Shelby Moore

My tests now reveal that HEAP hashing engine is NOT duplicating the KEY data, at least when the KEY is char(4) or greater in length.

Can someone please re-open the following bug report:


I am posting a new comment on that bug report to justify what I have written above and explain that the documented memory storage requirements are incorrect, at least when the KEY is char(4) or greater in length.

Other than re-opening the above bug report, then only the "MTF" optimization I have contributed needs to be done.  Then this "feature request" bug report can be closed as fixed/done.
[14 Jan 2005 21:05] Shelby Moore
Based on your latter comments in:


Do you need me to make the "MTF" optimization edits to "heap/hash.c" as well?

If yes, I can probably not do it until February, as I will traveling from Monday.

I do believe the "MTF" optimization should be tested, as I think it will eliminate the slowdown for duplicate KEYs as mentioned in the manual.

If it works as expected, then perhaps it should be a toggleable option somehow, either at the SQL level (e.g. new index keyword "HASH_MTF"), or a compile time option.

Let me know if you are waiting for me to do something?  Please also let me know the desire for "MTF"?
[14 Jan 2005 21:08] Shelby Moore
I am modifying the bug title/synopsis to:

HEAP table slowdown for duplicate keys can possibly be fixed with "MTF"
[14 Jan 2005 21:11] Shelby Moore
Modified the title of bug again, as I remembered we should also test the faster bitwise hash function:

"HEAP table faster for duplicate keys with 'MTF' & bitwise hash function"
[14 Jan 2005 21:20] Jim Winstead
MTF is not likely to be of much benefit for MEMORY tables, because it would add a requirement to 
acquire a write lock for the table when reading. (Don't forget that the server is multi-threaded, 
and there may be more than one thread reading from the table at once.)
[14 Jan 2005 21:38] Shelby Moore
We still have the faster bitwise hash to consider regardless.  See previous comments.

And you could still do "MTF" when you had a write lock, e.g. UPDATE ... WHERE ...

For large databases, UPDATES are often much more frequent than INSERTs, and frequently updated data may correspond to frequently read (SELECT) data in some applications.

Also, my understanding is you only need to write lock during a microsecond critical section which could probably accomplish without a queue and simple semaphore, i.e. no need to use the TABLE lock overhead:


I don't know what they overhead of a semaphore is relative to the slowdown experienced with duplicate KEYs.
[16 Jan 2005 1:20] Shelby Moore
I have discovered a new very important optimization needed that partially pertains to this bug report as well:

HEAP table HASH not optimized for searches that usually do not match KEY:

[16 Jan 2005 12:50] Sergei Golubchik
About bitwise hash. A very important property that our
implementation relies on - any number of bits of the hash
must be good. Because if the dymanic array is (at the
moment) of N elements, then for each element we use only
last n bits of the hash (where n is number of bits in N).
Say, for 800-element array, hash number is actually
calculated_hash(key) & 0x3FF.

Usually hash function should only be good as a whole, but we
need it mod 2^N to be good too for every N.

This is what we have to test (I'm explaining why not every
"generally good" hash function is good in our hashing

About MTF - because of that dynamic expanding of hash number
(by using more bits of it as new elements are added to the
hash), I expect the number of collisions to be low. What
does SHOW KEYS return in cardinality column for you ? And
when there are identical values MTF will help not that much
as the query typically is ... WHERE col=value and MySQL will
have to traverse the complete linked list anyway. As far as
I understand, MTF is mostly useful when there are many
*different* keys in the hash that are hashed to the same
number and there's significantly skewed access pattern -
some keys are needed much more often than others. That is
MTF is a perfect approach for word dictionaries (this is
what the article used it for). It also means I can use it
internally in MySQL whenever I need to store words for
fulltext search :) There are also other applications where
MTF is useful, but for generic HEAP table it must be
optional (if at all). In any case it seems to be a useful
addition to mysys/hash.c - the internally used hash. E.g.
table cache would benefit from it, I'm sure - some tables
are, indeed, used much more often than others :)
[16 Jan 2005 20:32] Shelby Moore
(1) The bitwisehash() hash function I suggested above allows you to set the # of bits of output using TSIZE.

(2) The research paper claims bitwisehash() gives as good a distribution as other hash functions, which have a multiply and/or divide (modulo) for every byte of KEY.  And it is as good for all TSIZE, even for TSIZE which is not prime and for TSIZE which is powers of 2.

(3) Notice bitwisehash() only has bit operations per byte of KEY and a single modulo operation for the entire hash result, so it should be much faster than the hash currently used, which has a multiply for every byte of KEY.

FYI, apparently freeware FastDB uses bitwisehash(), except they multiply by 31 instead of left shift by 5:


(4) Agreed we should test bitwisehash() to verify research paper's claim.

(5) You are wrong to assume that the linked list length will be 1, when the cardinality is maximum.

Your comment above misses a very important aspect of the way MySQL currently hashes.  In fact, I think the manual is wrong as well, because it says "...degree of slowdown is proportional to the degree of duplication (or, inversely proportional to the index cardinality...":


I got off to thinking about this because it affected whether I should post the bug report below:


Apparently you and the manual are assuming that the hash will return a linked list of rows which all have a KEY which matches the hash of the KEY in the first item of the linked list, and this is impossible.

This fact is because the linked list is also the buckets of hash table.  Thus when adding rows, if the bucket is occuppied that a row's KEY hashes to, and the occupied row has KEY which hashes to the value of that bucket, then the HASH_LINK for that KEY has to placed in a bucket which has a different hash value.  Else the occuppied row is moved to a different bucket.  In either case, the linked list takes care of making sure the row is found on hash_search().  This is why the hash.c::hash_search() function only tests the first HASH_LINK in the linked list to see if the KEY exists or not.

The above paragraph is very convoluted to explain, but if you think deeply about the fact that the linked list is also the hash buckets (benefits: no memory wasted and hash table size maximized), then you should realize that the drawback is that EVEN KEYS WHICH ARE NOT EQUAL, will end up causing linked lists which are not length of 1.

Thus "MTF" is very applicable because it bubbles Most Recently Used (MRU) items of the linked lists to the beginning of the linked lists.  "MTF" takes advantage of the usual Gaussian distribution of most things in nature and thus that some rows which be accessed more frequently than others.

(6) FYI, but irrelevant to points above, the cardinality of my current use is maximum (e.g. equal to number of records), because it is the PRIMARY KEY (a unique key).  However, this does not diminish the need for "MTF" for the reason given in #5 above in this comment.
[16 Jan 2005 20:44] Shelby Moore
Also please do not take my #5 in above comment to be critical, and excuse me because I wrote that as fast as possible because I had to leave in 15 minutes for a 3 day trip.

I very much appreciate you taking the time to consider these optimizations carefully and the degree of inspection of details among contributors here is quite inspiring!  Thanks!
[19 Jan 2005 15:36] Shelby Moore
Note the only way the previous 2 comments are not true is if the hash function gives perfect distribution in all cases, i.e. that no KEYs ever have the same hash value.  So it is not duplicates of KEYs but duplicates of hash values that I think the manual must be referring to.  If yes, IMHO the manual should be edited to make this more clear.
[19 Jan 2005 19:04] Sergei Golubchik
Manual is talking about duplicate key values, not duplicate hash values.
Of course, if hash function is bad it'll only make things worse.

When you wrote about "maximum cardinality" - did you mean number of distinct values in the index or "Cardinality" column from SHOW KEYS command ? (I was asking about the latter)

About your comment that "EVEN KEYS WHICH ARE NOT EQUAL,
will end up causing linked lists".
This is not the case - in any linked list all keys have the same hash value.
But I cannot really explain why, because you seem to understand how our
hash works - you described it correctly in your comment
from [2005-01-16 20:32:05]. So, I don't see why you think that "even keys
which are not equal will end up causing linked lists".

[19 Jan 2005 20:04] Shelby Moore
First let me state that my understanding of the hashing algorithm is purely theoretical, based on studying the source code in mysys/hash.c, meaning I have not actually studied the actual values in hash tables and linked lists in a running application.  I think my understanding is more correct than conclusions from anecdotal snapshots of real data.

Perhaps it is easiest to explain the defense (logic) of my assertion if I ask some concise (bordering on rhetorical) questions about the algorithm currently used:

(1) Is it possible that the hash function can map non-duplicate KEY values to same hash value (i.e. same hash table bucket)?

(2) Do we agree that the # of hash table buckets == # of records?

(3) In what bucket is HASH_LINK placed for a KEY if the hash value for that KEY is a duplicate (i.e. the hash value of the KEY is the same as the hash value of another KEY already in the index)?

The only possible answer to #3 is that it must be that it is placed into a bucket which does not match the hash value and the linked list from the bucket with correct hash value points to the bucket it is placed in.

(4) So then in what bucket is HASH_LINK placed for a KEY if the hash value for that KEY is the bucket which the key in #3 was placed?

There are only two possible answers (at this point in logic flow):

   (4-a) The KEY from #3 is moved to another bucket (and the linked list from first duplicate hash value follows it) so that this #4 KEY can be placed in the bucket which matches it's hash value.

   (4-b) This new #4 KEY is placed into a bucket that does not match it's hash value and linked list from #3 KEY will point to it.  Obviously this would not be allowed, because then the #4 KEY would not be first in linked list and the #3 KEY hash does not match bucket (an obvious requirement of hash_search()).  So only (4-a) is correct answer.

(5) To which bucket do we move #3 KEY to?

The only possible answer is to move it to a bucket where the hash value of existing KEY in the bucket does not match the bucket.  Such a bucket will always exist because the # of buckets == # of records.

Conclusion: is that I agree with you that any linked lists will have KEYs that give same hash value (meaning I was incorrect if I stated otherwise).  However, I disagree with you that KEYs with duplicate hash values will not create linked lists.  Thus the slowdown is proportional to the duplication of hash values, not only duplicate KEY values.

For my application, the Cardinality == # of records, because it uses only UNIQUE KEY, so all KEY values are distinct.
[28 Jan 2005 18:40] Sergei Golubchik
Sorry for delayed reply :(

Your understanding is very good! Here're my answers:

(1) Yes, of course. One cannot have a perfect hash for an arbitrary
    unknown in advance dataset.

(2) Yes.

(3) Correct.

(4) Correct, it's (4-a)

(5) Correct.

Looks like earlier it was some misunderstanding.
I did not try to say that "KEYs with duplicate hash values will not
create linked lists". They will, of course. It's cardinality of hash
values what is really important, not cardinality of key values.
(that's why I asked about "Cardinality" as reported by SHOW INDEX - as it
depends on hash values, not key values).

The manual speaks about "duplicated key values" because a user cannot
really check whether he has duplicated hash values or not, he can only
see key values. And of course if there's high duplication in *key* values,
no hashing can fix the situation.
[29 Jan 2005 10:07] Shelby Moore

Thus we still need 3 improvements:

(1) Add a footnote to manual that slowdown is really caused by duplication (inverse cardinality) of hashes, but which can not be seen and is often approximated by inverse cardinality of KEY values.

(2) Test and use the faster hash function I provided above, if it is better, because the hash function has to run twice on querys that have duplicate hashs (more than one HASH_LINK linked from bucket).

(3) Consider in future, implementing MTF (as a compile option?) as it imparoves queries on duplicates, by moving MRU to top of HASH_LINK list.

The reason #1 is needed is because otherwise users do not understand that they can get a slowdown even with non-duplicate KEY values if the values exhibit certain periodic frequencies.  And also to prevent people like me from asking the same question again. :)