Bug #7817 HEAP table faster for duplicate keys with "MTF" & bitwise hash function
Submitted: 11 Jan 2005 20:33 Modified: 26 Jun 2005 10:16
Reporter: Shelby Moore
Status: Open
Category:Server: Memory Severity:S4 (Feature request)
Version:4.0.17 OS:FreeBSD (FreeBSD)
Assigned to: Target Version:
Tags: Contribution
Triage: D5 (Feature request)

[11 Jan 2005 20:33] Shelby Moore
Description:
Please see my comment:

"11 Jan 8:20pm] Shelby Moore"

in following bug report:

http://bugs.mysql.com/bug.php?id=7791

I believe that storage requirements of the HEAP table can be reduced by often orders of
magnitude, and thus performance increased by orders of magnitude!

The optimizations needed are quite simple to implement.

Putting a copy of KEY in index because it takes one less memory read, doesn't make sense
if it causes the L2 cache to page to main memory, and causes the table data to page to
disk.

And no need to have huge dynamic hash tables with the very simple "move to first"
algorithm.

How to repeat:
See the description please.

Suggested fix:
See the description please.
[12 Jan 2005 11:00] Shelby Moore
Let me make the proposed optimizations more clear and give examples why they are
necessary:

(1) "move to first" hashing:

http://www.cs.rmit.edu.au/~jz/fulltext/ipl01.pdf

First, I should ask if MySQL's "HEAP" (aka "MEMORY") storage engine already implements
"move to first" hashing??

My guess is it is not optimized, based on comments in the MySQL docs, "degree of slowdown
is proportional to the degree of duplication (or, inversely proportional to the index
cardinality)".

This optimization is very simple to implement.  When accessing the linked list of records
in the hash buck, if the found record is not first in the linked list, then move it to
first.  This requires only 3 pointer writes.  The research paper in above link is an easy
read.  Check it out.

As a result, hashing performance does not degrade significantly, even if the size of the
hash buckets table is not increased.  Without "move to first" hashing optimization, the
hash bucket tables have to be dynamically doubled in size as load (average linked list
length) increases.  An example given in the research paper in above link, shows that for
a very large database, a 512KB hash bucket table with "move to first" hashing is only a
few percent slower than a 32MB hash bucket table without "move to first" hashing!!

I think implementing the optimization would be very simple.  Just add a small snippet of
code that tests if the found record is not first in the linked list, then move it to
first.  This requires only 3 pointer writes.  Point me to the correct file of source code
and I can probably make this change in 30 minutes or less.

The other required change to the source code would be to stop increasing the size of the
hash bucket table when it reaches a limit.  I would suggest the default limit would be
based on the size of the KEY times number of records (per the benchmarks in the research
above).  This limit could be optionally configurable from the CREATE TABLE syntax in
future.

Consider the following example usage to see why this is a critically needed
optimization.

Example #1: Consider example #2 in the case where you have 127 HEAP tables, not
necessarily related to each other.  Using hash bucket table that are excessively large
(very sparsely populated) is a huge waste of memory when many HEAP tables are used. 
Which I think is quite common in MySQL's general use of memory temporary tables.

Example #2: a MyISAM table with varchar(127) and 6 bytes of other field data, is
currently populated with 4 million records, and consumes less than 200MB of disk for
index + datafile.  It runs at least 100x too slow to be of practical use.  It must be
moved into memory.  It will grow to at least 12 million records.  The only practical
options are to host the MyISAM in Ramdisk or to use HEAP tables.  The Ramdisk option is
less than optimal hack.  To use HEAP, we observe that the current average size of the
varchar(127) is 13 bytes, and the char() string is not changed after record is inserted,
so we can break the MyISAM table into 127 tables, each with a constant char() field
ranging from 1 - 127 length.  The has advantage of making table locking more granular as
well.

However this example use of HEAP becomes impossible if the hash bucket tables are allowed
to expand.  Imagine 127 * 32MB > 4GB (can not fit into Intel 32 bit memory pointer).  And
even if hash bucket tables are kept small, then without "move to first" optimization,
then performance will suffer significantly with increasing KEY duplication (i.e. large
datasets).
[12 Jan 2005 11:40] Shelby Moore
(2) "Do always not store 2 copies of KEY data" optimization:

For CREATE TABLE scheme where the max length of the KEY field is large compared to the
length of all fields in the table row, then duplicating the a copy of the KEY data in the
hash index linked list is not optimal.  Instead, just follow the pointer to record in the
linked list to get the value of the KEY.

I am assuming a data structure for MySQL "HEAP" indexes, where the hash buckets are
pointers.  These hash bucket pointers point (or null if no entries) to a linked list
structure that contains 2 pointers, one to the record and the other to the next entry in
the linked list (or null if none).

The linked list structures currently ALWAYS contain a copy of the KEY field.  I assume
this is to speed up searching the linked list, because you do not have to read the KEY
field from the record (which could case a memory from disk page fault).  This is
reasonable strategy for small KEY field sizes.  Note that "move to first" algorithm makes
any gain from copy of KEY data neglible.  And more importantly, this tradeoff swings to
side of not putting a copy of KEY in the linked list, when the size of the KEY field
becomes large (because then linked lists become large and cause L2 cache from memory page
fault), and especially when the size of KEY field is large compared to record size, as
overall memory consumption can be double (which of course can lead to memory from disk
page fault generally in the system).

Consider the following example usage to see why this is a critically needed
optimization.

Example #3:  Imagine a table schema with 1 of char(64) fields.  The current memory
storage requirement would be (64 + 8) + 64.  The linked lists consume (64 + 8) per row. 
Without the copy of the KEY field data in the index linked lists, then memory storage
requirement would be 8 + 640.  The linked lists would then consume 8 per row.

That means a memory consumption reduction from 136 to 70 per row, or 49% reduction. 
Increasing memory from 70 to 136, is 94% (almost double).  And for the linked list index
it is 8 to 70, is 775% (almost an order of magnitude!!!!!).    So for example a linked
list that would fit into 512KB L2 cache would consume 4MB and not fit without this
optimization.

Example #4: for my usage in Example #2, this optimization means 33% reduction in overall
memory consumption, e.g. estimating > 256MB (when we reach 12 million records).  And more
importantly the linked list memory (and thus hashing) will be much more likely to fit in
L2 cache and be orders of magnitude faster in performance (and even more so if "move to
first" is also implemented).
[12 Jan 2005 12:06] Shelby Moore
(3) "Bitwise hashing" optimization

The research paper claims this is the most optimal hash function.  It is much faster than
others (no multiplies or divisions) and is equivalently optimal in distribution.  And that
prime number sizes are not necessary for hash bucket tables.  Powers of 2 work as well.

I assume MySQL HEAD already does this for char KEYs:

/* Author J. Zobel, April 2001.
   Permission to use this code is freely granted, provided that this
   statement is retained. */

/* Bitwise hash function.  Note that tsize does not have to be prime. */
#define TSIZE	1048576
#define SEED	1159241

unsigned int
bitwisehash(char *word, int tsize, unsigned int seed)
{
    char	c;
    unsigned int h;

    h = seed;
    for( ; ( c=*word )!='\0' ; word++ )
    {
	h ^= ( (h << 5) + c + (h >> 2) );
    }
    return((unsigned int)((h&0x7fffffff) % tsize));
}
[13 Jan 2005 0:35] Sergei Golubchik
Thanks! It was an interesting and thought-provoking reading :)

We don't do curently "move-to-first" optimization. I expect it won't be that efficient in
the hashing code that MySQL has. You see, we don't use standard two-level "hash table -
linked list for collisions" scheme. If you're curious - check mysys/hash.c function - but
it's no easy reading, (it took me quite a while few years ago to understand how it works -
it was something I never saw before).
In mysys/hash.c approach there are no explicit hash buckets, no explicit linked lists.
Size of hash is always equal to the number of elements in it, no waste of memory (for
sparsely populated buckets) whatsoever.

About "do always not store 2 copies of KEY data" optimization.
Yes, this could be done. We'll think about it (if it's easy and won't require big code
changes it could be done relatively fast)

"Bitwise hashing" - again, it should be tested how well it performs in our hashing
scheme.
[13 Jan 2005 13:04] Shelby Moore
This is getting more interesting!

I have studied mysys/hash.c and mysys/hash.h.  In 8 minutes, I think I understand how it
works.

(1) The important function to understand first is hash.c::hash_search() and the functions
it calls.

The data structure is dynamic array of hash table buckets (hash.h::HASH_LINK), which are
also the linked list.  There is a linked list.  The # of buckets equals the number of
records.  The buckets point to other buckets in the chain of a hash match.

The algorithm for searching is:

* calculate hash index of search key into hash table array
* In a loop, test for key match in each linked node in hash table array
* If record of 1st node in loop has a key whose hash index does not match the calculated
hash index, then search key does not exist and exit loop
* loop ends with last linked node

(2) Realize the above algorithm (in #1) is no different in terms of searching performance
than a standard hashing algorithm in that it maps keys to hash table buckets, then walks a
linked list (where the links are offsets into array instead of pointers) of record
pointers whose key matches hash value of search key.  The difference in performance is
the size (in buckets) of the hash table is ALWAYS equal to the number of records.  The
other difference is that is never has an unused bucket.

(3) In most (if not all) cases the algorithm (in #1 above) uses the same memory as
compared to "move to first" ("MTF") algorithm.  I was wrong about the order of magnitude
difference.  The only difference is up to 100% overuse from the duplication of they KEY
data (see #7 below), which is independent of the hashing algorithm used (e.g. "MTF" or
not).

The only time the current algorithm (in #1 above) would use less memory than "MTF", is if
the hash table buckets were sparsely populated (there were unused buckets) for the "MTF"
algorithm, but this is never the case.

To prove this, look at "Table 1" and "Table 3" in the research paper I gave a link to. 
First realize from #2 above that the algorithm in #1 above is in terms of search
performance, equivalent to standard non-"MTF" hash algorithm, with the only difference
being the size of the hash table used.  Study the "MTF" hash tables sizes needed to match
the performance of the non-"MTF" where hash table size is dynamically increased.  Note
that "Small AP" has 74,439 records (from "Table 1"), and an 8KB "MTF" hash table (1K
buckets hash table size) can match the performance of a 0.26MB (33K buckets hash table
size) dynamically increased final size.  Given that algorithm above in #1 will max out at
74,439 buckets hash table size, and safely assume given that the 1K buckets of "MTF"
algorithm will not be sparsely populated with 74,439 records.  Taking into consideration
that dynamically growing incrementally (as each record added) vs. "doubling" might be
faster and taking into account 74,439 final size vs. 33K buckets, a more reasonable
comparison would be no more than 64KB for the MTF hash table size (8K buckets).  Still
very unlikely for the 8K buckets to be sparsely populated with 74,439 records.

For "Large Web" a matching performance 1MB "MTF" hash table (128K buckets) will not
likely be sparsely populated with 9 million records.

(4) Thus the difference in performance between the "MTF" and algorithm currently used (in
#1 above) is that "MTF" does not have the overhead of dynamic sizing of the hash table
array, "MTF" can use a faster hash function that has no multiplications, and "MTF" will
not slow down as much with (inversely to) index cardinality.  I quote from the research
paper, "...while a table without move-to-front is perhaps 40% slower again...".  
However, the # of hash buckets (hash table size) is larger for algorithm currently used
(in #1 above), yet I doubt that the distribution of it's hash function is perfect.

(5) So they appear to mostly equal, except I think "MTF" will be significantly superior
when there are many duplicate keys (because we can assume records are not accessed with
equal frequency).  Thus implementing "MTF" could remove a caveat from the docs that
advises use of BTREE:

http://dev.mysql.com/doc/mysql/en/MEMORY_storage_engine.html

(6) Implementing "MTF" in the algorithm current used (in #1 above) is easy.  In
hash.c::hash_search(), when we find a match, we copy the first node to the found node
position in the array, vice versa, and then we still need to do the 3 data pointer
writes.  This would add the best of "MTF" to the best of current algorithm.  It would
slow the algorithm slightly overall in return for less slowdown inversely to index
cardinality.

Actually there is an optimization.  We only need to copy the data pointers as the indexes
to next do not get swapped.  So actually the current algorithm is able to implement "MTF"
with only 2 writes instead of 3 for normal hash table linked list!

Do you want me to do it?

(7) I have not yet figured out where the duplicate copy of the KEY data is stored. I
think you need to point me to source that calls into hash.c.  Looks like it may be stored
in the record itself?  At least it requires reading the data pointer in the linked list of
HASH_INDEX.  Why store two copies of the key, when you are doing an extra indirection any
way (might as well just get the value from the field in the record)??

(8) I think the hash function quoted from the research paper could be used for current
algorithm (in #1 above).  The only requirement appears to be that the hash distributes
equally across a non-prime (because = # of records) # of buckets.  The hash function in
research paper does that and does it without using multiplications.  I might experiment
with that hash function within next 3 weeks and report my results back here.
[13 Jan 2005 13:12] Shelby Moore
I suggest that some explanations from this page would be very useful if added to docs or
technical article for MySQL, to explain hashing data structures, hashing algorithms, and
the tradeoffs, to further convince people that MySQL is (will be after we finish this)
optimal.
[13 Jan 2005 13:18] Shelby Moore
For the curious, off topic and FYI only, the examples I have discussed and our motivation
for optimal HEAP tables is for a new statistical algorithm (superior to Bayesian) for
anti-spam.  The current website (http://AccuSpam.com) does not yet reflect the new
algorithm we are implementing/testing.
[13 Jan 2005 14:31] Shelby Moore
As promised, I have done the necessary coding.  Here follows the hash.c::hash_search()
function which implements "MTF"!  Just plug this in.

Contrary to what I wrote earlier, "MTF" in hash_search() requires 5 writes instead of the
3 with normal linked-list "MTF", but I suspect this is insignificant performance
overhead.

Note in the "MTF" case, the meaning (semantics) of HASH::current_record is changed (it
does not point to last record found), in order to give same performance from the
hash.c::hash_next() function.  Otherwise a new field could be created which would hold
the index of next instead of current.  Within hash.c, HASH::current_record is only used
to find the next record currently.  If you need more explanation to resolve this issue,
just ask.  The following code will work optimally as is.

An alternative "MTF if not 2nd" algorithm, would just need to change the line:
	if (prev)
To:
	if (prev && prev!=first)

gptr hash_search(HASH *hash,const byte *key,uint length)
{
  HASH_LINK *pos,*first,*prev;
  uint first_idx,idx,prev_idx;
  byte *data;
  DBUG_ENTER("hash_search");

  prev=0;
  if (hash->records)
  {
    idx=hash_mask((*hash->calc_hashnr)(key,length ? length :
					hash->key_length),
		    hash->blength,hash->records);
    do
    {
      pos= dynamic_element(&hash->array,idx,HASH_LINK*);
      if (!hashcmp(hash,pos,key,length))
      {
	/* Do "move to first" algorithm of Zobel, Heinz, Williams, suggested and added by Shelby
Moore */
	if (prev)
	{
	  data=first->data;
	  first->data=pos->data;
	  pos->data=data;
	  prev->next=pos->next;
	  pos->next=first->next;
	  first->next=idx;
	  idx=first_idx;
	  pos=first;
	  hash->current_record= prev_idx;
	}
	else
	{
	  hash->current_record= idx;
	}
	DBUG_PRINT("exit",("found key at %d",idx));
	DBUG_RETURN (pos->data);
      }
      if (!prev)
      {
	if (hash_rec_mask(hash,pos,hash->blength,hash->records) != idx)
	  break;				/* Wrong link */
	first=pos;
	first_idx=idx;
      }
      prev=pos;
      prev_idx=idx;
    }
    while ((idx=pos->next) != NO_RECORD);
  }
  hash->current_record= NO_RECORD;
  DBUG_RETURN(0);
}
[13 Jan 2005 14:46] Shelby Moore
Add one correction or additional datum, where I stated, "(2) Realize the above algorithm
(in #1) is no different in terms of searching performance than a standard hashing
algorithm...", add that the current algorithm calculates the hash twice if first record
is not a KEY match.

This can be another reasons that duplicates are noteably slower in current algorithm (as
mentioned in MySQL docs).

So using a faster hash function (with no multiplications) and using "MTF" so the MRU
entry is usually first, will probably be worthwhile.
[13 Jan 2005 14:55] Shelby Moore
For quick comparison, here is the current hash.c::hash_search() function without "MTF"
optimization:

gptr hash_search(HASH *hash,const byte *key,uint length)
{
  HASH_LINK *pos;
  uint flag,idx;
  DBUG_ENTER("hash_search");

  flag=1;
  if (hash->records)
  {
    idx=hash_mask((*hash->calc_hashnr)(key,length ? length :
					 hash->key_length),
		    hash->blength,hash->records);
    do
    {
      pos= dynamic_element(&hash->array,idx,HASH_LINK*);
      if (!hashcmp(hash,pos,key,length))
      {
	DBUG_PRINT("exit",("found key at %d",idx));
	hash->current_record= idx;
	DBUG_RETURN (pos->data);
      }
      if (flag)
      {
	flag=0;					/* Reset flag */
	if (hash_rec_mask(hash,pos,hash->blength,hash->records) != idx)
	  break;				/* Wrong link */
      }
    }
    while ((idx=pos->next) != NO_RECORD);
  }
  hash->current_record= NO_RECORD;
  DBUG_RETURN(0);
}
[13 Jan 2005 14:58] Shelby Moore
If possible, I kindly request that if you integrate my source code improvements for
hash_search() into main MySQL distribution, that you kindly retain the comment I inserted
that credits the authors of the "MTF" algorithm and mentions my name.  Giving proper
credit is always a good motivation for future contributions.
[13 Jan 2005 20:27] Shelby Moore
UPDATE:

My tests now reveal that HEAP hashing engine is NOT duplicating the KEY data, at least
when the KEY is char(4) or greater in length.

Can someone please re-open the following bug report:

http://bugs.mysql.com/bug.php?id=7791

I am posting a new comment on that bug report to justify what I have written above and
explain that the documented memory storage requirements are incorrect, at least when the
KEY is char(4) or greater in length.

Other than re-opening the above bug report, then only the "MTF" optimization I have
contributed needs to be done.  Then this "feature request" bug report can be closed as
fixed/done.
[14 Jan 2005 22:05] Shelby Moore
Based on your latter comments in:

http://bugs.mysql.com/7791

Do you need me to make the "MTF" optimization edits to "heap/hash.c" as well?

If yes, I can probably not do it until February, as I will traveling from Monday.

I do believe the "MTF" optimization should be tested, as I think it will eliminate the
slowdown for duplicate KEYs as mentioned in the manual.

If it works as expected, then perhaps it should be a toggleable option somehow, either at
the SQL level (e.g. new index keyword "HASH_MTF"), or a compile time option.

Let me know if you are waiting for me to do something?  Please also let me know the
desire for "MTF"?
[14 Jan 2005 22:08] Shelby Moore
I am modifying the bug title/synopsis to:

HEAP table slowdown for duplicate keys can possibly be fixed with "MTF"
[14 Jan 2005 22:11] Shelby Moore
Modified the title of bug again, as I remembered we should also test the faster bitwise
hash function:

"HEAP table faster for duplicate keys with 'MTF' & bitwise hash function"
[14 Jan 2005 22:20] Jim Winstead
MTF is not likely to be of much benefit for MEMORY tables, because it would add a
requirement to 
acquire a write lock for the table when reading. (Don't forget that the server is
multi-threaded, 
and there may be more than one thread reading from the table at once.)
[14 Jan 2005 22:38] Shelby Moore
We still have the faster bitwise hash to consider regardless.  See previous comments.

And you could still do "MTF" when you had a write lock, e.g. UPDATE ... WHERE ...

For large databases, UPDATES are often much more frequent than INSERTs, and frequently
updated data may correspond to frequently read (SELECT) data in some applications.

Also, my understanding is you only need to write lock during a microsecond critical
section which could probably accomplish without a queue and simple semaphore, i.e. no
need to use the TABLE lock overhead:

	  data=first->data;
	  first->data=pos->data;
	  pos->data=data;
	  prev->next=pos->next;
	  pos->next=first->next;
	  first->next=idx;

I don't know what they overhead of a semaphore is relative to the slowdown experienced
with duplicate KEYs.
[16 Jan 2005 2:20] Shelby Moore
I have discovered a new very important optimization needed that partially pertains to this
bug report as well:

HEAP table HASH not optimized for searches that usually do not match KEY:

http://bugs.mysql.com/7936
[16 Jan 2005 13:50] Sergei Golubchik
About bitwise hash. A very important property that our
implementation relies on - any number of bits of the hash
must be good. Because if the dymanic array is (at the
moment) of N elements, then for each element we use only
last n bits of the hash (where n is number of bits in N).
Say, for 800-element array, hash number is actually
calculated_hash(key) & 0x3FF.

Usually hash function should only be good as a whole, but we
need it mod 2^N to be good too for every N.

This is what we have to test (I'm explaining why not every
"generally good" hash function is good in our hashing
algorithm).

About MTF - because of that dynamic expanding of hash number
(by using more bits of it as new elements are added to the
hash), I expect the number of collisions to be low. What
does SHOW KEYS return in cardinality column for you ? And
when there are identical values MTF will help not that much
as the query typically is ... WHERE col=value and MySQL will
have to traverse the complete linked list anyway. As far as
I understand, MTF is mostly useful when there are many
*different* keys in the hash that are hashed to the same
number and there's significantly skewed access pattern -
some keys are needed much more often than others. That is
MTF is a perfect approach for word dictionaries (this is
what the article used it for). It also means I can use it
internally in MySQL whenever I need to store words for
fulltext search :) There are also other applications where
MTF is useful, but for generic HEAP table it must be
optional (if at all). In any case it seems to be a useful
addition to mysys/hash.c - the internally used hash. E.g.
table cache would benefit from it, I'm sure - some tables
are, indeed, used much more often than others :)
[16 Jan 2005 21:32] Shelby Moore
(1) The bitwisehash() hash function I suggested above allows you to set the # of bits of
output using TSIZE.

(2) The research paper claims bitwisehash() gives as good a distribution as other hash
functions, which have a multiply and/or divide (modulo) for every byte of KEY.  And it is
as good for all TSIZE, even for TSIZE which is not prime and for TSIZE which is powers of
2.

(3) Notice bitwisehash() only has bit operations per byte of KEY and a single modulo
operation for the entire hash result, so it should be much faster than the hash currently
used, which has a multiply for every byte of KEY.

FYI, apparently freeware FastDB uses bitwisehash(), except they multiply by 31 instead of
left shift by 5:

http://www.garret.ru/~knizhnik/fastdb/FastDB.htm#hashtable

(4) Agreed we should test bitwisehash() to verify research paper's claim.

(5) You are wrong to assume that the linked list length will be 1, when the cardinality
is maximum.

Your comment above misses a very important aspect of the way MySQL currently hashes.  In
fact, I think the manual is wrong as well, because it says "...degree of slowdown is
proportional to the degree of duplication (or, inversely proportional to the index
cardinality...":

http://dev.mysql.com/doc/mysql/en/MEMORY_storage_engine.html

I got off to thinking about this because it affected whether I should post the bug report
below:

http://bugs.mysql.com/7936

Apparently you and the manual are assuming that the hash will return a linked list of
rows which all have a KEY which matches the hash of the KEY in the first item of the
linked list, and this is impossible.

This fact is because the linked list is also the buckets of hash table.  Thus when adding
rows, if the bucket is occuppied that a row's KEY hashes to, and the occupied row has KEY
which hashes to the value of that bucket, then the HASH_LINK for that KEY has to placed
in a bucket which has a different hash value.  Else the occuppied row is moved to a
different bucket.  In either case, the linked list takes care of making sure the row is
found on hash_search().  This is why the hash.c::hash_search() function only tests the
first HASH_LINK in the linked list to see if the KEY exists or not.

The above paragraph is very convoluted to explain, but if you think deeply about the fact
that the linked list is also the hash buckets (benefits: no memory wasted and hash table
size maximized), then you should realize that the drawback is that EVEN KEYS WHICH ARE
NOT EQUAL, will end up causing linked lists which are not length of 1.

Thus "MTF" is very applicable because it bubbles Most Recently Used (MRU) items of the
linked lists to the beginning of the linked lists.  "MTF" takes advantage of the usual
Gaussian distribution of most things in nature and thus that some rows which be accessed
more frequently than others.

(6) FYI, but irrelevant to points above, the cardinality of my current use is maximum
(e.g. equal to number of records), because it is the PRIMARY KEY (a unique key). 
However, this does not diminish the need for "MTF" for the reason given in #5 above in
this comment.
[16 Jan 2005 21:44] Shelby Moore
Also please do not take my #5 in above comment to be critical, and excuse me because I
wrote that as fast as possible because I had to leave in 15 minutes for a 3 day trip.

I very much appreciate you taking the time to consider these optimizations carefully and
the degree of inspection of details among contributors here is quite inspiring!  Thanks!
[19 Jan 2005 16:36] Shelby Moore
Note the only way the previous 2 comments are not true is if the hash function gives
perfect distribution in all cases, i.e. that no KEYs ever have the same hash value.  So
it is not duplicates of KEYs but duplicates of hash values that I think the manual must
be referring to.  If yes, IMHO the manual should be edited to make this more clear.
[19 Jan 2005 20:04] Sergei Golubchik
Manual is talking about duplicate key values, not duplicate hash values.
Of course, if hash function is bad it'll only make things worse.

When you wrote about "maximum cardinality" - did you mean number of distinct values in
the index or "Cardinality" column from SHOW KEYS command ? (I was asking about the
latter)

About your comment that "EVEN KEYS WHICH ARE NOT EQUAL,
will end up causing linked lists".
This is not the case - in any linked list all keys have the same hash value.
But I cannot really explain why, because you seem to understand how our
hash works - you described it correctly in your comment
from [2005-01-16 20:32:05]. So, I don't see why you think that "even keys
which are not equal will end up causing linked lists".

Regards,
[19 Jan 2005 21:04] Shelby Moore
First let me state that my understanding of the hashing algorithm is purely theoretical,
based on studying the source code in mysys/hash.c, meaning I have not actually studied
the actual values in hash tables and linked lists in a running application.  I think my
understanding is more correct than conclusions from anecdotal snapshots of real data.

Perhaps it is easiest to explain the defense (logic) of my assertion if I ask some
concise (bordering on rhetorical) questions about the algorithm currently used:

(1) Is it possible that the hash function can map non-duplicate KEY values to same hash
value (i.e. same hash table bucket)?

(2) Do we agree that the # of hash table buckets == # of records?

(3) In what bucket is HASH_LINK placed for a KEY if the hash value for that KEY is a
duplicate (i.e. the hash value of the KEY is the same as the hash value of another KEY
already in the index)?

The only possible answer to #3 is that it must be that it is placed into a bucket which
does not match the hash value and the linked list from the bucket with correct hash value
points to the bucket it is placed in.

(4) So then in what bucket is HASH_LINK placed for a KEY if the hash value for that KEY
is the bucket which the key in #3 was placed?

There are only two possible answers (at this point in logic flow):

   (4-a) The KEY from #3 is moved to another bucket (and the linked list from first
duplicate hash value follows it) so that this #4 KEY can be placed in the bucket which
matches it's hash value.

   (4-b) This new #4 KEY is placed into a bucket that does not match it's hash value and
linked list from #3 KEY will point to it.  Obviously this would not be allowed, because
then the #4 KEY would not be first in linked list and the #3 KEY hash does not match
bucket (an obvious requirement of hash_search()).  So only (4-a) is correct answer.

(5) To which bucket do we move #3 KEY to?

The only possible answer is to move it to a bucket where the hash value of existing KEY
in the bucket does not match the bucket.  Such a bucket will always exist because the #
of buckets == # of records.

Conclusion: is that I agree with you that any linked lists will have KEYs that give same
hash value (meaning I was incorrect if I stated otherwise).  However, I disagree with you
that KEYs with duplicate hash values will not create linked lists.  Thus the slowdown is
proportional to the duplication of hash values, not only duplicate KEY values.

For my application, the Cardinality == # of records, because it uses only UNIQUE KEY, so
all KEY values are distinct.
[28 Jan 2005 19:40] Sergei Golubchik
Sorry for delayed reply :(

Your understanding is very good! Here're my answers:

(1) Yes, of course. One cannot have a perfect hash for an arbitrary
    unknown in advance dataset.

(2) Yes.

(3) Correct.

(4) Correct, it's (4-a)

(5) Correct.

Looks like earlier it was some misunderstanding.
I did not try to say that "KEYs with duplicate hash values will not
create linked lists". They will, of course. It's cardinality of hash
values what is really important, not cardinality of key values.
(that's why I asked about "Cardinality" as reported by SHOW INDEX - as it
depends on hash values, not key values).

The manual speaks about "duplicated key values" because a user cannot
really check whether he has duplicated hash values or not, he can only
see key values. And of course if there's high duplication in *key* values,
no hashing can fix the situation.
[29 Jan 2005 11:07] Shelby Moore
Thanks.

Thus we still need 3 improvements:

(1) Add a footnote to manual that slowdown is really caused by duplication (inverse
cardinality) of hashes, but which can not be seen and is often approximated by inverse
cardinality of KEY values.

(2) Test and use the faster hash function I provided above, if it is better, because the
hash function has to run twice on querys that have duplicate hashs (more than one
HASH_LINK linked from bucket).

(3) Consider in future, implementing MTF (as a compile option?) as it imparoves queries
on duplicates, by moving MRU to top of HASH_LINK list.

The reason #1 is needed is because otherwise users do not understand that they can get a
slowdown even with non-duplicate KEY values if the values exhibit certain periodic
frequencies.  And also to prevent people like me from asking the same question again. :)