Key Cache changes for Bug #17332
================================

Why more problems
-----------------

- Coverage testing. As agreed in the last review, I tried to achieve better coverage. I added stuff to my test script. This revealed new crashes. It was agreed not to spend too much time on the coverage tests, but due to the new crashes, I deemed it necessary to fix everything I could find. It took some rounds until this was done.

- Index corruptions. As soon as the crashes were fixed, index corruptions popped up. They happened only with resizing. I tried hard to find the reason. I was about to give up and call for help. But then I tried to block all attempts to bypass the cache while flushing for resize. This helped. Then I enabled parallel reads. This worked too. Enabling parallel writes again brought back the problems. Thinking about it, I found a problem in the "resize" branch in find_key_block(). I solved it, and now everything works as it should.

Status of Testing
-----------------

I ran day-long to week-long tests on qudita2 (Quad Itanium), aix52 (Dual Power4), melody (Dual AMD 64), and two local uniprocessor machines. melody died quite often. But when it ran, it found the problems the fastest in most cases.

The exact test script is attached to the bug report. It uses four tables and four threads per table (16 in total) plus a thread resizing the key cache. The 16 threads run a mix of INSERT (concurrent), UPDATE, FLUSH TABLE, CHECK TABLE, CREATE TEMPORARY TABLE ... SELECT, and LOAD INDEX INTO CACHE. The tables have one auto_increment primary key (1K index block size) and three char columns. They have three additional indexes of three to four columns each, with different orders of the char columns (2K index block size each).

The test runs an infinite number of "rounds". In every round the 16 threads run a loop of 50 iterations. Every iteration runs one INSERT with randomly up to 512 rows. On average every fifth iteration also runs an UPDATE.
On average six of seven UPDATEs change every thirteenth row to something else in all but the primary key columns. The seventh UPDATE changes all changed rows back. Every other aforementioned statement runs randomly in 5% of the iterations.

In every round the resizer runs a loop of randomly 100..4100 iterations. In an average of 10% of the iterations it pauses randomly for 1..3 seconds. With 10% probability it changes the cache block size. With 5% probability it toggles delay_key_write.

The script can easily be adapted to run

- with different numbers of tables and threads per table,
- with the key cache disabled,
- with a very small key cache, so that evictions with an empty LRU ring are very probable,
- with a small-sized key cache,
- with or without resizing.

After fixing all of the bugs mentioned in this document, the test ran very well with the key cache enabled and with resizing of both buffer size and block size.

To the reviewer: I tried to make a diff against the last reviewed patch. It was about the same size as the new patch. :( So I didn't commit it, to avoid confusion.

In the code I have some comments marked "Comment to be deleted". They are useful for reviewing, but shall be removed before pushing.

If you do not have the time to read all the material meant to go into the internals manual, you can search for ".., .., etc. You might, however, want to check my new description of hash_links vs. block_links in the chapter "Data Structures".

=============================================================
=============================================================
The Key Cache (start of a chapter for the internals document)
=============================================================
=============================================================

Introduction
============

The purpose of the MySQL Key Cache is to hold parts (or all) of one or more MyISAM index files in memory.
Whenever MyISAM requests some portion of file data, the key cache fetches the enclosing file block(s) into one (or more) cache block(s). When MyISAM writes, the cache may need to fetch the file blocks that are affected by the new data before overwriting the file. The fetch is omitted only if full file blocks are overwritten. Of course, file blocks that are already in the cache don't need to be read again.

Normally, at the end of each SQL statement, the changed blocks for every affected index file are flushed (written to the file) but kept in the cache for later use. This per-statement flush can be suppressed by the table option DELAY_KEY_WRITE, the system or session variable delay_key_write, or the startup option --delay-key-write.

When the last thread closes a MyISAM table, the MyISAM table share is closed too. In this case the changed blocks for the index file are flushed (written to the file) and freed. This cannot be suppressed.

Nomenclature
============

Index Blocks, File Blocks, and Cache Blocks
-------------------------------------------

The MyISAM index file is organized in "index blocks". The key cache does not distinguish between leaf blocks and branch blocks. "Index blocks" can be of different sizes even within the same file.

The cached data are organized in "cache blocks". A "cache block" is a contiguous range of memory addresses. Its purpose is to hold data from the index file.

The blocks of the index file whose data are copied into the "cache blocks" are called "file blocks". "File blocks" and "cache blocks" are of the same size. This size is fixed per key cache. The first "file block" starts at file offset zero. The "file blocks" lie in the file without gaps and without overlap. Since the file size does not need to be a multiple of the "file block" size, the last "file block" can be partially filled with file data.

Since "file blocks" and "cache blocks" are tightly coupled, I often speak of "blocks" where the distinction is obvious or unimportant.
However, there is one important difference. Usually the number of configured "cache blocks" does not match the number of "file blocks". This is especially true because the file can grow over time, while the number of "cache blocks" remains constant until it is explicitly reconfigured. The most important case is that the number of "cache blocks" is smaller than the number of "file blocks". In this case it can happen that all "cache blocks" are assigned to "file blocks" when another "file block" needs to be accessed. Then one of the cached "file blocks" needs to be "evicted" from the cache. To select which "block" to "evict", a "least recently used" ("LRU") mechanism is used.

"File blocks" and "index blocks" do not necessarily match. They need not be of the same size, nor do they need to be aligned. An "index block" can start in the middle of a "file block" and spread over two or more "file blocks". A "file block" can start in the middle of an "index block" and spread over two or more "index blocks".

However, by default, "file blocks" have the size of the smallest possible "index block" (1K). Since "index blocks" are aligned to their smallest possible size (1K), in most cases "file blocks" are aligned to "index blocks" and don't spread over two or more of them. However, it is common for "index blocks" to be bigger than 1K. So a "file block" often contains only a fragment of an "index block".

If "file blocks" are configured bigger than the smallest "index block" size, then it can happen that a "file block" contains data from more than one index. But the data are from the same file and thus from the same table. So we could have data from different indexes of the same table in one "file block".

The notion of "index blocks" is irrelevant for the key cache, however. It would work for record accesses of arbitrary lengths too. Whenever the key cache is asked for an arbitrary fragment of data, it loads the "file block(s)" which surround that fragment.
To emphasize its generality, I will omit references to the "index file" and speak of a cached "file" only.

File Descriptor, Position, and File/Pos
---------------------------------------

A "file descriptor" is a non-negative integer value. It is a unique descriptor of an open file in a process. However, in theory, multiple file descriptors can reference the same file. The key cache has no mechanism to detect this uncommon situation. It relies completely on the caller to ensure that there is exactly one file descriptor per cached file at any given point in time. If there were two or more file descriptors for the same file, multiple copies of the same file block could exist in the cache. If one cache block were changed (dirty), the other block(s) would not notice it and would deliver stale data to their callers. If two or more cache blocks of the same file block were dirty, changes would be lost because the last flushed dirty block would overwrite the formerly flushed dirty blocks in the file.

A certain "file block" within a certain file is identified by the "file descriptor" and the "position" of the block within this file. The "position" is the offset of the block from the file beginning in bytes. The identifying combination of "file descriptor" and "position" is often written as "file/pos".

Write Burst
-----------

When the dirty blocks of a file are "flushed" (written to the file), the changed blocks for this file are collected in an array, which is ordered by ascending "position". The blocks in the array are then written to the file in a row. I call this a "write burst". The array is limited in size. A big file with many changed blocks may require multiple "write bursts" when it is flushed.

Data Structures
===============

The root structure for every key cache is 'KEY_CACHE'. The default key cache is always present. Its root structure is the static variable 'dflt_key_cache_var', referenced by the pointer 'dflt_key_cache'.
A running MySQL server can have more key caches. Their root structures are allocated when they are created.

For every file block that is in the cache, or is requested to go into the cache, there exists a 'hash_link' structure that holds the file descriptor and the position of the block in the file. It can be seen as a representative of the file block.

For every cache block there exists a 'block_link' structure and a block buffer. A cache block which is in use is assigned to a file block (hash_link). The hash_link points back to the block_link. During eviction of a file block there can be two hash_links pointing at one block_link: the one that the block is still assigned to (the block_link points back to it) and the one that is requested to go into the cache.

The structures are split to allow an efficient reassignment of cache blocks to file blocks. Assume a well-used cache, which has fewer cache blocks than the total number of file blocks in all the index files it caches. All cache blocks are in use. When a file block is requested which is not in the cache, the least recently used file block must be evicted.

Imagine two threads trying to access the same block at (almost) the same time. Note that the threads are synchronized on the cache structures by a mutex. But they release the mutex when they need to wait for I/O or for something else.
T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Gets a new hash_link for (1,10)
T1: Notices there is no free block, flushes another block (2,20)
    (assuming here that block (2,20) is dirty)
T2: Finds the hash_link for (1,10)
T2: Notices it is in flush, starts waiting
T1: Reassigns the cache block to file block (1,10)
T1: Reads data from file into the block buffer
T1: Signals T2
T2: Awakes with the file block in the cache

Now imagine the same with a common structure for file block and cache block:

T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Does not find block (1,10) in the cache
T1: Notices there is no free block, flushes another block (2,20)
    (assuming here that block (2,20) is dirty)
T2: Does not find the block in the cache
T2: Notices there is no free block, flushes another block (3,30)
    (assuming here that block (3,30) is dirty)
T1: Still does not find block (1,10) in the cache
T1: Reassigns the flushed cache block (2,20) to file block (1,10)
T1: Reads data from file into the block buffer
T2: Now finds block (1,10) in the cache, but not yet read from file
T2: Releases flushed block (3,30) back to the LRU ring, starts waiting
T1: Signals T2
T2: Awakes with the file block in the cache

This way we would have needed to search for the block in the cache twice per thread and would have done an unnecessary I/O (the flush of (3,30)). However, this applies only if the least recently used blocks are dirty. If T1 found the least recently used block clean, it would immediately reassign it and start reading. T2 would find the block and wait. A background thread could flush the least recently used blocks regularly to minimize the mentioned overhead. But that's all theory. We have the split between hash_link and block_link. This seems to be the most efficient solution for this problem.

The hash_links are organized in a hash structure (hence the name). Requests for file data are done by file/pos.
Through the hash the right hash_link can be found quickly. Free hash_links are kept in the free_hash_list chain.

The block_links are organized in a couple of structures. The most prominent one is the LRU ring. A block can also be in the free_block_list chain, in the changed_blocks hash, or in the file_blocks hash. The latter three are mutually exclusive. A block is either free or contains valid changed or unchanged data. The first two are also mutually exclusive. A free block cannot be in the LRU ring and vice versa.

The linkage of block_links is done as follows:

- The LRU ring uses (**prev_used, *next_used).
- The free_block_list uses (*next_used).
- The chains of the changed_blocks and file_blocks hashes use (**prev_changed, *next_changed).

Blocks that are in the LRU ring are always also in either the changed_blocks hash or the file_blocks hash. The reverse is not true. Blocks are removed from the LRU ring when they are requested for use (read, write, flush) and linked back after use. But they are still in one of the hashes.

Block status
------------

Blocks can have a combination of zero or more of the bits below:

- BLOCK_ERROR
  An error occurred when performing file I/O.
- BLOCK_READ
  The cache block contains a copy of the file block.
- BLOCK_IN_SWITCH
  The block is selected for eviction.
- BLOCK_REASSIGNED
  The block is selected for eviction or to be freed.
- BLOCK_IN_FLUSH
  The block is selected for flush.
- BLOCK_CHANGED
  The cache block contents have been changed relative to the file block contents. The file block contents are wrong until the block is flushed.
- BLOCK_IN_USE
  The block is not unused (neither free nor never used). This is needed for debugging only.
- BLOCK_IN_EVICTION
  The block is selected for eviction. This happens only if the LRU ring was empty when a new block was requested. See the section "BLOCK_IN_EVICTION" for further explanation.
- BLOCK_IN_FLUSHWRITE
  A flush operation is currently writing this block. A writer must wait until it is done.
- BLOCK_FOR_UPDATE
  The buffer contents are currently being changed. A flusher must wait until this is done.

The requirements for the new flags (except BLOCK_IN_USE) are explained later.

Locking
=======

All key cache locking is done with a single mutex per key cache: keycache->cache_lock. This mutex is locked almost all the time when executing code in mf_keycache.c. However, it is released for I/O and some copy operations.

The cache_lock is also released when waiting for some event. Waiting and signalling is done via condition variables. In most cases the thread waits on its thread->suspend condition variable. Every thread has a my_thread_var structure, which contains this variable and a '*next' and '**prev' pointer. These pointers are used to insert the thread into a wait queue.

NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only while it is waiting.

Before starting to wait on its condition variable with pthread_cond_wait(), the thread enters itself into a specific wait queue with link_into_queue() (double linked with '*next' + '**prev') or wait_on_queue() (single linked with '*next').

Another thread, when releasing a resource, looks up the waiting thread in the related wait queue. It sends a signal with pthread_cond_signal() to the waiting thread.

NOTE: Depending on the particular wait situation, either the sending thread removes the waiting thread from the wait queue with unlink_from_queue() or release_whole_queue() respectively, or the waiting thread removes itself.

There is one exception to this locking scheme. Each block has a reference to a condition variable (condvar). It holds a reference to the thread->suspend condition variable of the thread that is waiting for the block. When that thread is signalled, the reference is cleared. This is similar to the above, but it clearly means that only one thread can wait for a particular block. There is no queue in this case.
Strangely enough, block->condvar is used for waiting for the assigned hash_link only. More precisely, it is used to wait for all requests to be unregistered from the assigned hash_link.

Wait queues
===========

When a thread needs to wait for another thread, it links its my_thread_var structure into a wait queue and waits on its 'suspend' condition variable. There are two types of wait queues:

- Double linked (**prev, *next) in a ring. This allows a thread to be unlinked from the middle of the ring. The operations are:
  - link_into_queue()
  - unlink_from_queue()
- Single linked (*next) in a NULL-terminated chain. Only the thread at the chain head can be unlinked. The operations are:
  - wait_on_queue()
  - release_whole_queue(), which unlinks all threads and signals each.

The other thread, when releasing the resource, looks up a waiting thread in the related wait queue. It sends a signal to the waiting thread and removes it from the wait queue with unlink_from_queue() or in release_whole_queue() respectively.

NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only while it is waiting.

However, there was an anomaly. The resize_queue contains threads that are waiting to resize the key cache. Only one thread at a time can resize the cache. However, in the original implementation the thread that got control for resizing was not removed from the queue. Consequently it must not enter any other wait queue. But this happened even in the original implementation. In flush_key_blocks_int(), when one or more blocks are in eviction, the thread waits until their eviction is complete. The same happens for blocks in flush by another thread in the new design. In both cases the resize_queue was broken. I guess the resizing thread was left in the resize_queue so that there was a way to wake it when it waited for the last threads to leave the cache after the flush was complete.
But this means that the resize_queue served two purposes. It contained threads waiting for resize and one thread waiting for other threads to leave the cache.

The fix for this is to have two queues for the resizing. The thread that starts resizing is now unlinked from the resize_queue. After flushing, if it needs to wait for other threads to complete their I/O, it is linked into the new waiting_for_resize_cnt queue. Since there can be only one thread in this queue, a simple pointer would have done the trick. But I wanted to have the same set of operations for all wait/signal instances. And in the end the queues are implemented by one 'last_thread' pointer only; just linking and unlinking are a bit more complex than necessary.

The resize_queue serves two purposes:

1. Threads that want to do a resize wait there if in_resize is set. This is not used in the server. The server refuses a second resize request if one is already active. keycache->in_init is used for the synchronization. See set_var.cc.

2. Threads that want to access blocks during resize wait here during the re-initialization phase.

When the resize is done, all threads on the queue are signalled. Hypothetical resizers can compete for resizing, and read/write requests will restart requesting blocks from the freshly resized cache. If the cache has been resized too small, it is disabled and 'can_be_used' is false. In this case read/write requests bypass the cache. Since they increment and decrement 'cnt_for_resize_op', the next resizer can wait on the queue 'waiting_for_resize_cnt' until all I/O is finished.

Instead of specially implemented wait/signal constructions, I now use wait_on_queue()/release_whole_queue() like almost everywhere else. As agreed during review, I combined add_to_queue() and the always-following wait loops into a new function wait_on_queue(). I also renamed release_queue() to release_whole_queue() to make it clearer that all entries of the queue are released.
The function now checks for an empty queue internally. Both functions are defined empty if THREAD is not defined. So it is safe to just call the functions in the code without caring about THREAD or about whether the queue is empty.

The queue waiting_for_resize_cnt is used by a resizer to wait for pending I/O after the flush phase. No I/O should be active when memory for the cache structures is re-allocated and/or the key_cache_block_size is changed. I/O operations start with inc_counter_for_resize_op() and end with dec_counter_for_resize_op(). The last I/O, which decrements cnt_for_resize_op to zero, wakes the resizer. Synchronization on this queue is done with wait_on_queue()/release_whole_queue().

The queue waiting_for_hash_link is used in the very improbable case that a thread wants to access a file block which is not in the cache while all hash_links are already in use.

The queue waiting_for_block is used when a thread wants to access a file block which is not in the cache and the LRU ring is empty. This means that all blocks are currently in use for read, write, and/or flush.

Every block has two wait queues:

block->wqueue[COND_FOR_REQUESTED] is used to synchronize secondary requestors of a block. This happens when more than one thread wants to access the same file block before it has been read into the block buffer. The first thread that requests the file block acquires a cache block and starts reading the file block into it. Before it starts to read, it releases the cache_lock. This gives other threads the chance to request the same file block. These need to wait on block->wqueue[COND_FOR_REQUESTED] until the read is complete.

block->wqueue[COND_FOR_REQUESTED] is also used by threads that flush cache blocks. It might happen that a flush is due while another thread is writing into the cache block. The flusher has to wait until this is complete.
block->wqueue[COND_FOR_SAVED] is used to synchronize threads that wait for a block to finish flushing or eviction, or to become free.

Access Methods
==============

There are five operations that request one or more blocks:

- key_cache_read()
- key_cache_insert()
- key_cache_write()
- flush_key_blocks()
- resize_key_cache()

When one of the first three requests a block, its hash_link is marked in use (hash_link->requests++). After they are done with the block, they release the hash_link and signal threads that may be waiting for it (remove_reader()). Waiting for hash_link->requests happens during eviction and freeing (wait_for_readers()).

There was an anomaly. key_cache_write() did not call remove_reader() like the other functions, but decremented hash_link->requests directly and did not wake waiting threads. This is safe as long as key_cache_write() does not need to read a part of the cache block from file. In this case it does not release the cache_lock and no other thread can grab the block for eviction or free. If the block is marked for eviction or free when key_cache_write() requests it, find_key_block() waits until the eviction or free is complete and evicts another block. If key_cache_write() again does not need to read a part of the block, it replaces its contents and decrements hash_link->requests without releasing the cache_lock. However, if only a part of the cache block is to be replaced, and the rest needs to be read from file, then the cache_lock is released for I/O and it is possible that another thread wants to evict or free the block and waits for it to be released. key_cache_write() now calls remove_reader() like the other functions.

hash_link->requests protects a block against reassignment and free. As long as a block points to a hash_link with a request, eviction and free wait for the request to go away before the final reassignment/free. hash_link->requests does not protect a block against flush.
It protects neither against explicit flush nor against implicit flush during eviction.

A similar thing is block->requests. It counts how many threads want to do something with a block. Whenever a thread wants to read, write, insert (load), flush, or free a block, it increments the count and takes the block off the LRU ring. This means that the block is implicitly protected against eviction. After a thread is done with a block, it decrements block->requests. The thread that decrements it to zero puts the block back into the LRU ring (unless it was a 'free' operation).

block->requests protects a block against eviction, even against being selected for eviction. block->requests does not protect a block against flush or free.

In summary, the three standard functions mentioned above implicitly increment both hash_link->requests and block->requests. This protects the blocks against reuse for other purposes. Only flush is allowed in parallel. But flush neither modifies the block contents nor reassigns the block for different purposes.

However, the key cache does not prohibit two write operations from changing the same block at the same time. As mentioned above, there is a single cache_lock (mutex). So the blocks cannot be modified at exactly the same time. But if a write operation changes only a part of the cache block, a read of the rest might be necessary. In this case the cache_lock is released and the other write could do its work. So be warned that proper locking is required in the caller of the key cache operations.

SERIALIZED_READ_FROM_CACHE
--------------------------

It turned out to be no problem to release the cache_lock even while actually copying new contents into the block buffer in key_cache_write().

BLOCK_READ
----------

read_block() and key_cache_insert() set block->status to BLOCK_READ and cleared all other bits. So the information was lost if free() had been called for this block (BLOCK_REASSIGNED was set).
The fix for this is to add the BLOCK_READ bit to block->status without clearing other bits.

offset in key_cache_read()
--------------------------

If keycache->can_be_used is TRUE when entering the function (with the old code), but FALSE after taking the cache_lock (this applies to the new code too), offset can be non-zero when jumping to 'no_key_cache'. In the direct my_pread(), filepos+offset was used. I removed "+offset" and moved the declaration of 'offset' into the inner block. Offset must not be used on the first iteration (the original buffer, length, and filepos values must be used). And it is zero on any later iteration anyway.

length/read_length in key_cache_write()
---------------------------------------

During resize, when find_key_block() returned NULL because we have to write to the file directly, we used 'length' to write the rest of 'buff'. This could mean that we wrote past the end of the block. Though this would not hurt, as we don't write wrong contents, we still should not do it. The next block could be in the cache, and if it is dirty, we need to overwrite its contents anyway. I changed my_pwrite(..., length, ...) to my_pwrite(..., read_length, ...). So I write only until the end of the block during this iteration. If there are more data in 'buff', they will be taken care of in the next iteration, handling a cached block if appropriate.

keycache->can_be_used
---------------------

key_cache_read(), key_cache_insert(), and key_cache_write() used this. If it was false (not set, clear, zero), these functions accessed the index file directly and didn't use a mutex. This was checked outside of the cache_lock. I changed the functions to take the lock if the cache is initialized. They do this before the loop now. They do not release the lock for every iteration. They hold the lock except while doing I/O or copying data from the block buffer. After entering the lock, they check for a resize in progress.
If the flush phase is over, they wait for completion of the resize (the re-initialization). After the flush phase the resizer waits for I/Os to complete. It would not be nice to start another I/O at this moment; otherwise the resizer could easily starve. Thereafter 'cnt_for_resize_op' is incremented so that another resizer can wait for this I/O. The count is decremented before the function returns. Incrementing/decrementing is no longer done in every loop iteration. But it is always done under the cache_lock. So the code after "no_key_cache:" looks somewhat complicated. Since the count is incremented/decremented within the cache_lock and every I/O is counted, the resizer can wait for pending I/O before starting the re-initialization.

keycache->can_be_used is no longer changed during resize. It is set when the cache is [re-]initialized with proper parameters. It is cleared when the cache is [re-]initialized too small and thus disabled. The semantics of keycache->can_be_used is the same as (disk_blocks > 0). I propose to get rid of it. Also there are keycache->disk_blocks and keycache->blocks, both 'int'. I propose to get rid of one of them.

PAGE_WAIT_TO_BE_READ
--------------------

In key_cache_insert() and key_cache_write(), if page_st is PAGE_WAIT_TO_BE_READ, then we must wait for the block in switch to be reassigned and read by another thread. Otherwise we

1. might not know the correct hash_link to unregister our request from,
2. might write new data into the wrong block, or
3. might write into a block that is read from file afterwards.

Proposals:

- Change the 'wrmode' parameter of find_key_block() to 'mode'. Have modes MODE_READ, MODE_WRITE, and MODE_INSERT. For the latter, return NULL if the block is already in the cache or another thread is already reading it. Unregister the hash_link request in this case.

- In cases where PAGE_WAIT_TO_BE_READ would be returned, wait in find_key_block() until the block is read.
The respective branch in read_block(), as well as all respective handling in key_cache_read(), key_cache_write(), and key_cache_insert(), can then be dropped.

Resizing
========

Resizing consists of two major phases:

- The flush phase: flush all dirty cache blocks and free all blocks.
- The re-initialization phase: re-allocate memory structures and set new parameters, most notably key_cache_block_size.

The basic operation is as follows:

1. the cache_lock (mutex) is taken,
2. the cache is marked in resize and in flush for resize,
3. all dirty (changed) blocks are flushed,
4. all blocks are freed,
5. the cache is unmarked flush for resize,
6. the resizer waits for I/O that bypassed the cache in the flush phase,
7. all cache structures (except the root KEY_CACHE and the mutex) are freed,
8. the cache structures are reallocated with the new size,
9. the cache structures are initialized so that the cache is usable again,
10. the new parameters are set,
11. all threads waiting for the end of the resize are awakened, and
12. the cache_lock is released.

Resizing the key cache is implemented in an online fashion. Normal MyISAM operation is not stopped.

Resizing 1: no new writes, but new reads
----------------------------------------

It may not be a good idea to disallow new changed blocks in the cache while allowing new clean blocks. (This was the old strategy.) On a write request we tried to find the block in the cache. If it was not there, we immediately went to write directly. If it was present and clean, we freed the block and wrote directly. If it was present and dirty, we worked on the dirty block (this was new with my former patch; formerly we even freed it!). When going to write directly, we release the cache_lock. This allows another thread to read the same block. It will not find it in the cache, but it may find a free block or evict a clean block. In both cases it can start reading the block from file without waiting for anything first.
If the second thread is "in luck", it reads the block (with old contents) from file before the first thread has even started to write. When the first thread completes its write, we have a new block in file and an old block in cache. If another change of this block is due before the resizing finishes by freeing all blocks, this change will be done based on the old contents from the cache. So IMHO we should not allow new clean blocks into the cache during resize either. Or we should allow new dirty blocks to go in during resize too. But the latter could make resizing an infinite process.

Note that no two threads can access the same index block at the same time because they have to have a "key_root_lock" on the index. A new read of an index block should only be possible after the other thread has completed all writes and unlocked the "key_root_lock". However, if key_cache_block_size is big enough to contain data from two or more indexes, this can fail badly.

I implemented it so: During resize

- read requests for a non-cached block read from file directly,
- write requests for a non-cached block write to file directly,
- read requests for a dirty block use the block,
- write requests for a dirty block use the block,
- read requests for a clean block use the block,
- write requests for a clean block free the block and write to file directly.

Resizing 2: flush_key_blocks_int() and key_cache_write()
--------------------------------------------------------

After a week-long struggle with index corruptions I noticed that the parallelism between flush and write of the same block could lead to inconsistent buffer contents in the index file. If the write modifies the buffer contents while it is being written (flushed) to file, then the write finishes first, because it did not release the cache_lock. (Even if it did, it set BLOCK_CHANGED before modifying the buffer.) When the flusher acquires the cache_lock after the write, it clears BLOCK_CHANGED.
This allows the block to be evicted or freed without another flush. The change of the last writer is lost. In the worst case parts of the changes are included, leaving an inconsistent block behind.

The above could even happen in normal cache operation, even with 1K cache block size. The normal flush after a statement takes place after the thr_lock has been released. Another thread can modify the cache blocks while the flush is going on.

Different situation, same cause: Assume a cache with all blocks clean. A writer evicts a block. It wants to overwrite only a part of the block, so it has to read the new block first. It releases the cache_lock for reading. A resizer sets resize_in_flush. It does not find dirty blocks. It starts to free all blocks. When it comes to the one the writer is reading, it has to wait until the writer releases the block. The writer awakes from reading, gets the cache_lock, and wakes the waiting resizer. The resizer awakes inside free_block(). It frees the dirty block.

I fixed these with a pair of block status flags: BLOCK_IN_FLUSHWRITE and BLOCK_FOR_UPDATE. The former is set by a flusher right before it releases the cache_lock for writing the block. The latter is set by a writer before it releases the cache_lock. Both check for the other flag and wait on block->wqueue[COND_FOR_REQUESTED] (flusher) or block->wqueue[COND_FOR_SAVED] (writer).

BLOCK_IN_FLUSHWRITE: This is set by the flusher right before the cache_lock is released for writing. It is cleared right after writing. key_cache_write() can proceed if BLOCK_IN_FLUSH is set but not if BLOCK_IN_FLUSHWRITE is set. If BLOCK_IN_FLUSHWRITE is set, it waits on block->wqueue[COND_FOR_SAVED].

BLOCK_FOR_UPDATE: This is set in key_cache_write() as soon as possible before the cache_lock is released for waiting or for copying the buffer contents. It is cleared after the buffer is finally changed. It is used by a flusher to skip the block in a flush burst.
flush_key_blocks_int() repeats its search for changed blocks and will finally find the skipped block again. To avoid an infinite loop, it remembers one of the blocks marked BLOCK_FOR_UPDATE and waits on block->wqueue[COND_FOR_REQUESTED] for its modification to complete. When signalling COND_FOR_REQUESTED, key_cache_write() does not wake threads waiting for the block to be flushed. BLOCK_FOR_UPDATE is also used to prevent free_block() from being called. The block requires a(nother) flush before being freed.

Resizing 3: BLOCK_IN_FLUSH
--------------------------

We initially had a loop over the *number* of changed blocks when flushing. One block from the LRU ring was checked for its file descriptor and all blocks for this file were flushed. All blocks were selected from the changed_blocks hash that belonged to this file. These blocks were marked for flush and put in an array for a write burst. The array was sorted by increasing write position. Every block was written. Before every write the cache_lock was released and taken back after the write. Only during these write periods were parallel cache operations possible. After taking back the mutex, all threads waiting for the block were signalled and the block was freed.

One big problem here was that it was not checked if the block had already been marked for flush before. At the end of most writing SQL statements, during unlock of a MyISAM table, or during the final close of the table, the index blocks for this table are flushed. This flush can happen in parallel with the big flush from the cache resizing. So it was possible that two threads flushed the same blocks. During the write burst, after freeing the flushed block and before writing the next one, no check was done if the block had been freed in between.

Another place where changed blocks are flushed is cache eviction. A block with changed data may be selected for eviction (***). It has to be flushed before it can be reused for another file block.
In this situation it was neither checked if the block was marked for flush already, nor was it marked before writing. But during the write for eviction the cache_lock was released too. In this situation the "flusher" could have completed its write and freed the block.

Resizing 4: BLOCK_IN_EVICTION
-----------------------------

(***) CORRECTION: When a block is marked BLOCK_IN_FLUSH by a flushing thread, a request is registered against the block. This unlinks it from the LRU ring so that it cannot be selected for eviction. If it had been selected for eviction before, it had been marked BLOCK_IN_SWITCH and thus would not be selected for flush.

However, there was one exception. If the LRU ring was empty and a block was requested for eviction, then the next block that is put on the LRU ring is immediately attached to the requesting hash_link and the requesting thread is signalled. From this moment until the requesting thread awakes, a flusher could select the block for flush. The block was not marked in any way during this interval. Though it is not on the LRU ring, it could still be selected for flush. If it is unchanged, it could even immediately be freed by the flusher. When the evicter awakes, it could find a block in flush, a free block, or even a block which has been reused for something else after being free.

NOTE: This horrible scenario can only happen if the LRU ring got empty. In normal operation this happens only if there are fewer cache blocks than threads. Every thread works on one block at a time only. This applies to key_cache_read(), key_cache_insert() and key_cache_write(). An exception is flush_key_blocks(). During flush a thread can take a lot of blocks from the LRU ring. The same goes for resize_key_cache(). I fixed the bug with a new status flag, BLOCK_IN_EVICTION. See the section BLOCK_IN_EVICTION below.

Resizing 5: flush_all_key_blocks()
----------------------------------

The number of changed blocks was not reliable.
It was not safe to use it as an indicator whether the flush was complete. I saw infinite loops. The outer loop thought that there should be changed blocks left, but the inner loop didn't find any. Also, blocks are not always in the LRU ring. I saw crashes due to an empty LRU ring but with a non-zero number of changed blocks. However we could not have an infinite loop here, as flushing included freeing of clean blocks for the file too. A second peek in the LRU ring could not see the same block again.

The fix for this is to loop over all chains of the changed_blocks hash until they all are empty, then free all clean blocks. Changed blocks are always in the changed_blocks hash (***). The additional advantage of this two-step approach is that the flushed blocks do not need to be freed in the first step. They can still be used for reading until they are freed in the second step.

Resizing 6: flush_key_blocks_int()
----------------------------------

(***) There is one exception. Dirty blocks in eviction, which need to be flushed, are temporarily put into a "first_in_switch" chain within flush_key_blocks_int(). But since flush_key_blocks_int() waits until their eviction is done (which makes them clean blocks), this is irrelevant here. Unfortunately there is another chance for failure. A second flusher could move the block from that "first_in_switch" chain to its own "first_in_switch" chain. The first flusher could then believe all changed blocks of "its" file are clean now and finish prematurely. I strongly recommend getting rid of the temporary "first_in_switch" chain. Using BLOCK_IN_FLUSH also during eviction makes this possible.

Flushing of the same file by two threads (the table "owner" and the resizer) can also lead to a situation where one has to wait for the block and the other one frees it after use. This was not taken into account in wait_for_readers(). The fix for this is to check if the block still references a hash_link.
While freeing clean blocks, the file_blocks chain was traversed without a check if the 'next' block had been moved from its place in the chain while free_block() had to wait for readers. And it was not checked if the block had already been marked to be freed (BLOCK_REASSIGNED) by another thread.

The fix for this is two nested loops. The outer loop runs until no more blocks to be freed are found. The inner loop traverses the file_blocks chain. Whenever a block needs to be freed, the state of the 'next' block is saved and checked after free_block(). If it changed, the loop is restarted because the 'next' block may no longer be at the old position in the current chain. Also we need to skip all blocks that are marked for eviction or to be freed. It could even happen that a write started before the start of the flush. It could still be in progress on a clean block. We have to skip this too. Whenever we found something to skip, we need to restart the flush from the beginning (checking for dirty blocks). These writes could move a block from the file_blocks hash to the changed_blocks hash.

flush_key_blocks()
------------------

While waiting for the cache_lock in flush_key_blocks(), the cache could become disabled. I won't try to flush in this situation any more. I now check in line with the other global functions. At first, keycache->key_cache_initialized is checked. If this is true, the cache_lock is taken. Only then is it checked if the function can proceed.

Miscellaneous problems
======================

BLOCK_IN_EVICTION (Eviction with empty LRU ring)
------------------------------------------------

When a block needs to be evicted, the least recently used block is taken from the LRU ring. If the LRU ring is empty, the evicting thread needs to wait for a block to be put on the LRU ring. (BTW, even free_block() puts the block on the LRU ring and takes it back again without intermediate release of the cache_lock.) A block is put in the LRU ring by link_block().
Before it links the block in the LRU ring, it checks if threads are waiting for a block. If so, it does not link the block in the LRU ring, but assigns it to the hash_link of a waiting thread instead. Then it registers a request on the block for every thread waiting on the same hash_link and wakes them all. In this case nothing was set in the block_link.

The most obvious change would be to set the flag BLOCK_IN_SWITCH. But this cannot be done in link_block(). Multiple threads may be waiting for the block. Only the first one must execute the eviction. All others have to wait. The logic is so that the first thread sets BLOCK_IN_SWITCH and the others wait when they see it. The outcome was that nothing in the block indicated that it was selected for eviction until the first of the requesting threads set BLOCK_IN_SWITCH.

When the thread that called link_block() releases the cache_lock, another thread continues to run. This could be any thread other than one of those waiting for the selected block. This thread could want to do all sorts of things with the selected block, for example free it. free_block() could not detect that the block had been selected for eviction. In theory it could detect the additional request(s) on the block (one request must be registered on the block anyway before it can be freed). But this was not done in the original code.

I fixed this with a new block->status flag: BLOCK_IN_EVICTION. It is set by link_block() when the block is not linked in the LRU ring, but assigned to a requesting hash_link. I check it in free_block() after unreg_request(). If set, I stop modifying the block in any way. I use that flag also to prevent calling free_block() for a block that has been selected for eviction already.

Proposal: We could save a lot of hassle in the code if the cache behaved much more stupidly in the extreme case of an empty free list, all blocks in use, *and* an empty LRU ring. If there is no block available, the thread needs to wait on waiting_for_block.
This is no change yet. But instead of special handling in link_block() and free_block(), I would just put the block in the LRU ring or in the free list respectively and wake all threads from waiting_for_block. The threads that awake restart their search for a block from the beginning (get hash_link, ...). The first thread that awakes finds the block and starts to use it. Later awaking threads either fail and wait again, or they are in luck if they wanted to access the same block as the other thread, or there is yet another block available meanwhile.

Admittedly, this could produce more CPU load in a situation where the system is short of cache blocks anyway. But OTOH this is a very seldom situation, and it requires a far too small configured cache. With all the bugs mentioned in this document fixed, I did not get a single gcov count on the lines that handle the empty LRU ring. Not even with an 8-block cache and 16 threads competing for them. But all the fixes around BLOCK_IN_EVICTION would be unnecessary. Also the special handling in link_block() (which does not link the block in the LRU ring in this case), and the uncommon use of thread->opt_info, which require a lot of commenting, would become obsolete.

free_block()
------------

free_block() called unlink_hash() before handling the LRU stuff mentioned above. Since we need the old hash_link when the block is handed over for eviction, I moved this call down behind all other unregs and unlinks.

LOAD INDEX INTO CACHE
---------------------

Caveats of the current implementation:

1. All indexes must have the same block size.
2. All blocks of the table are flushed with FLUSH_RELEASE first.
3. IGNORE LEAVES won't work well with big cache blocks (see below).
4. If the index file is bigger than the cache, it could evict self-loaded blocks.

Advantage of the current implementation:

- Bigger read size.

Proposal: MyISAM hands over to the key cache a list of (position, length) for the blocks to load.
The key cache computes maximum chunks to read while excluding file blocks already in the cache. The key cache stops when all blocks are used by this file, or, in other words, when as many blocks for this file have been requested as there are cache blocks. So it would not evict pages loaded by itself.

mysql_preload_keys() took a TL_READ lock. Concurrent inserts could modify index blocks in the cache, but mi_preload() reads directly from file, bypassing the cache. In theory it should not overwrite cache blocks that have valid data already. But the dirty blocks could be evicted (with flush) after mi_preload() read the data and before key_cache_insert() tried to load that block. key_cache_write() assumed that no readers of "its" block exist. Usually a "key_root_lock" assures this. Hence it did not wake readers of the block. But mi_preload() does not take a "key_root_lock" and thus could become a secondary requestor of the block that key_cache_write() was writing. This made me experience a real deadlock. (See later that key_cache_write() must also wake readers because big key cache blocks could contain data from two or more indexes.) The fix is to use TL_READ_NO_INSERT for preload. The alternative would be to acquire the "key_root_lock" for all indexes for the whole preloading.

While key_cache_insert() does not overwrite the contents of valid blocks, it did overwrite blocks with the status PAGE_WAIT_TO_BE_READ. This means blocks that are being read by another thread. This does also happen for blocks to be changed by another thread. And it could affect blocks that are still assigned to other file blocks (during flush for eviction). I fixed it so that key_cache_insert() fills blocks with PAGE_TO_BE_READ only. Pages with PAGE_WAIT_TO_BE_READ will be read by another thread anyway. Pages with PAGE_READ are in the cache already.

With key_cache_block_size > maximum index block size of this table, it was a problem that key_cache_insert() ignored existing blocks (PAGE_READ).
The data is loaded in chunks of 32K by default. It starts after the index file header. This means that for every chunk the first and last of the affected cache blocks are filled only partially. Every succeeding chunk finds its first block in the cache already (partially filled from the previous chunk), and ignores it. Because we allow reads while loading, it can happen that a reader finds one of those partially filled blocks. If it wants data from this block beyond what has been loaded so far, it reports an error.

If a block is in the cache, it must contain as much data as is available for it in the file. This means that every block except the last one of the file must be completely filled. Consequently, key_cache_insert() needs to read the block itself if the caller supplies less data than the block size. But this would clearly slow down the loading. It would happen very often in normal operation with a big key_cache_block_size. There is no way around this for LOAD INDEX ... IGNORE LEAVES. But without IGNORE LEAVES we can assure that the loaded chunks are aligned with the cache blocks. I included some test code for it (#if !defined(INGO_TEST_LOADIDX_OFF)) in mi_preload.c and a comment in mf_keycache.c. It works. But since it is not strongly needed, I left the defines in the code so that I can remove it quickly. key_cache_insert() does not try to load any index data if it detects a disabled key cache or a resize in progress.

Note that IGNORE LEAVES may not work as expected if cache blocks are bigger than the smallest index blocks. Depending on the difference in size we may not save much. The key cache cannot read partial cache blocks. If key_cache_insert() is given only a part of a cache block, it will read the whole block itself. To avoid double reading of the index file, we must always supply a full cache block to key_cache_insert(). But to check if we have branch index blocks, we must work on index blocks.
The required algorithm is more complex than the one currently existing in mi_preload().

DBUG_ASSERT()
-------------

To be able to assert more block status requirements, I needed to reorder some initializations and deinitializations.

Statistics
----------

keycache->global_cache_w_requests was counted twice during resize (in the !block branch in key_cache_write()) and not counted in myisamchk (in the special branch at the top of key_cache_write()). keycache->global_cache_read was counted for primary and secondary requests in find_key_block(). I moved this to read_block(), into the branch for primary requests where the real read is done. I do also increment it in key_cache_insert() for every requested block. The read has already been done by the caller in this case.

=============================
=============================

OPT_KEY_CACHE_BLOCK_SIZE
------------------------

This option is defined with a subtraction of MALLOC_OVERHEAD:

  {"key_cache_block_size", OPT_KEY_CACHE_BLOCK_SIZE,
   "The default size of key cache blocks",
   (gptr*) &dflt_key_cache_var.param_block_size,
   (gptr*) 0, 0, (GET_ULONG | GET_ASK_ADDR), REQUIRED_ARG,
   KEY_CACHE_BLOCK_SIZE, 512, 1024*16, MALLOC_OVERHEAD, 512, 0},

This means that an option of --key_cache_block_size=1024 results in key cache blocks of 512 bytes size. To get 1024 byte blocks one must request --key_cache_block_size=1032. With -DSAFEMALLOC it is even --key_cache_block_size=1060. I believe this is a bug. At least it is not documented.

Another thing is that we do *not* enforce it to be a power of two, just a multiple of 512. But we use constructs like this on it:

  offset= (uint) (filepos & (keycache->key_cache_block_size-1));

I verified by testing that the key cache does not work with a block size of 12K, for example. BTW, this can also be configured with a SET statement.

BLOCK_ERROR
-----------

Eviction of a changed block: If my_pwrite() fails, the block is marked BLOCK_ERROR, but reassigned anyway.
This makes the evicting thread mark 'its' table as crashed. This may be the wrong table. The block contents are not changed, but the block is assigned to the wrong table. If this block is found by another thread (who did not notice the crash of the table yet), it finds it correctly assigned to the hash_link but not marked BLOCK_READ. It concludes that it must be a block in switch and behaves as a secondary reader (PAGE_WAIT_TO_BE_READ). Fortunately BLOCK_ERROR is still set. So it wouldn't wait, but report an error.

I propose to

1. abstain from reassigning the block in error,
2. leave it in the changed_blocks hash so it is due for another attempt to flush,
3. not put it back on the LRU ring so it cannot be evicted again,
4. set BLOCK_ERROR, and
5. clear BLOCK_IN_SWITCH and let the evicter select another block.

When a thread tries to access that block (read, write, insert, flush), the error would be reported back to the caller and the right table would be marked crashed. At flush, the block would be tried again. If the write fails again, it would be freed (FLUSH_RELEASE) or left untouched (FLUSH_KEEP). If the error flush happens in a resize operation, the cache should stay in the 'resize_in_flush' state until all erroneous blocks are flushed by the right thread(s). Direct I/O is allowed in this state.

A potential problem with my proposal could arise if all blocks get marked BLOCK_ERROR. All attempts to evict a block would stall. A detection of this situation could be implemented. Any attempt to evict a block would then be redirected to bypass the cache. To provoke these errors, error injection is required.

global_blocks_changed
---------------------

Used for SHOW STATUS like 'Key_blocks_not_flushed'. It is incremented/decremented together with blocks_changed, so it shows the current number of dirty blocks. But global_blocks_changed is cleared in reset_key_cache_counters(), so it can wrap around (underflow) on the next decrement.
And global_blocks_changed is not cleared in end_key_cache(). This should be safe if it was counted correctly and not reset by reset_key_cache_counters(). Otherwise a wrong count could survive resize_key_cache() if a too small cache size has been requested. This sounds hypothetical, but I saw:

| Key_blocks_not_flushed | 103067 |
| Key_blocks_unused      | 0      |
| Key_blocks_used        | 29     |

in several test runs. This should not normally be possible. I found a source of incorrect counting, introduced by myself. Now SHOW STATUS looks reasonable. But my question is still valid: What do we want to show with SHOW STATUS like 'Key_blocks_not_flushed'? Is it correct to handle it differently from 'blocks_changed'? I propose to get rid of 'global_blocks_changed', use 'blocks_changed' instead, and not clear this count in reset_key_cache_counters(). Or, to get fewer changes elsewhere in the server, get rid of 'blocks_changed' and count 'global_blocks_changed' where 'blocks_changed' is counted now.

reg_requests()
--------------

I propose to change the name to reg_block_request() to avoid confusion with hash_link requests. Use the singular, as it is always called for one request only. Also it would be nice to have hash_link->h_requests and block->b_requests for easier searching of one of these variables.

remove_reader()
---------------

This function takes a 'block' as argument, though its purpose is to decrement a request from a hash_link and to wake threads waiting on the hash_link. A 'block' is needed because the condition to signal is block->condvar. This means that the function can be used only after a block is assigned to the hash_link. The mere fact that the hash_link points to a block is insufficient because it could be a block in eviction for this hash_link. In this case block->condvar cannot be used because it belongs to another hash_link. One would signal the wrong thread.
Moreover, block->condvar is only used for the purpose of waiting for the referenced hash_link to become "unrequested". The function would be more flexible if it took a 'hash_link' and the condvar were located in the hash_link. This would come in handy in key_cache_insert() when it detects that a block for the file/pos is already in eviction. It could just unregister its request and proceed. Now it needs to wait until the block is assigned before it can unregister. For a "casual reader" (or a "beginner") it might even be easier to understand what remove_reader() and wait_for_readers() really do. Additionally, it would be more in line with other wait/signal events if we used a wait queue instead of a condvar, though there should never be more than one thread waiting for a hash_link to become "unrequested". The overhead would be small, but another anomaly would vanish, thus saving a casual reader some thinking.

volatile
--------

After seeing unexplainable crashes, I declared most of the members of KEY_CACHE, HASH_LINK, and BLOCK_LINK volatile. This seemed to improve the situation. I always got reasonable output from gdb since. After I got a working patch, I removed the volatile keywords again. The system runs as stably as before. So I'll forget about volatile.

Background key cache flusher
----------------------------

Every N seconds ensure that the M least recently used blocks are "clean".

Multiple resize flushers
------------------------

Since index files could be on different disks, we may want to allow multiple threads to do the flush.

Maximum cache size
------------------

Enlarge some variables to allow key_buffer_size > 4G.

hash_links hash
---------------

Even if we make the hash as big as there are file descriptors available, we can have quite long chains of hash_links for each file. Beginning with a certain size of the key cache, it might become better to have a more complex access structure: either another hash per file, or an ordered chain with jump pointers.
We should profile get_hash_link() to see if we have a problem there.

========================
========================

Possible improvements (old suggestions)
---------------------------------------

1.) Block hashes. The changed_blocks and file_blocks hashes are indexed by file descriptor. Each of their chains can contain blocks for several files. It might be more efficient to have file_heads in these chains and blocks for one file only chained below the file_heads. This would make flushing faster. Flushing is done per file and so could use up a whole chain without caring about skipping other files' blocks.

  file_blocks[i] -> file_head(17) -> file_head(21)
                         |               |
                         v               v
                    block_link
                         |
                         v
                    block_link

The same could be done for hash_links.

  hash_root[i] -> hash_link_file(17) -> hash_link_file(21)
                         |                    |
                         v                    v
                 hash_link_pos(0)
                         |
                         v
                 hash_link_pos(1024)

You can get the same effect by just making key_cache->file_blocks[] as big as the maximum number of file descriptors.

2.) It is possible that the new algorithms make the use of the resize counter (cnt_for_resize_op) unnecessary. Flushing is done so that at its end all blocks are free. It waits for all users of pages before freeing them. So there cannot be any pages in use any more. The only place where it might be relevant is in flush_key_blocks(). This should be checked. I think the above is true and you may do another patch for this (not to be added to 5.0).

I now use this to wait for I/O bypassing the cache. MyISAM level locking is not sufficient if key_cache_block_size is bigger than the smallest index blocks. We must not restart cache operation while I/O for parts of the file blocks is still in progress. Another thread could want to access another part of the same file block and thus needs to read the whole block while a direct writer is still writing his part of the block. The block contents would be unpredictable. I believe this was also the idea of the original code. It did also wait for cnt_for_resize_op between the flush phase and the re-initialization.
However, it did not count direct I/O.

Concerns (old)
--------------

my_thread_var: The key cache uses the condition variable of the my_thread_var structure for waiting. This could also be used elsewhere in the MySQL server. We should check that stray signals cannot confuse the key cache code at any place (that is, check if wait loops are always used correctly). And there should be a server-wide rule for how to use the 'next' and 'prev' pointers of my_thread_var. The key cache code cannot protect against abuse of these pointers. It may even be good to make the wait_for_event()/signal_event() function pair global mysys functions, to be used for all waiting/signalling that is based on my_thread_var.

On the other hand, I agree the key cache code should be safe against stray signals...