Key Cache changes for Bug #17332
================================

Why more problems
-----------------

- Coverage testing. As agreed in the last review, I tried to achieve better coverage. I added stuff to my test script. This revealed new crashes. It was agreed not to spend too much time on the coverage tests, but due to the new crashes, I deemed it necessary to fix everything I could find. It took some rounds until this was done.

- Index corruptions. As soon as the crashes were fixed, index corruptions popped up. They happened only with resizing. I tried hard to find the reason. I was about to give up and call for help. But then I tried to block all attempts to bypass the cache while flushing for resize. This helped. Then I enabled parallel reads. This worked too. Enabling parallel writes again brought back the problems. Thinking about it, I found a problem in the "resize" branch in find_key_block(). I solved it, and now everything works as it should.

Status of Testing
-----------------

I ran day-long to week-long tests on qudita2 (Quad Itanium), aix52 (Dual Power4), melody (Dual AMD 64), and two local uniprocessor machines. melody died quite often. But when it ran, it found the problems the fastest in most cases.

The exact test script is attached to the bug report. It uses four tables and four threads per table (16 in total) plus a thread resizing the key cache. The 16 threads run a mix of INSERT (concurrent), UPDATE, FLUSH TABLE, CHECK TABLE, CREATE TEMPORARY TABLE ... SELECT, and LOAD INDEX INTO CACHE. The tables have one auto_increment primary key (1K index block size) and three char columns. They have three additional indexes of three to four columns each, with different orders of the char columns (2K index block size each).

The test runs an infinite number of "rounds". In every round the 16 threads run a loop of 50 iterations. Every iteration runs one INSERT with randomly up to 512 rows. On average every fifth iteration also runs an UPDATE.
On average six of seven UPDATEs change every thirteenth row to something else in all but the primary key columns. The seventh UPDATE changes all changed rows back. Every other aforementioned statement runs randomly in 5% of the iterations.

In every round the resizer runs a loop of randomly 100..4100 iterations. In an average of 10% of the iterations it pauses randomly for 1..3 seconds. With 10% probability it changes the cache block size. With 5% probability it toggles delay_key_write.

The script can easily be adapted to run

- with different numbers of tables and threads per table,
- with the key cache disabled,
- with a very small key cache, so that evictions with an empty LRU ring are very probable,
- with a small-sized key cache,
- with or without resizing.

After fixing all of the bugs mentioned in this document, the test ran very well with the key cache enabled and with resizing of both buffer size and block size.

To the reviewer: I tried to make a diff against the last reviewed patch. It was about the same size as the new patch. :( So I didn't commit it, to avoid confusion.

In the code I have some comments marked "Comment to be deleted". They are useful for reviewing, but shall be removed before pushing.

If you do not have the time to read all the material meant to go into the internals manual, you can search for ".., .., etc. You might, however, want to check my new description of hash_links vs. block_links in the chapter "Data Structures".

=============================================================
=============================================================
The Key Cache (start of a chapter for the internals document)
=============================================================
=============================================================

Introduction
============

The purpose of the MySQL Key Cache is to hold parts (or all) of one or more MyISAM index files in memory.
Whenever MyISAM requests some portion of file data, the key cache fetches the enclosing file block(s) into one (or more) cache block(s). When MyISAM writes, the cache may need to fetch the file blocks that are affected by the new data before overwriting the file. The fetch is omitted only if full file blocks are overwritten. Of course, file blocks that are already in the cache don't need to be read again.

Normally, at the end of each SQL statement, the changed blocks for every affected index file are flushed (written to the file) but kept in the cache for later use. This per-statement flush can be suppressed by the table option DELAY_KEY_WRITE, the system or session variable delay_key_write, or the startup option --delay-key-write.

When the last thread closes a MyISAM table, the MyISAM table share is closed too. In this case the changed blocks for the index file are flushed (written to the file) and freed. This cannot be suppressed.

Nomenclature
============

Index Blocks, File Blocks, and Cache Blocks
-------------------------------------------

The MyISAM index file is organized in "index blocks". The key cache does not distinguish between leaf blocks and branch blocks. "Index blocks" can be of different sizes even within the same file.

The cached data are organized in "cache blocks". A "cache block" is a contiguous range of memory addresses. Its purpose is to hold data from the index file.

The blocks of the index file whose data are copied into the "cache blocks" are called "file blocks". "File blocks" and "cache blocks" are of the same size. This size is fixed per key cache. The first "file block" starts at file offset zero. The "file blocks" lie in the file without gaps and without overlap. Since the file size does not need to be a multiple of the "file block" size, the last "file block" can be partially filled with file data.

Since "file blocks" and "cache blocks" are tightly coupled, I often speak of "blocks" where the distinction is obvious or unimportant.
However, there is one important difference. Usually the number of configured "cache blocks" does not match the number of "file blocks". This is especially true because the file can grow over time, while the number of "cache blocks" remains constant until it is explicitly reconfigured. The most important case is that the number of "cache blocks" is smaller than the number of "file blocks". In this case it can happen that all "cache blocks" are assigned to "file blocks" when another "file block" needs to be accessed. Then one of the cached "file blocks" needs to be "evicted" from the cache. To select which "block" to "evict", a "least recently used" ("LRU") mechanism is used.

"File blocks" and "index blocks" do not necessarily match. They need not be of the same size, nor do they need to be aligned. An "index block" can start in the middle of a "file block" and spread over two or more "file blocks". A "file block" can start in the middle of an "index block" and spread over two or more "index blocks".

However, by default, "file blocks" have the size of the smallest possible "index block" (1K). Since "index blocks" are aligned to their smallest possible size (1K), in most cases "file blocks" are aligned to "index blocks" and don't spread over two or more of them. However, it is common for "index blocks" to be bigger than 1K. So a "file block" often contains only a fragment of an "index block".

If "file blocks" are configured bigger than the smallest "index block" size, then it can happen that a "file block" contains data from more than one index. But the data are from the same file and thus from the same table. So we could have data from different indexes of the same table in one "file block".

The notion of "index blocks" is irrelevant for the key cache, however. It would work for record accesses of arbitrary lengths too. Whenever the key cache is asked for an arbitrary fragment of data, it loads the "file block(s)" which surround that fragment.
To emphasize its generality, I will omit references to the "index file" and speak of a cached "file" only.

File Descriptor, Position, and File/Pos
---------------------------------------

A "file descriptor" is a non-negative integer value. It is a unique descriptor of an open file in a process. However, in theory, multiple file descriptors can reference the same file. The key cache has no mechanism to detect this uncommon situation. It relies completely on the caller to ensure that there is exactly one file descriptor per cached file at any given point in time. If there were two or more file descriptors for the same file, multiple copies of the same file block could exist in the cache. If one cache block were changed (dirty), the other block(s) would not notice it and would deliver stale data to their callers. If two or more cache blocks of the same file block were dirty, changes would be lost because the last flushed dirty block would overwrite the formerly flushed dirty blocks in the file.

A certain "file block" within a certain file is identified by the "file descriptor" and the "position" of the block within this file. The "position" is the offset of the block from the file beginning in bytes. The identifying combination of "file descriptor" and "position" is often written as "file/pos".

Write Burst
-----------

When the dirty blocks of a file are "flushed" (written to the file), the changed blocks for this file are collected in an array, which is ordered by ascending "position". The blocks in the array are then written to the file in a row. I call this a "write burst". The array is limited in size. A big file with many changed blocks may require multiple "write bursts" when it is flushed.

Data Structures
===============

The root structure for every key cache is 'KEY_CACHE'. The default key cache is always present. Its root structure is the static variable 'dflt_key_cache_var', referenced by the pointer 'dflt_key_cache'.
A running MySQL server can have more key caches. Their root structures are allocated when they are created.

For every file block that is in the cache, or is requested to go into the cache, there exists a 'hash_link' structure that holds the file descriptor and the position of the block in the file. It can be seen as a representative of the file block.

For every cache block there exists a 'block_link' structure and a block buffer. A cache block which is in use is assigned to a file block (hash_link). The hash_link points back to the block_link. During eviction of a file block there can be two hash_links pointing at one block_link: the one that the block is still assigned to (the block_link points back to it) and the one that is requested to go into the cache.

The structures are split to allow an efficient reassignment of cache blocks to file blocks. Assume a well-used cache, which has fewer cache blocks than the total number of file blocks in all the index files it caches. All cache blocks are in use. When a file block is requested which is not in the cache, the least recently used file block must be evicted.

Imagine two threads trying to access the same block at (almost) the same time. Note that the threads are synchronized on the cache structures by a mutex. But they release the mutex when they need to wait for I/O or for something else.
T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Gets a new hash_link for (1,10)
T1: Notices there is no free block, flushes another block (2,20)
    (assuming here that block (2,20) is dirty)
T2: Finds the hash_link for (1,10)
T2: Notices it is in flush, starts waiting
T1: Reassigns the cache block to file block (1,10)
T1: Reads data from file into the block buffer
T1: Signals T2
T2: Awakes with the file block in the cache

Now imagine the same with a common structure for file block and cache block:

T1: Wants to read file block (1,10)
T2: Wants to read file block (1,10)
T1: Does not find block (1,10) in the cache
T1: Notices there is no free block, flushes another block (2,20)
    (assuming here that block (2,20) is dirty)
T2: Does not find the block in the cache
T2: Notices there is no free block, flushes another block (3,30)
    (assuming here that block (3,30) is dirty)
T1: Still does not find block (1,10) in the cache
T1: Reassigns the flushed cache block (2,20) to file block (1,10)
T1: Reads data from file into the block buffer
T2: Now finds block (1,10) in the cache, but not yet read from file
T2: Releases flushed block (3,30) back to the LRU ring, starts waiting
T1: Signals T2
T2: Awakes with the file block in the cache

This way we would have needed to search for the block in the cache twice per thread and would have done an unnecessary I/O (the flush of (3,30)). However, this applies only if the least recently used blocks are dirty. If T1 found the least recently used block clean, it would immediately reassign it and start reading. T2 would find the block and wait. A background thread could flush the least recently used blocks regularly to minimize the mentioned overhead. But that's all theory. We have the split between hash_link and block_link. This seems to be the most efficient solution for this problem.

The hash_links are organized in a hash structure (hence the name). Requests for file data are done by file/pos.
Through the hash the right hash_link can be found quickly. Free hash_links are kept in the free_hash_list chain.

The block_links are organized in a couple of structures. The most prominent one is the LRU ring. A block can also be in the free_block_list chain, in the changed_blocks hash, or in the file_blocks hash. The latter three are mutually exclusive. A block is either free or contains valid changed or unchanged data. The first two are also mutually exclusive. A free block cannot be in the LRU ring and vice versa.

The linkage of block_links is done as follows:

- The LRU ring uses (**prev_used, *next_used).
- The free_block_list uses (*next_used).
- The chains of the changed_blocks and file_blocks hashes use (**prev_changed, *next_changed).

Blocks that are in the LRU ring are always also in either the changed_blocks hash or the file_blocks hash. The reverse is not true. Blocks are removed from the LRU ring when they are requested for use (read, write, flush) and linked back after use. But they are still in one of the hashes.

Block status
------------

Blocks can have a combination of zero or more of the bits below:

- BLOCK_ERROR
  An error occurred when performing file I/O.
- BLOCK_READ
  The cache block contains a copy of the file block.
- BLOCK_IN_SWITCH
  The block is selected for eviction.
- BLOCK_REASSIGNED
  The block is selected for eviction or to be freed.
- BLOCK_IN_FLUSH
  The block is selected for flush.
- BLOCK_CHANGED
  The cache block contents have been changed relative to the file block contents. The file block contents are wrong until the block is flushed.
- BLOCK_IN_USE
  The block is not unused (neither free nor never used). This is needed for debugging only.
- BLOCK_IN_EVICTION
  The block is selected for eviction. This happens only if the LRU ring was empty when a new block was requested. See the section "BLOCK_IN_EVICTION" for further explanation.
- BLOCK_IN_FLUSHWRITE
  A flush operation is currently writing this block. A writer must wait until it is done.
- BLOCK_FOR_UPDATE
  The buffer contents are currently being changed. A flusher must wait until this is done.

The requirements for the new flags (except BLOCK_IN_USE) are explained later.

Locking
=======

All key cache locking is done with a single mutex per key cache: keycache->cache_lock. This mutex is locked almost all the time when executing code in mf_keycache.c. However, it is released for I/O and some copy operations.

The cache_lock is also released when waiting for some event. Waiting and signalling is done via condition variables. In most cases the thread waits on its thread->suspend condition variable. Every thread has a my_thread_var structure, which contains this variable and a '*next' and '**prev' pointer. These pointers are used to insert the thread into a wait queue.

NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only while it is waiting.

Before starting to wait on its condition variable with pthread_cond_wait(), the thread enters itself into a specific wait queue with link_into_queue() (double linked with '*next' + '**prev') or wait_on_queue() (single linked with '*next').

Another thread, when releasing a resource, looks up the waiting thread in the related wait queue. It sends a signal with pthread_cond_signal() to the waiting thread.

NOTE: Depending on the particular wait situation, either the sending thread removes the waiting thread from the wait queue with unlink_from_queue() or release_whole_queue() respectively, or the waiting thread removes itself.

There is one exception to this locking scheme. Each block has a reference to a condition variable (condvar). It holds a reference to the thread->suspend condition variable of the thread that is waiting for the block. When that thread is signalled, the reference is cleared. This is similar to the above, but it clearly means that only one thread can wait for a particular block. There is no queue in this case.
Strangely enough, block->condvar is used for waiting for the assigned hash_link only. More precisely, it is used to wait for all requests to be unregistered from the assigned hash_link.

Wait queues
===========

When a thread needs to wait for another thread, it links its my_thread_var structure into a wait queue and waits on its 'suspend' condition variable. There are two types of wait queues:

- Double linked (**prev, *next) in a ring. This allows a thread to be unlinked from the middle of the ring. The operations are:
  - link_into_queue()
  - unlink_from_queue()
- Single linked (*next) in a NULL-terminated chain. Only the thread at the chain head can be unlinked. The operations are:
  - wait_on_queue()
  - release_whole_queue(), which unlinks all threads and signals each.

The other thread, when releasing the resource, looks up a waiting thread in the related wait queue. It sends a signal to the waiting thread and removes it from the wait queue with unlink_from_queue() or in release_whole_queue() respectively.

NOTE: Since there is only one pair of queue pointers per thread, a thread can be in one wait queue only. A thread is in a wait queue only while it is waiting.

However, there was an anomaly. The resize_queue contains threads that are waiting to resize the key cache. Only one thread at a time can resize the cache. However, in the original implementation the thread that got control for resizing was not removed from the queue. Consequently it must not enter any other wait queue. But this happened even in the original implementation. In flush_key_blocks_int(), when one or more blocks are in eviction, the thread waits until their eviction is complete. The same happens for blocks in flush by another thread in the new design. In both cases the resize_queue was broken. I guess the resizing thread was left in the resize_queue so that there was a way to wake it when it waited for the last threads to leave the cache after the flush was complete.
But this means that the resize_queue served two purposes. It contained threads waiting for resize and one thread waiting for other threads to leave the cache.

The fix for this is to have two queues for the resizing. The thread that starts resizing is now unlinked from the resize_queue. After flushing, if it needs to wait for other threads to complete their I/O, it is linked into the new waiting_for_resize_cnt queue. Since there can be only one thread in this queue, a simple pointer would have done the trick. But I wanted to have the same set of operations for all wait/signal instances. And in the end the queues are implemented by one 'last_thread' pointer only; just linking and unlinking are a bit more complex than necessary.

The resize_queue serves two purposes:

1. Threads that want to do a resize wait there if in_resize is set. This is not used in the server. The server refuses a second resize request if one is already active. keycache->in_init is used for the synchronization. See set_var.cc.

2. Threads that want to access blocks during resize wait here during the re-initialization phase.

When the resize is done, all threads on the queue are signalled. Hypothetical resizers can compete for resizing, and read/write requests will restart requesting blocks from the freshly resized cache. If the cache has been resized too small, it is disabled and 'can_be_used' is false. In this case read/write requests bypass the cache. Since they increment and decrement 'cnt_for_resize_op', the next resizer can wait on the queue 'waiting_for_resize_cnt' until all I/O is finished.

Instead of specially implemented wait/signal constructions, I now use wait_on_queue()/release_whole_queue() like almost everywhere else. As agreed during review, I combined add_to_queue() and the always-following wait loops into a new function wait_on_queue(). I also renamed release_queue() to release_whole_queue() to make it clearer that all entries of the queue are released.
The function now checks for an empty queue internally. Both functions are defined empty if THREAD is not defined. So it is safe to just call the functions in the code without caring about THREAD or about whether the queue is empty.

The queue waiting_for_resize_cnt is used by a resizer to wait for pending I/O after the flush phase. No I/O should be active when memory for the cache structures is re-allocated and/or the key_cache_block_size is changed. I/O operations start with inc_counter_for_resize_op() and end with dec_counter_for_resize_op(). The last I/O, which decrements cnt_for_resize_op to zero, wakes the resizer. Synchronization on this queue is done with wait_on_queue()/release_whole_queue().

The queue waiting_for_hash_link is used in the very improbable case that a thread wants to access a file block which is not in the cache while all hash_links are already in use.

The queue waiting_for_block is used when a thread wants to access a file block which is not in the cache and the LRU ring is empty. This means that all blocks are currently in use for read, write, and/or flush.

Every block has two wait queues:

block->wqueue[COND_FOR_REQUESTED] is used to synchronize secondary requestors of a block. This happens when more than one thread wants to access the same file block before it has been read into the block buffer. The first thread that requests the file block acquires a cache block and starts reading the file block into it. Before it starts to read, it releases the cache_lock. This gives other threads the chance to request the same file block. These need to wait on block->wqueue[COND_FOR_REQUESTED] until the read is complete.

block->wqueue[COND_FOR_REQUESTED] is also used by threads that flush cache blocks. It might happen that a flush is due while another thread is writing into the cache block. The flusher has to wait until this is complete.
block->wqueue[COND_FOR_SAVED] is used to synchronize threads that wait for a block to finish flushing or eviction, or to become free.

Access Methods
==============

There are five operations that request one or more blocks:

- key_cache_read()
- key_cache_insert()
- key_cache_write()
- flush_key_blocks()
- resize_key_cache()

When one of the first three requests a block, its hash_link is marked in use (hash_link->requests++). After they are done with the block, they release the hash_link and signal threads that may be waiting for it (remove_reader()). Waiting for hash_link->requests happens during eviction and freeing (wait_for_readers()).

There was an anomaly. key_cache_write() did not call remove_reader() like the other functions, but decremented hash_link->requests directly and did not wake waiting threads. This is safe as long as key_cache_write() does not need to read a part of the cache block from file. In this case it does not release the cache_lock and no other thread can grab the block for eviction or free. If the block is marked for eviction or free when key_cache_write() requests it, find_key_block() waits until the eviction or free is complete and evicts another block. If key_cache_write() again does not need to read a part of the block, it replaces its contents and decrements hash_link->requests without releasing the cache_lock. However, if only a part of the cache block is to be replaced, and the rest needs to be read from file, then the cache_lock is released for I/O and it is possible that another thread wants to evict or free the block and waits for it to be released. key_cache_write() now calls remove_reader() like the other functions.

hash_link->requests protects a block against reassignment and free. As long as a block points to a hash_link with a request, eviction and free wait for the request to go away before the final reassignment/free. hash_link->requests does not protect a block against flush.
It protects neither against explicit flush nor against implicit flush during eviction.

A similar thing is block->requests. It counts how many threads want to do something with a block. Whenever a thread wants to read, write, insert (load), flush, or free a block, it increments the count and takes the block off the LRU ring. This means that the block is implicitly protected against eviction. After a thread is done with a block, it decrements block->requests. The thread that decrements it to zero puts the block back into the LRU ring (unless it was a 'free' operation).

block->requests protects a block against eviction, even against being selected for eviction. block->requests does not protect a block against flush or free.

In summary, the three standard functions mentioned above implicitly increment both hash_link->requests and block->requests. This protects the blocks against reuse for other purposes. Only flush is allowed in parallel. But flush neither modifies the block contents nor reassigns the block for different purposes.

However, the key cache does not prohibit two write operations from changing the same block at the same time. As mentioned above, there is a single cache_lock (mutex). So the blocks cannot be modified at exactly the same time. But if a write operation changes only a part of the cache block, a read of the rest might be necessary. In this case the cache_lock is released and the other write could do its work. So be warned that proper locking is required in the caller of the key cache operations.

SERIALIZED_READ_FROM_CACHE
--------------------------

It turned out to be no problem to release the cache_lock even while actually copying new contents into the block buffer in key_cache_write().

BLOCK_READ
----------

read_block() and key_cache_insert() set block->status to BLOCK_READ and cleared all other bits. So the information was lost if free() had been called for this block (BLOCK_REASSIGNED was set).
The fix for this is to add the BLOCK_READ bit to block->status without clearing other bits.

offset in key_cache_read()
--------------------------

If keycache->can_be_used is TRUE when entering the function (with the old code), but FALSE after taking the cache_lock (this applies to the new code too), offset can be non-zero when jumping to 'no_key_cache'. In the direct my_pread(), filepos+offset was used. I removed "+offset" and moved the declaration of 'offset' into the inner block. Offset must not be used on the first iteration (the original buffer, length, and filepos values must be used). And it is zero on any later iteration anyway.

length/read_length in key_cache_write()
---------------------------------------

During resize, when find_key_block() returned NULL because we have to write to the file directly, we used 'length' to write the rest of 'buff'. This could mean that we wrote past the end of the block. Though this would not hurt, as we don't write wrong contents, we still should not do it. The next block could be in the cache, and if it is dirty, we need to overwrite its contents anyway. I changed my_pwrite(..., length, ...) to my_pwrite(..., read_length, ...). So I write only until the end of the block during this iteration. If there are more data in 'buff', they will be taken care of in the next iteration, handling a cached block if appropriate.

keycache->can_be_used
---------------------

key_cache_read(), key_cache_insert(), and key_cache_write() used this. If it was false (not set, clear, zero), these functions accessed the index file directly and didn't use a mutex. This was checked outside of the cache_lock. I changed the functions to take the lock if the cache is initialized. They do this before the loop now. They do not release the lock for every iteration. They hold the lock except while doing I/O or copying data from the block buffer. After entering the lock, they check for a resize in progress.
If the flush phase is over, they wait for completion of the resize (the re-initialization). After the flush phase the resizer waits for I/Os to complete. It would not be nice to start another I/O at this moment; otherwise the resizer could easily starve. Thereafter 'cnt_for_resize_op' is incremented so that another resizer can wait for this I/O. The count is decremented before the function returns. Incrementing/decrementing is no longer done in every loop iteration. But it is always done under the cache_lock. So the code after "no_key_cache:" looks somewhat complicated. Since the count is incremented/decremented within the cache_lock and every I/O is counted, the resizer can wait for pending I/O before starting the re-initialization.

keycache->can_be_used is no longer changed during resize. It is set when the cache is [re-]initialized with proper parameters. It is cleared when the cache is [re-]initialized too small and thus disabled. The semantics of keycache->can_be_used is the same as (disk_blocks > 0). I propose to get rid of it. Also there are keycache->disk_blocks and keycache->blocks, both 'int'. I propose to get rid of one of them.

PAGE_WAIT_TO_BE_READ
--------------------

In key_cache_insert() and key_cache_write(), if page_st is PAGE_WAIT_TO_BE_READ, then we must wait for the block in switch to be reassigned and read by another thread. Otherwise we

1. might not know the correct hash_link to unregister our request from,
2. might write new data into the wrong block, or
3. might write into a block that is read from file afterwards.

Proposals:

- Change the 'wrmode' parameter of find_key_block() to 'mode'. Have modes MODE_READ, MODE_WRITE, and MODE_INSERT. For the latter, return NULL if the block is already in the cache or another thread is already reading it. Unregister the hash_link request in this case.

- In cases where PAGE_WAIT_TO_BE_READ would be returned, wait in find_key_block() until the block is read.
The respective branch in read_block(), as well as all respective handling in key_cache_read(), key_cache_write(), and key_cache_insert(), can then be dropped.

Resizing
========

Resizing consists of two major phases:

- The flush phase: flush all dirty cache blocks and free all blocks.
- The re-initialization phase: re-allocate memory structures and set new parameters, most notably key_cache_block_size.

The basic operation is as follows:

1. the cache_lock (mutex) is taken,
2. the cache is marked in resize and in flush for resize,
3. all dirty (changed) blocks are flushed,
4. all blocks are freed,
5. the cache is unmarked flush for resize,
6. the resizer waits for I/O that bypassed the cache in the flush phase,
7. all cache structures (except the root KEY_CACHE and the mutex) are freed,
8. the cache structures are reallocated with the new size,
9. the cache structures are initialized so that the cache is usable again,
10. the new parameters are set,
11. all threads waiting for the end of the resize are awakened, and
12. the cache_lock is released.

Resizing the key cache is implemented in an online fashion. Normal MyISAM operation is not stopped.

Resizing 1: no new writes, but new reads
----------------------------------------

It may not be a good idea to disallow new changed blocks in the cache while allowing new clean blocks. (This was the old strategy.) On a write request we tried to find the block in the cache. If it was not there, we immediately went to write directly. If it was present and clean, we freed the block and wrote directly. If it was present and dirty, we worked on the dirty block (this was new with my former patch; formerly we even freed it!). When going to write directly, we release the cache_lock. This allows another thread to read the same block. It will not find it in the cache, but it may find a free block or evict a clean block. In both cases it can start reading the block from file without waiting for anything first.
If the second thread is "in luck", it reads the block (with old contents) from file before the first thread has even started to write. When the first thread completes its write, we have a new block in file and an old block in cache. If another change of this block is due before the resizing finishes by freeing all blocks, this change will be done based on the old contents from the cache. So IMHO we should not allow new clean blocks into the cache during resize either. Or we should allow new dirty blocks to go in during resize too. But the latter could make resizing an infinite process.

Note that no two threads can access the same index block at the same time because they have to have a "key_root_lock" on the index. A new read of an index block should only be possible after the other thread has completed all writes and unlocked the "key_root_lock". However, if key_cache_block_size is big enough to contain data from two or more indexes, this can fail badly.

I implemented it so: During resize

- read requests for a non-cached block read from file directly,
- write requests for a non-cached block write to file directly,
- read requests for a dirty block use the block,
- write requests for a dirty block use the block,
- read requests for a clean block use the block,
- write requests for a clean block free the block and write to file directly.

Resizing 2: flush_key_blocks_int() and key_cache_write()
--------------------------------------------------------

After a week-long struggle with index corruptions I noticed that the parallelism between flush and write of the same block could lead to inconsistent buffer contents in the index file. If the write modifies the buffer contents while it is being written (flushed) to file, then the write finishes first, because it did not release the cache_lock. (Even if it did, it set BLOCK_CHANGED before modifying the buffer.) When the flusher acquires the cache_lock after the write, it clears BLOCK_CHANGED.
This allows the block to be evicted or freed without another flush. The change of the last writer is lost. In the worst case parts of the changes are included, leaving an inconsistent block behind.

The above could even happen in normal cache operation, even with 1K cache block size. The normal flush after a statement takes place after the thr_lock has been released. Another thread can modify the cache blocks while the flush is going on.

Different situation, same cause: Assume a cache with all blocks clean. A writer evicts a block. It wants to overwrite only a part of the block, so it has to read the new block first. It releases the cache_lock for reading. A resizer sets resize_in_flush. It does not find dirty blocks. It starts to free all blocks. When it comes to the one the writer is reading, it has to wait until the writer releases the block. The writer awakes from reading, gets the cache_lock, and wakes the waiting resizer. The resizer awakes inside free_block(). It frees the dirty block.

I fixed these with a pair of block status flags: BLOCK_IN_FLUSHWRITE and BLOCK_FOR_UPDATE. The former is set by a flusher right before it releases the cache_lock for writing the block. The latter is set by a writer before it releases the cache_lock. Both check for the other flag and wait on block->wqueue[COND_FOR_REQUESTED] (flusher) or block->wqueue[COND_FOR_SAVED] (writer).

BLOCK_IN_FLUSHWRITE: This is set by the flusher right before the cache_lock is released for writing. It is cleared right after writing. key_cache_write() can proceed if BLOCK_IN_FLUSH is set but not if BLOCK_IN_FLUSHWRITE is set. If BLOCK_IN_FLUSHWRITE is set, it waits on block->wqueue[COND_FOR_SAVED].

BLOCK_FOR_UPDATE: This is set in key_cache_write() as soon as possible before the cache_lock is released for waiting or for copying the buffer contents. It is cleared after the buffer is finally changed. It is used by a flusher to skip the block in a flush burst.
flush_key_blocks_int() repeats its search for changed blocks and will finally find the skipped block again. To avoid an infinite loop, it remembers one of the blocks marked BLOCK_FOR_UPDATE and waits on block->wqueue[COND_FOR_REQUESTED] for its modification to complete. When signalling COND_FOR_REQUESTED, key_cache_write() does not wake threads waiting for the block to be flushed. BLOCK_FOR_UPDATE is also used to prevent free_block() from being called. The block requires a(nother) flush before being freed.

Resizing 3: BLOCK_IN_FLUSH
--------------------------

We initially had a loop over the *number* of changed blocks when flushing. One block from the LRU ring was checked for its file descriptor and all blocks for this file were flushed. All blocks were selected from the changed_blocks hash that belonged to this file. These blocks were marked for flush and put in an array for a write burst. The array was sorted by increasing write position. Every block was written. Before every write the cache_lock was released and taken back after the write. Only during these write periods were parallel cache operations possible. After taking back the mutex, all threads waiting for the block were signalled and the block was freed.

One big problem here was that it was not checked if the block had already been marked for flush before. At the end of most writing SQL statements, during unlock of a MyISAM table, or during the final close of the table, the index blocks for this table are flushed. This flush can happen in parallel with the big flush from the cache resizing. So it was possible that two threads flushed the same blocks. During the write burst, after freeing the flushed block and before writing the next one, no check was done if the block had been freed in between.

Another place where changed blocks are flushed is cache eviction. A block with changed data may be selected for eviction (***). It has to be flushed before it can be reused for another file block.
In this situation it was neither checked if the block was marked for flush already, nor was it marked before writing. But during the write for eviction the cache_lock was released too. In this situation the "flusher" could have completed its write and freed the block.

Resizing 4: BLOCK_IN_EVICTION
-----------------------------

(***) CORRECTION: When a block is marked BLOCK_IN_FLUSH by a flushing thread, a request is registered against the block. This unlinks it from the LRU ring so that it cannot be selected for eviction. If it had been selected for eviction before, it had been marked BLOCK_IN_SWITCH and thus would not be selected for flush.

However, there was one exception. If the LRU ring was empty and a block was requested for eviction, then the next block that is put on the LRU ring is immediately attached to the requesting hash_link and the requesting thread is signalled. From this moment until the requesting thread awakes, a flusher could select the block for flush. The block was not marked in any way during this interval. Though it is not on the LRU ring, it could still be selected for flush. If it is unchanged, it could even immediately be freed by the flusher. When the evicter awakes, it could find a block in flush, a free block, or even a block which has been reused for something else after being free.

NOTE: This horrible scenario can only happen if the LRU ring got empty. In normal operation this happens only if there are fewer cache blocks than threads. Every thread works on one block at a time only. This applies to key_cache_read(), key_cache_insert() and key_cache_write(). An exception is flush_key_blocks(). During flush a thread can take a lot of blocks from the LRU ring. The same goes for resize_key_cache(). I fixed the bug with a new status flag, BLOCK_IN_EVICTION. See the section BLOCK_IN_EVICTION below.

Resizing 5: flush_all_key_blocks()
----------------------------------

The number of changed blocks was not reliable.
It was not safe to use it as an indicator whether the flush was complete. I saw infinite loops. The outer loop thought that there should be changed blocks left, but the inner loop didn't find any. Also, blocks are not always in the LRU ring. I saw crashes due to an empty LRU ring but with a non-zero number of changed blocks. However we could not have an infinite loop here, as flushing included freeing of clean blocks for the file too. A second peek in the LRU ring could not see the same block again.

The fix for this is to loop over all chains of the changed_blocks hash until they all are empty, then free all clean blocks. Changed blocks are always in the changed_blocks hash (***). The additional advantage of this two-step approach is that the flushed blocks do not need to be freed in the first step. They can still be used for reading until they are freed in the second step.

Resizing 6: flush_key_blocks_int()
----------------------------------

(***) There is one exception. Dirty blocks in eviction, which need to be flushed, are temporarily put into a "first_in_switch" chain within flush_key_blocks_int(). But since flush_key_blocks_int() waits until their eviction is done (which makes them clean blocks), this is irrelevant here. Unfortunately there is another chance for failure. A second flusher could move the block from that "first_in_switch" chain to its own "first_in_switch" chain. The first flusher could then believe all changed blocks of "its" file are clean now and finish prematurely. I strongly recommend getting rid of the temporary "first_in_switch" chain. Using BLOCK_IN_FLUSH also during eviction makes this possible.

Flushing of the same file by two threads (the table "owner" and the resizer) can also lead to a situation where one has to wait for the block and the other one frees it after use. This was not taken into account in wait_for_readers(). The fix for this is to check if the block still references a hash_link.
While freeing clean blocks, the file_blocks chain was traversed without a check if the 'next' block had been moved from its place in the chain while free_block() had to wait for readers. And it was not checked if the block had already been marked to be freed (BLOCK_REASSIGNED) by another thread.

The fix for this is two nested loops. The outer loop runs until no more blocks to be freed are found. The inner loop traverses the file_blocks chain. Whenever a block needs to be freed, the state of the 'next' block is saved and checked after free_block(). If it changed, the loop is restarted because the 'next' block may no longer be at the old position in the current chain. Also we need to skip all blocks that are marked for eviction or to be freed. It could even happen that a write started before the start of the flush. It could still be in progress on a clean block. We have to skip this too. Whenever we found something to skip, we need to restart the flush from the beginning (checking for dirty blocks). These writes could move a block from the file_blocks hash to the changed_blocks hash.

flush_key_blocks()
------------------

While waiting for the cache_lock in flush_key_blocks(), the cache could become disabled. I won't try to flush in this situation any more. I now check in line with the other global functions. At first, keycache->key_cache_initialized is checked. If this is true, the cache_lock is taken. Only then is it checked if the function can proceed.

Miscellaneous problems
======================

BLOCK_IN_EVICTION (Eviction with empty LRU ring)
------------------------------------------------

When a block needs to be evicted, the least recently used block is taken from the LRU ring. If the LRU ring is empty, the evicting thread needs to wait for a block to be put on the LRU ring. (BTW, even free_block() puts the block on the LRU ring and takes it back again without intermediate release of the cache_lock.) A block is put in the LRU ring by link_block().
Before it links the block in the LRU ring, it checks if threads are waiting for a block. If so, it does not link the block in the LRU ring, but assigns it to the hash_link of a waiting thread instead. Then it registers a request on the block for every thread waiting on the same hash_link and wakes them all. In this case nothing was set in the block_link.

The most obvious change would be to set the flag BLOCK_IN_SWITCH. But this cannot be done in link_block(). Multiple threads may be waiting for the block. Only the first one must execute the eviction. All others have to wait. The logic is so that the first thread sets BLOCK_IN_SWITCH and the others wait when they see it. The outcome was that nothing in the block indicated that it was selected for eviction until the first of the requesting threads set BLOCK_IN_SWITCH.

When the thread that called link_block() releases the cache_lock, another thread continues to run. This could be any thread other than one of those waiting for the selected block. This thread could want to do all sorts of things with the selected block, for example free it. free_block() could not detect that the block had been selected for eviction. In theory it could detect the additional request(s) on the block (one request must be registered on the block anyway before it can be freed). But this was not done in the original code.

I fixed this with a new block->status flag: BLOCK_IN_EVICTION. It is set by link_block() when the block is not linked in the LRU ring, but assigned to a requesting hash_link. I check it in free_block() after unreg_request(). If set, I stop modifying the block in any way. I use that flag also to prevent calling free_block() for a block that has been selected for eviction already.

Proposal: We could save a lot of hassle in the code if the cache behaved much more stupidly in the extreme case of an empty free list, all blocks in use, *and* an empty LRU ring. If there is no block available, the thread needs to wait on waiting_for_block.
This is no change yet. But instead of special handling in link_block() and free_block(), I would just put the block in the LRU ring or in the free list respectively and wake all threads from waiting_for_block. The threads that awake restart their search for a block from the beginning (get hash_link, ...). The first thread that awakes finds the block and starts to use it. Later awaking threads either fail and wait again, or they are in luck if they wanted to access the same block as the other thread, or there is yet another block available meanwhile.

Admittedly, this could produce more CPU load in a situation where the system is short of cache blocks anyway. But OTOH this is a very seldom situation, and it requires a far too small configured cache. With all the bugs mentioned in this document fixed, I did not get a single gcov count on the lines that handle the empty LRU ring. Not even with an 8-block cache and 16 threads competing for them. But all the fixes around BLOCK_IN_EVICTION would be unnecessary. Also the special handling in link_block() (which does not link the block in the LRU ring in this case), and the uncommon use of thread->opt_info, which require a lot of commenting, would become obsolete.

free_block()
------------

free_block() called unlink_hash() before handling the LRU stuff mentioned above. Since we need the old hash_link when the block is handed over for eviction, I moved this call down behind all other unregs and unlinks.

LOAD INDEX INTO CACHE
---------------------

Caveats of the current implementation:

1. All indexes must have the same block size.
2. All blocks of the table are flushed with FLUSH_RELEASE first.
3. IGNORE LEAVES won't work well with big cache blocks (see below).
4. If the index file is bigger than the cache, it could evict self-loaded blocks.

Advantage of the current implementation:

- Bigger read size.

Proposal: MyISAM hands over to the key cache a list of (position, length) for the blocks to load.
The key cache computes maximum chunks to read while excluding file blocks already in the cache. The key cache stops when all blocks are used by this file, or, in other words, when as many blocks for this file have been requested as there are cache blocks. So it would not evict pages loaded by itself.

mysql_preload_keys() took a TL_READ lock. Concurrent inserts could modify index blocks in the cache, but mi_preload() reads directly from file, bypassing the cache. In theory it should not overwrite cache blocks that have valid data already. But the dirty blocks could be evicted (with flush) after mi_preload() read the data and before key_cache_insert() tried to load that block. key_cache_write() assumed that no readers of "its" block exist. Usually a "key_root_lock" assures this. Hence it did not wake readers of the block. But mi_preload() does not take a "key_root_lock" and thus could become a secondary requestor of the block that key_cache_write() was writing. This made me experience a real deadlock. (See later that key_cache_write() must also wake readers because big key cache blocks could contain data from two or more indexes.) The fix is to use TL_READ_NO_INSERT for preload. The alternative would be to acquire the "key_root_lock" for all indexes for the whole preloading.

While key_cache_insert() does not overwrite the contents of valid blocks, it did overwrite blocks with the status PAGE_WAIT_TO_BE_READ. This means blocks that are being read by another thread. This does also happen for blocks to be changed by another thread. And it could affect blocks that are still assigned to other file blocks (during flush for eviction). I fixed it so that key_cache_insert() fills blocks with PAGE_TO_BE_READ only. Pages with PAGE_WAIT_TO_BE_READ will be read by another thread anyway. Pages with PAGE_READ are in the cache already.

With key_cache_block_size > maximum index block size of this table, it was a problem that key_cache_insert() ignored existing blocks (PAGE_READ).
The data is loaded in chunks of 32K by default. It starts after the index file header. This means that for every chunk the first and last of the affected cache blocks are filled only partially. Every succeeding chunk finds its first block in the cache already (partially filled from the previous chunk), and ignores it. Because we allow reads while loading, it can happen that a reader finds one of those partially filled blocks. If it wants data from this block beyond what has been loaded so far, it reports an error.

If a block is in the cache, it must contain as much data as is available for it in the file. This means that every block except the last one of the file must be completely filled. Consequently, key_cache_insert() needs to read the block itself if the caller supplies less data than the block size. But this would clearly slow down the loading. It would happen very often in normal operation with a big key_cache_block_size. There is no way around this for LOAD INDEX ... IGNORE LEAVES. But without IGNORE LEAVES we can assure that the loaded chunks are aligned with the cache blocks. I included some test code for it (#if !defined(INGO_TEST_LOADIDX_OFF)) in mi_preload.c and a comment in mf_keycache.c. It works. But since it is not strongly needed, I left the defines in the code so that I can remove it quickly. key_cache_insert() does not try to load any index data if it detects a disabled key cache or a resize in progress.

Note that IGNORE LEAVES may not work as expected if cache blocks are bigger than the smallest index blocks. Depending on the difference in size we may not save much. The key cache cannot read partial cache blocks. If key_cache_insert() is given only a part of a cache block, it will read the whole block itself. To avoid double reading of the index file, we must always supply a full cache block to key_cache_insert(). But to check if we have branch index blocks, we must work on index blocks.
The required algorithm is more complex than the one currently existing in mi_preload().

DBUG_ASSERT()
-------------

To be able to assert more block status requirements, I needed to reorder some initializations and deinitializations.

Statistics
----------

keycache->global_cache_w_requests was counted twice during resize (in the !block branch in key_cache_write()) and not counted in myisamchk (in the special branch at the top of key_cache_write()). keycache->global_cache_read was counted for primary and secondary requests in find_key_block(). I moved this to read_block(), into the branch for primary requests where the real read is done. I do also increment it in key_cache_insert() for every requested block. The read has already been done by the caller in this case.

=============================
=============================

OPT_KEY_CACHE_BLOCK_SIZE
------------------------

This option is defined with a subtraction of MALLOC_OVERHEAD:

  {"key_cache_block_size", OPT_KEY_CACHE_BLOCK_SIZE,
   "The default size of key cache blocks",
   (gptr*) &dflt_key_cache_var.param_block_size,
   (gptr*) 0, 0, (GET_ULONG | GET_ASK_ADDR), REQUIRED_ARG,
   KEY_CACHE_BLOCK_SIZE, 512, 1024*16, MALLOC_OVERHEAD, 512, 0},

This means that an option of --key_cache_block_size=1024 results in key cache blocks of 512 bytes size. To get 1024 byte blocks one must request --key_cache_block_size=1032. With -DSAFEMALLOC it is even --key_cache_block_size=1060. I believe this is a bug. At least it is not documented.

Another thing is that we do *not* enforce it to be a power of two, just a multiple of 512. But we use constructs like this on it:

  offset= (uint) (filepos & (keycache->key_cache_block_size-1));

I verified by testing that the key cache does not work with a block size of 12K, for example. BTW, this can also be configured with a SET statement.

BLOCK_ERROR
-----------

Eviction of a changed block: If my_pwrite() fails, the block is marked BLOCK_ERROR, but reassigned anyway.
This makes the evicting thread mark 'its' table as crashed. This may be the wrong table. The block contents are not changed, but the block is assigned to the wrong table. If this block is found by another thread (who did not notice the crash of the table yet), it finds it correctly assigned to the hash_link but not marked BLOCK_READ. It concludes that it must be a block in switch and behaves as a secondary reader (PAGE_WAIT_TO_BE_READ). Fortunately BLOCK_ERROR is still set. So it wouldn't wait, but report an error.

I propose to

1. abstain from reassigning the block in error,
2. leave it in the changed_blocks hash so it is due for another attempt to flush,
3. not put it back on the LRU ring so it cannot be evicted again,
4. set BLOCK_ERROR, and
5. clear BLOCK_IN_SWITCH and let the evicter select another block.

When a thread tries to access that block (read, write, insert, flush), the error would be reported back to the caller and the right table would be marked crashed. At flush, the block would be tried again. If the write fails again, it would be freed (FLUSH_RELEASE) or left untouched (FLUSH_KEEP). If the error flush happens in a resize operation, the cache should stay in the 'resize_in_flush' state until all erroneous blocks are flushed by the right thread(s). Direct I/O is allowed in this state.

A potential problem with my proposal could arise if all blocks get marked BLOCK_ERROR. All attempts to evict a block would stall. A detection of this situation could be implemented. Any attempt to evict a block would then be redirected to bypass the cache. To provoke these errors, error injection is required.

global_blocks_changed
---------------------

Used for SHOW STATUS like 'Key_blocks_not_flushed'. It is incremented/decremented together with blocks_changed, so it shows the current number of dirty blocks. But global_blocks_changed is cleared in reset_key_cache_counters(), so it can wrap around (underflow) on the next decrement.
And global_blocks_changed is not cleared in end_key_cache(). This should be safe if it was counted correctly and not reset by reset_key_cache_counters(). Otherwise a wrong count could survive resize_key_cache() if a too small cache size has been requested. This sounds hypothetical, but I saw:

| Key_blocks_not_flushed | 103067 |
| Key_blocks_unused      | 0      |
| Key_blocks_used        | 29     |

in several test runs. This should not normally be possible. I found a source of incorrect counting, introduced by myself. Now SHOW STATUS looks reasonable. But my question is still valid: What do we want to show with SHOW STATUS like 'Key_blocks_not_flushed'? Is it correct to handle it differently from 'blocks_changed'? I propose to get rid of 'global_blocks_changed', use 'blocks_changed' instead, and not clear this count in reset_key_cache_counters(). Or, to get fewer changes elsewhere in the server, get rid of 'blocks_changed' and count 'global_blocks_changed' where 'blocks_changed' is counted now.

reg_requests()
--------------

I propose to change the name to reg_block_request() to avoid confusion with hash_link requests. Use the singular, as it is always called for one request only. Also it would be nice to have hash_link->h_requests and block->b_requests for easier searching of one of these variables.

remove_reader()
---------------

This function takes a 'block' as argument, though its purpose is to decrement a request from a hash_link and to wake threads waiting on the hash_link. A 'block' is needed because the condition to signal is block->condvar. This means that the function can be used only after a block is assigned to the hash_link. The mere fact that the hash_link points to a block is insufficient because it could be a block in eviction for this hash_link. In this case block->condvar cannot be used because it belongs to another hash_link. One would signal the wrong thread.
Moreover, block->condvar is only used for the purpose of waiting for the referenced hash_link to become "unrequested". The function would be more flexible if it took a 'hash_link' and the condvar were located in the hash_link. This would come in handy in key_cache_insert() when it detects that a block for the file/pos is already in eviction. It could just unregister its request and proceed. Now it needs to wait until the block is assigned before it can unregister. For a "casual reader" (or a "beginner") it might even be easier to understand what remove_reader() and wait_for_readers() really do. Additionally, it would be more in line with other wait/signal events if we used a wait queue instead of a condvar, though there should never be more than one thread waiting for a hash_link to become "unrequested". The overhead would be small, but another anomaly would vanish, thus saving a casual reader some thinking.

volatile
--------

After seeing unexplainable crashes, I declared most of the members of KEY_CACHE, HASH_LINK, and BLOCK_LINK volatile. This seemed to improve the situation. I always got reasonable output from gdb since. After I got a working patch, I removed the volatile keywords again. The system runs as stably as before. So I'll forget about volatile.

Background key cache flusher
----------------------------

Every N seconds ensure that the M least recently used blocks are "clean".

Multiple resize flushers
------------------------

Since index files could be on different disks, we may want to allow multiple threads to do the flush.

Maximum cache size
------------------

Enlarge some variables to allow key_buffer_size > 4G.

hash_links hash
---------------

Even if we make the hash as big as there are file descriptors available, we can have quite long chains of hash_links for each file. Beginning with a certain size of the key cache, it might become better to have a more complex access structure: either another hash per file, or an ordered chain with jump pointers.
We should profile get_hash_link() to see if we have a problem there.

========================
========================

Possible improvements (old suggestions)
---------------------------------------

1.) Block hashes. The changed_blocks and file_blocks hashes are indexed by file descriptor. Each of their chains can contain blocks for several files. It might be more efficient to have file_heads in these chains and blocks for one file only chained below the file_heads. This would make flushing faster. Flushing is done per file and so could use up a whole chain without caring about skipping other files' blocks.

  file_blocks[i] -> file_head(17) -> file_head(21)
                         |               |
                         v               v
                    block_link
                         |
                         v
                    block_link

The same could be done for hash_links.

  hash_root[i] -> hash_link_file(17) -> hash_link_file(21)
                         |                    |
                         v                    v
                 hash_link_pos(0)
                         |
                         v
                 hash_link_pos(1024)

You can get the same effect by just making key_cache->file_blocks[] as big as the maximum number of file descriptors.

2.) It is possible that the new algorithms make the use of the resize counter (cnt_for_resize_op) unnecessary. Flushing is done so that at its end all blocks are free. It waits for all users of pages before freeing them. So there cannot be any pages in use any more. The only place where it might be relevant is in flush_key_blocks(). This should be checked. I think the above is true and you may do another patch for this (not to be added to 5.0).

I now use this to wait for I/O bypassing the cache. MyISAM level locking is not sufficient if key_cache_block_size is bigger than the smallest index blocks. We must not restart cache operation while I/O for parts of the file blocks is still in progress. Another thread could want to access another part of the same file block and thus needs to read the whole block while a direct writer is still writing his part of the block. The block contents would be unpredictable. I believe this was also the idea of the original code. It did also wait for cnt_for_resize_op between the flush phase and the re-initialization.
However, it did not count direct I/O.

Concerns (old)
--------------

my_thread_var: The key cache uses the condition variable of the my_thread_var structure for waiting. This could also be used elsewhere in the MySQL server. We should check that stray signals cannot confuse the key cache code at any place (that is, check if wait loops are always used correctly). And there should be a server-wide rule for how to use the 'next' and 'prev' pointers of my_thread_var. The key cache code cannot protect against abuse of these pointers. It may even be good to make the wait_for_event()/signal_event() function pair global mysys functions, to be used for all waiting/signalling that is based on my_thread_var.

On the other hand, I agree the key cache code should be safe against stray signals...