1) The size of the open_cache is normally only decreased in open_table, just before adding another entry to it. However, if all entries are in use and the unused_tables linked list is empty, only the add happens. Thus the open_cache always grows to the size of its largest concurrent usage and never shrinks from there. Is this on purpose or by mistake?

2) unused_tables is a doubly linked list containing all unused table entries in the open_cache. It is a global variable, protected by LOCK_open.

3) The open_cache is a global hash table of table entries that are either in use or cached for future use. The hash key is database_name\0table_name\0, and the key length includes both NUL bytes. There can be several entries with the same key in the open_cache.

4) Open temporary tables are kept only on the thread object, in a doubly linked list with thd->temporary_tables as the first pointer. The key in this list is the name of the temporary table plus 4 bytes of server id and 4 bytes of pseudo_thread_id.

5) refresh_version is a global variable used to trigger metadata changes on a table. It is changed by FLUSH TABLE commands, which call close_cached_table. Each table object contains a refresh_version used to check whether the table needs to be reopened. The thd object also contains a refresh_version variable, called version. Thus, to ensure that a specific table is no longer used (after changing its definition in one way or another), it is necessary to use refresh_version. close_cached_tables does this job for you, using a TABLE_LIST and some other support variables. It can also wait until all tables have been removed from the cache, if told to do so.

6) Description of open_table

open_table first looks for:
- Temporary tables (list on thd->temporary_tables)
- Locked tables (list on thd->locked_tables)

If the thread has at least one locked table it can ONLY use locked tables.
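As a sketch of the key layout described in 3) (make_cache_key is a hypothetical helper for illustration, not a server function):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (not server code) of the open_cache key layout:
   database name, '\0', table name, '\0'.  The key length counts both
   NUL bytes. */
static size_t make_cache_key(char *buf, const char *db, const char *table)
{
  size_t db_len  = strlen(db);
  size_t tbl_len = strlen(table);
  memcpy(buf, db, db_len + 1);                   /* "db\0"    */
  memcpy(buf + db_len + 1, table, tbl_len + 1);  /* "table\0" */
  return db_len + tbl_len + 2;                   /* key length */
}
```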
In 5.0 this code has been expanded to avoid deadlocks when using stored procedures, since all tables possibly used in a stored procedure must be locked before the execution of the stored procedure starts. There is also a hack to handle the opening of a view when the thread holds locked tables.

Next it acquires LOCK_open (the global lock).

Refresh check overall
---------------------
First it checks whether refresh_version has been updated since the thread started opening tables. If so, it returns and the opening of the tables is restarted (when called from open_tables, the normal case). This check is performed for every table except the first; for the first table, thd->version is set instead. It is an important check for metadata updates, but it only takes effect when a general reset of all tables has been requested by a FLUSH command. The reason this code is needed is the case where the cache holds no entries for this table: we still have to check for a general reload of all tables, and in that case we need to release the tables opened so far. Thus this check could be taken out of the normal path and moved to the else branch taken when no cached table is found. TODO RONM: Check this part of the code with Monty to see what could be changed here.

Then it flushes any tables used by the handler interface, if there are any handler tables.

Loop over all tables in the cache with the same table name
----------------------------------------------------------
It then checks for opened tables in the cache that are NOT IN USE. For each table found with the same name, it checks whether the table needs to be refreshed. This is necessary since the table can lie around in the cache for a while after it was changed, and it is important to refresh it before using it again.

The normal path of the loop
---------------------------
When it finds a table that is neither in use nor in need of a refresh, it exits the loop and unlinks the table from the unused_tables list.
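The cache scan just described can be sketched as follows (find_unused_table and the entry layout are hypothetical illustrations, not the server's actual structures; the real code triggers the refresh handling rather than merely skipping stale entries):

```c
#include <stddef.h>

/* Hypothetical sketch (not server code) of the cache-scan logic:
   walk all entries with the matching key and pick one that is
   unused and does not need a refresh. */
struct table_entry {
  struct table_entry *next_same_key; /* hash chain for one key   */
  int in_use;                        /* owned by some thread?    */
  unsigned long version;             /* entry's refresh_version  */
};

static struct table_entry *
find_unused_table(struct table_entry *chain, unsigned long refresh_version)
{
  struct table_entry *t;
  for (t = chain; t; t = t->next_same_key)
  {
    if (t->version != refresh_version) /* stale: the real code starts */
      continue;                        /* refresh handling here       */
    if (!t->in_use)
      return t;                        /* normal path: unused, fresh  */
  }
  return NULL;                         /* fall through: open new table */
}
```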
It sets the table as in use by this thread and releases the global mutex LOCK_open.

Normal exit of the subroutine
-----------------------------
When called from open_tables (refresh != NULL), it links the table into the open_tables singly linked list on the thread object. It initialises the lock type to READ, performs some activities regarding table names and aliases, and finally initialises a set of variables on the table object.

Another common path through the code
------------------------------------
Another common path is that no table with this name is found in the cache. In this case we first limit the growth of the table cache by deallocating the head of unused_tables (which unfortunately is also the Least Recently Used). I would vote for keeping both a first and a last reference here, deallocating from one end and inserting the most recently released entry at the other. Then the code goes on to open the frm file and create a new table object. There is some code to handle views, which needs further understanding of the view logic. It then sets up the shared object and inserts the new table object into the table cache. Then it rejoins the normal exit of the subroutine.

Handling refresh when discovered in the loop
--------------------------------------------
This code is of utmost importance for understanding how to perform advanced metadata updates, although it is only rarely exercised. The refresh code starts by closing all already opened tables that are in need of a refresh. Before proceeding it waits for other threads to complete their use of the found tables (if called from open_tables). Then it signals refresh to open_tables (if called from there) and returns failure to indicate that the normal path wasn't taken.
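The first-and-last scheme argued for above might look like this sketch (hypothetical helper names, not the server's list handling): released entries go in at the tail, eviction takes the head, so the head is always the least recently used entry.

```c
#include <stddef.h>

/* Hypothetical sketch of an unused list with both head and tail
   pointers: push released entries at the tail, evict from the head. */
struct unused_entry { struct unused_entry *next, *prev; int id; };

static void push_tail(struct unused_entry **head, struct unused_entry **tail,
                      struct unused_entry *e)
{
  e->next = NULL;
  e->prev = *tail;
  if (*tail)
    (*tail)->next = e;   /* append after current tail */
  else
    *head = e;           /* list was empty            */
  *tail = e;
}

static struct unused_entry *pop_head(struct unused_entry **head,
                                     struct unused_entry **tail)
{
  struct unused_entry *e = *head;  /* least recently used entry */
  if (!e)
    return NULL;
  *head = e->next;
  if (*head)
    (*head)->prev = NULL;
  else
    *tail = NULL;                  /* list became empty         */
  return e;
}
```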
7) Lock Tables processing

(We currently skip the handling of prelocking for stored procedures in this description.)

First the set of real tables is derived from the table list (derived tables, views and information schema tables are not "real" tables). The actual lock processing is performed by mysql_lock_tables. Queries that use locked tables do not use mysql_lock_tables, since they already hold locks on all involved tables before the query starts. This is explained in a separate section on the handling of prelocked statements.

Step 1 (in get_lock_data): Calculate the number of locks and the number of tables involved in the lock request. (Temporary tables are neither locked nor counted.)

Step 2 (in get_lock_data): Allocate memory for the locks and the list of table references.

Step 3 (in get_lock_data): Get references to all involved table locks by calling store_lock on each table handler involved. store_lock will also convert the lock to the appropriate level for the handler in question.

Description of how various handlers change lock mode in store_lock
------------------------------------------------------------------
BDB converts all write locks except TL_WRITE_ALLOW_READ and TL_WRITE_ONLY to TL_WRITE_ALLOW_WRITE. When tables are reopened after refresh_version has been updated, or during LOCK TABLE processing, BDB does not change the lock type.

The Federated handler makes the same changes as BDB, with two exceptions: 1) If thd->tablespace_op is set, no changes are made. (This must be an over-precaution, since the Federated handler cannot be involved in tablespace operations and thus this variable can never be set.) 2) It also converts TL_READ_NO_INSERT to TL_READ (under the same condition as the BDB conversions above), to allow concurrent inserts into t2 in the query INSERT INTO t1 SELECT ... FROM t2. (This is most likely a bug.)

InnoDB changes the locks like the Federated handler does (or rather the other way around), with the addition that it doesn't change locks for the SQL command TRUNCATE TABLE.
The reason for this is most likely that InnoDB's Consistent Read is based on transaction handling, and TRUNCATE TABLE probably does things on a lower level.

NDB converts all write locks (except TL_WRITE_ONLY) to TL_WRITE_ALLOW_READ, with the same exception as BDB for LOCK TABLES and reopened tables, and it also changes TL_READ_NO_INSERT to TL_READ (NDB allows concurrent inserts, since no SQL statement is guaranteed to be consistent anyway). It is most likely a bug that NDB also allows concurrent updates while performing ALTER TABLE (it sets TL_WRITE_ALLOW_READ), since this will create havoc in the new table after ALTER TABLE is completed. (Naturally this can still happen from another MySQL server, but that can be avoided by NDB's option of allowing only one server to be connected under a certain mode of operation.)

The MyISAM Merge handler and the partition handler are a bit special here, since they derive their store_lock lock_type from the underlying handlers. Currently both do this by calling store_lock on all underlying handlers. The partition handler could optimise here by only calling the underlying handlers that do some real work, since it is not possible to access the underlying handlers directly through SQL statements.

MyISAM and all other handlers in the tree set the lock type directly, unless they are told to ignore the request through TL_IGNORE or the lock type has already been set by a previous call (the same table can be present in a query more than once, through different handlers that all reference the same lock object).

Step 4 (in mysql_lock_tables): Check whether a global read lock has been issued, we are about to do a write operation, and we are not to ignore global read locks. ....

Step 5 (in lock_external): External locks are acquired on all tables. If the lock mode is TL_READ, TL_READ_WITH_SHARED_LOCKS, TL_READ_HIGH_PRIORITY or TL_READ_NO_INSERT, the external lock mode is F_RDLCK; for higher lock modes (>= TL_WRITE_ALLOW_WRITE) it is F_WRLCK.
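The two mappings above can be sketched together (hypothetical helper names; the enum order loosely mirrors the server's thr_lock_type but omits TL_IGNORE and TL_UNLOCK):

```c
#include <fcntl.h>  /* F_RDLCK, F_WRLCK */

/* Hypothetical sketch (not server code) of the BDB-style store_lock
   downgrade and the Step 5 external lock mode selection. */
enum thr_lock_type {
  TL_READ, TL_READ_WITH_SHARED_LOCKS, TL_READ_HIGH_PRIORITY,
  TL_READ_NO_INSERT, TL_WRITE_ALLOW_WRITE, TL_WRITE_ALLOW_READ,
  TL_WRITE_CONCURRENT_INSERT, TL_WRITE_DELAYED, TL_WRITE_LOW_PRIORITY,
  TL_WRITE, TL_WRITE_ONLY
};

/* BDB rule: relax write locks to TL_WRITE_ALLOW_WRITE, except
   TL_WRITE_ALLOW_READ and TL_WRITE_ONLY, and never during LOCK TABLES
   or when tables are reopened after a refresh. */
static enum thr_lock_type
store_lock_convert(enum thr_lock_type lt, int in_lock_tables_or_reopen)
{
  if (!in_lock_tables_or_reopen &&
      lt >= TL_WRITE_ALLOW_WRITE && lt < TL_WRITE_ONLY &&
      lt != TL_WRITE_ALLOW_READ)
    return TL_WRITE_ALLOW_WRITE;
  return lt;
}

/* Step 5 rule: read modes map to F_RDLCK, TL_WRITE_ALLOW_WRITE and
   higher map to F_WRLCK. */
static int external_lock_mode(enum thr_lock_type lt)
{
  return lt >= TL_WRITE_ALLOW_WRITE ? F_WRLCK : F_RDLCK;
}
```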
Description of how various handlers use external_lock
-----------------------------------------------------
Most handlers support neither transactions nor external locking. For those handlers external_lock is a no-op (it simply returns 0). The only handler supporting external locks is MyISAM, so this handler actually does some external lock processing. All the transactional handlers (InnoDB, BDB and NDB) also take advantage of the fact that external_lock is called at the start of every statement (except when tables are already locked by LOCK TABLES or we are in a stored procedure, in which case start_stmt does the same job).

BDB checks whether a transaction object exists, and creates one if not. external_lock is called at the start of processing with either F_RDLCK or F_WRLCK, and it is also called at the end of processing with F_UNLCK.

1) Begin processing: If the transaction has not been started yet (no tables previously externally locked and no previous statement in the transaction), a transaction is started (either a local BDB transaction or a transaction known at a higher level).

2) End processing: If the last table is closed, statement-level transactions are used and there are no updates in the transaction, the transaction is committed.

InnoDB handles things very similarly to BDB, though in a more complicated manner, since InnoDB has many more features, such as consistent reads and more checks for optimised lock processing. NDB also handles things similarly to BDB and InnoDB, except that it additionally does some work to ensure that its local metadata cache is consistent. MyISAM and MyISAM Merge tables lock the database used by the table. The partition handler simply calls external_lock on all underlying table handlers.

Step 6 (in thr_multi_lock): This is where the actual lock processing takes place.
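Before following Step 6 into thr_multi_lock, the begin/end bookkeeping described for BDB above can be sketched as follows (a hypothetical model, not BDB's actual code; the commits counter exists only for illustration):

```c
/* Hypothetical sketch of external_lock bookkeeping: start a
   transaction on the first external lock, commit on the last unlock
   when the statement-level transaction made no updates. */
struct txn_state {
  int lock_count;   /* externally locked tables        */
  int txn_open;     /* transaction started?            */
  int updated;      /* any updates in the transaction? */
  int commits;      /* commits performed (illustration) */
};

static void external_lock_sketch(struct txn_state *s, int f_unlck)
{
  if (!f_unlck)                     /* F_RDLCK / F_WRLCK: begin  */
  {
    if (s->lock_count++ == 0 && !s->txn_open)
      s->txn_open = 1;              /* start local transaction   */
  }
  else                              /* F_UNLCK: end              */
  {
    if (--s->lock_count == 0 && s->txn_open && !s->updated)
    {
      s->txn_open = 0;              /* commit stmt-level txn     */
      s->commits++;
    }
  }
}
```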
A global lock was used around the table cache in open_table, but here is where the actual tables get locked, using the lock objects received from the store_lock calls.

The first step here is important for deadlock-free locking: the locks on the tables are processed in a predefined order. This order is obtained by first sorting on the address of the lock object in memory, and second by placing locks of higher contention levels on the same table earlier in the sequence. The second rule is needed since the same table can appear in the same query with both a read lock and a write lock. If write locking weren't done before read locking, there would be a deadlock possibility: another query could come in and also read-lock the table, whereby the current query could not proceed.

Step 7 (in thr_lock): For each table in the query a lock is acquired. First the mutex on the lock object is acquired, to ensure that only one query at a time attempts to lock the table. This mutex is per table and thus not a global lock, so it should be fine from an SMP point of view.

1) Handling of Read Locks

First we check for write locks.

- Write lock encountered
If our thread is the lock owner, OR
the write lock type is TL_WRITE_ALLOW_WRITE or TL_WRITE_DELAYED, OR
the write lock type is TL_WRITE_CONCURRENT_INSERT or TL_WRITE_ALLOW_READ AND our lock mode is NOT TL_READ_NO_INSERT,
THEN the lock is granted and is put last in the queue of current read requests, and some lock statistics are updated.
If no read lock is granted and the write lock type is TL_WRITE_ONLY, the lock request is aborted and not put into any queue. Thus if the lock is TL_WRITE_LOW_PRIORITY, TL_WRITE or TL_WRITE_ONLY, read locks are only allowed from the same thread. For TL_WRITE_DELAYED and TL_WRITE_ALLOW_WRITE, read locks are allowed for all threads.
- No write lock encountered
If no write locks are waiting, OR
the waiting write locks are not of type TL_WRITE or TL_WRITE_ONLY, OR
our lock type is TL_READ_HIGH_PRIORITY, OR
our thread already owns a read lock on the table,
THEN the same handling applies to put it into the current read queue (the code is duplicated, with only a debug comment differing).
If none of these attempts to schedule the read immediately were successful, we call wait_for_lock with the read wait queue as input.

2) Handling of Delayed Write Locks

If there is already an ongoing TL_WRITE_ONLY, THEN abort the lock request.
If no write locks are currently being processed, OR write locks exist but they are not TL_WRITE_ALLOW_READ, AND
no TL_WRITE_ALLOW_READ requests are in the write wait queue, AND
there is either a read lock or a write lock currently being processed,
THEN insert the request into the write wait queue, using the thread variable suspend as the condition for the pthread_cond_wait call. After this we return without waiting for the lock, since it is a WRITE DELAYED request.

3) Handling of TL_WRITE_CONCURRENT_INSERT

This lock type is used by INSERTs with the special flag SQL_CONCURRENT, allowing inserts to append data at the end of the file even when the table is read locked. It is normally upgraded to TL_WRITE, except:
1) when an option has been used to set it to TL_WRITE_LOW_PRIORITY instead, or
2) when there is a check_status method, in which case the TL_WRITE_CONCURRENT_INSERT flag is retained.
The handler can provide this method to allow or disallow the feature at a specific point in time; if no method is provided, the feature is always prohibited. (Check how this integrates with partitioning.)

4) Finally, for all write locks except the WRITE DELAYED requests that entered the write wait queue above, the following logic is processed, with the lock type as it was or as it has been upgraded to:

If there is a write lock request ongoing, THEN if TL_WRITE_ONLY is ongoing the lock request is aborted.
Otherwise, if our thread is the owner of the write lock, OR
our lock type is TL_WRITE_ALLOW_WRITE AND there is no write wait queue AND the ongoing write lock type is TL_WRITE_ALLOW_WRITE,
THEN we insert ourselves into the write queue, update some statistics, call get_status (if it exists) and return to continue processing.
ELSE we put ourselves into the write lock wait queue by calling wait_for_lock.

If there is no write lock request ongoing:
If no write lock requests are waiting (thus neither a write lock queue nor a write lock wait queue), THEN first, if TL_WRITE_CONCURRENT_INSERT was retained, call check_status to see whether it really is to be retained (this calls the check_status method, which in MyISAM is mi_check_status) or whether it is to be upgraded as explained above.
If there is no read lock queue, OR
our lock type is not TL_WRITE_LOW_PRIORITY, TL_WRITE or TL_WRITE_ONLY AND
(our lock type is not TL_WRITE_CONCURRENT_INSERT or TL_WRITE_ALLOW_WRITE, OR there are no ongoing read requests of type TL_READ_NO_INSERT),
THEN insert the request into the write queue, update some statistics, call get_status and return. (Thus TL_WRITE_ALLOW_READ and TL_WRITE_DELAYED are always allowed together with a read queue, but TL_WRITE_ALLOW_WRITE and TL_WRITE_CONCURRENT_INSERT are not allowed if there is a TL_READ_NO_INSERT currently ongoing.)
ELSE wait in the write lock wait queue by calling wait_for_lock.

7.1) Handling of wait_for_lock calls from above

First the request is inserted into the lock wait queue (read_wait or write_wait). Then the current mutex and condition are set up, using the mutex on the table and the condition in the thread-specific part. The condition is also set on the THR_LOCK_DATA object (which is normally part of the handler object). Finally one enters a loop which checks for abort on a thread-specific variable (declared as volatile). pthread_cond_wait is called in the loop, and there are two ways to leave it: either the cond variable on the handler object was changed, or the abort variable on the thread object was set.
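The wait loop just described might look like this sketch (hypothetical structure and names, not thr_lock.c itself; the real code also handles timeouts and queue removal):

```c
#include <pthread.h>

/* Hypothetical sketch of the wait_for_lock loop: wait on the condition
   in a loop so spurious wake-ups are ignored; leave only when the lock
   was granted (cond zeroed by the waker) or an abort was requested. */
struct lock_wait {
  pthread_mutex_t *mutex;        /* per-table lock mutex           */
  pthread_cond_t *cond;          /* non-NULL while still in queue  */
  pthread_cond_t *wait_cond;     /* thread-specific condition      */
  volatile int abort_requested;  /* thread-specific abort flag     */
};

/* Returns 0 when the lock was granted, -1 on abort.  Assumes the
   table mutex is held on entry and released by the caller. */
static int wait_for_lock_sketch(struct lock_wait *w)
{
  while (w->cond && !w->abort_requested)
    pthread_cond_wait(w->wait_cond, w->mutex);
  /* cond still set: we were aborted while still queued and the
     caller must remove the request from the wait queue. */
  return w->cond ? -1 : 0;
}
```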
One reason for the loop logic is to ensure that spurious wake-ups don't cause any harm, so that only true wake-ups are serviced. If the condition variable is still set, this indicates an abort; the request is still in the queue and should be removed from it. If no condition is set and the lock type is TL_UNLOCK, the request has already been dequeued but the result is still to abort the request. The normal case is to return with the condition variable reference zeroed. Then some statistics are updated, get_status is called and the table mutex is unlocked. Finally the current mutex and current condition are removed by zeroing them (done under the protection of a thread-specific mutex).

Corresponding unlock logic
--------------------------
In the same manner as tables are locked by mysql_lock_tables, there are calls to handle the unlocking of tables; mysql_unlock_tables handles this. It first calls thr_multi_unlock, which releases the table locks held by the query. It then calls unlock_external, which calls external_lock with F_UNLCK on the same handlers that got the first external_lock call. What the handlers do with external_lock and F_UNLCK was described above. Handlers are not involved in thr_multi_unlock, so this is pure MySQL Server logic. thr_multi_unlock calls thr_unlock on all tables where the lock is not TL_UNLOCK.

Handling of thr_unlock
----------------------
Step 1: Acquire the mutex on the table object.
Step 2: Remove the lock from the read or write queue (with some special logic to handle TL_WRITE_DELAYED as well).
Step 3: Call update_status if an update_status method exists and the released lock type is at level TL_WRITE_CONCURRENT_INSERT, TL_WRITE_DELAYED, TL_WRITE_LOW_PRIORITY, TL_WRITE or TL_WRITE_ONLY.
Step 4: Decrement read_no_write_count if the lock type was TL_READ_NO_INSERT.
Step 5: Set the lock type to TL_UNLOCK on the THR_LOCK_DATA object.
Step 6: If there are still active write locks, we don't do anything further.
If there are no read locks left either, THEN:
If waiting write lock requests exist, AND
the waiting write lock is not TL_WRITE_LOW_PRIORITY, OR
the waiting write lock is TL_WRITE_LOW_PRIORITY but no waiting read lock exists, OR
the waiting write lock is TL_WRITE_LOW_PRIORITY and waiting read locks exist but none of them are of type TL_READ_HIGH_PRIORITY or TL_READ_NO_INSERT,
THEN check whether we have had a long run of write locks. A long run means that the waiting read requests have been livelocked, so they are allowed to execute now to stop the livelock from advancing further. The waiting read requests are started by calling free_all_read_locks.

The next step is to start moving lock requests from the write lock wait queue to the write lock queue, one item at a time. The normal upgrade of TL_WRITE_CONCURRENT_INSERT applies here too, except that it is upgraded immediately to TL_WRITE rather than being given a chance to become TL_WRITE_LOW_PRIORITY. Next the wake-up logic is used: the condition is set to zero and the condition object is signalled. The loop is broken when a lock is found whose lock type is not TL_WRITE_ALLOW_WRITE, or when the next lock in the wait queue (if there is one) doesn't have lock type TL_WRITE_ALLOW_WRITE. For this loop not to crash, there must be an object last in the queue, not a real lock object, whose lock type is not TL_WRITE_ALLOW_WRITE.

If the loop was exited with objects remaining in the write wait queue, and the lead object has lock type TL_WRITE_LOW_PRIORITY, TL_WRITE or TL_WRITE_ONLY, then no more requests are started; the mutex is unlocked and we return from the method.
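The transfer loop just described can be sketched as follows (hypothetical names and a simplified two-value lock type; the real code signals each waiter's condition and handles the upgrade as it goes):

```c
/* Hypothetical sketch of the wake-up loop: start waiters from the
   write wait queue one at a time, stopping after a waiter that is not
   TL_WRITE_ALLOW_WRITE, or before one whose successor is not. */
enum wait_lock_type { WR_ALLOW_WRITE, WR_OTHER, WR_END };

/* waiters: array terminated by WR_END, standing in for the "object
   last in the queue which is not a real lock object" mentioned above.
   Returns how many waiters were started. */
static int start_write_waiters(const enum wait_lock_type *waiters)
{
  int started = 0;
  for (const enum wait_lock_type *p = waiters; *p != WR_END; p++)
  {
    /* wake this waiter: zero its condition and signal it (elided) */
    started++;
    if (*p != WR_ALLOW_WRITE || p[1] != WR_ALLOW_WRITE)
      break;  /* non-ALLOW_WRITE found, or next one is not ALLOW_WRITE */
  }
  return started;
}
```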
Now, if there were no write locks and no read locks, and we have started at least one write request, then we release all waiting read locks by calling free_all_read_locks, with a parameter set to true if there are waiting write locks at level TL_WRITE_CONCURRENT_INSERT or TL_WRITE_ALLOW_WRITE. If there are no waiting read locks, we release the mutex and exit.

The next step handles the case where there are no more ongoing write requests but there are ongoing read requests.
If there are waiting write requests, AND
the waiting lock request is of lock type TL_WRITE_ALLOW_WRITE, TL_WRITE_ALLOW_READ, TL_WRITE_CONCURRENT_INSERT or TL_WRITE_DELAYED, AND
there are no ongoing requests with TL_READ_NO_INSERT set OR the waiting lock type is TL_WRITE_ALLOW_READ or TL_WRITE_DELAYED,
THEN check whether the lock is to be upgraded; if upgraded, then start all waiting read requests but no more write requests. If lock not upgraded

8) Handling of lock tables for prelocked queries