MySQL Bugs: #60563: ALTER TABLE hangs waiting for metadata lock when NO other sql stmt is running

Bug #60563	ALTER TABLE hangs waiting for metadata lock when NO other sql stmt is running
Submitted:	21 Mar 2011 10:56	Modified:	8 Jul 2011 19:27
Reporter:	Boris Livshutz	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Locking	Severity:	S2 (Serious)
Version:	5.5.10	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
our application is now constantly freezing up and when i look at show processlist,  i see only one active statement,  and ALTER TABLE ...REORGANIZE PARTITION ...   

the stmt is in a state of "Waiting for table metadata lock"   

It never recovers from this and just hangs.   the only solution is to kill this stmt.

there are NO other statements running on the database.

Is there some lock that is leaking?  not getting closed in previous statements that have ended?

We don't have any sql testcase as our application sends out many statements.  But this is happening very often and on many different tables.  We do frequent repartitioning as we divide the table into many partitions based on time.

How to repeat:
we don't see any pattern, it just occurs every 20 minutes or so now.  it ran fine for a month on 5.5.8, but we have made no changes and it started happening.

for each of these tables where there is trouble we mostly do
a) select ... from tableA
b) bulk load into tableA
c) every 10 minutes, alter table tableA reorganize partition  PARTMAX INTO PARTITION PART2167841 VALUES....
followed by:
alter table tableA drop partition ....

both the reoranize and drop partition hang often now...

Thank you for the report.

but version 5.5.8 is old. Please try with current version 5.5.10 and if problem still exists send us output of SHOW CREATE TABLE tableA, SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS\G Last 2 outputs taken during time when hang occurs.

I tried to change  lock_wait_timeout to a small value but that variable is not changing.  Is this something that has been fixed?   

set global lock_wait_timeout=400;

mysql> show variables like 'lock_w%';
+-------------------+----------+
| Variable_name     | Value    |
+-------------------+----------+
| lock_wait_timeout | 31536000 |
+-------------------+----------+

same thing when i try to put it in the my.cnf file.   

there is no error, but the value does not seem to get updated.

Thank you for the feedback.

Please read about differences of session and global variables and answer questions in my previous comment.

show processlist, show engine innodb status, and show create table

Attachment: show.log.zip (application/zip, text), 7.22 KiB.

I've upgraded to 5.5.10 and have the same issue.  I've uploaded the files as you requested.   Here are some observations which hopefully could help narrow this down.

** our application runs thousands of sql statements against mysql across up to 100 sessions
** when i first start the application, the first 6 to 10 attempts to ALTER TABLE are successful.  but at some point soon after that, the alter table hangs waiting on the metadata lock.   
** once this happens, even if i kill that thread manually, or it times out (when i set a low lock_timeout) it will always freeze each subsequent time that alter table is issued on that table.
** once it is forever waiting on the metadata lock, i can stop the application, which drops all connections.  then if i restart the application, it will once again work for a short time.  So restarting the app, seems to clear the problem for a short period.
** Also, when the ALTER freezes, and i stop the application.  i am able to successfully issue the alter table stmt manually in the mysql client.

because this is a ALTER TABLE....REORGANIZE PARTITION....   when it is running and starts to hang, I've watched the file system..  The files involved in those partitions seem to be modified correctly, and the file names end in "..TMP" which is what i normally observe when repartitioning.  But it seems right at the very end when the alter table stmt is finishing, that is when it stops waiting for the metadata lock.   i think its one of the final steps, maybe trying to update the schema's metadata?.  

I can see this especially when the alter table needs to run for a while just to move data from the MAX partition to the new partition, during that time the work is being done and the file sizes are changing, so the statement runs ok, its only once the files seem to be correctly allocated and data is done being moved, that the freeze up happens.

I'm having the exact same problem.
We run many DML statements and some DDL in multiple sessions. For some time it works but suddenly some tables are locked and my sessions with DDL/DML on them hang with "waiting for table metadata lock".
Sometimes they remain blocked even if I kill the application and all the sessions. I tried flushing tables and UNLOCK tables with no result. I wait for like some hours and suddenly the tables unlock.
Can't drop database either.
Please help!
Maybe it's similar to bug #57006

Keep in mind that metadata locks are kept for the duration of a transaction. Take a look at http://dev.mysql.com/doc/refman/5.5/en/metadata-locking.html

Davi,   The problem is that there is no other transaction running, so who is holding the metadata lock?  Some old transaction must have not released a lock.   When doing show processlist or in GDB trace,  no transaction is seen, so no one can be holding the metadata lock, yet our DDL is blocked waiting on this lock.   

I have seen bugs fixed in previous versions of 5.5.x where locks were being leaked, so I'm guessing this is another such case.

This is a very serious situation as when it happens it pretty much makes the whole system unusable and brings down production.

Are you sure there are no other transactions running? The fact that closing all open connections makes the problem go away might indicate otherwise.

I am sure there's no other transaction running and still the tables are locked.
I go with Boris that a transaction is not releasing the lock when ending or maybe deadlocking tables.
Version 5.1.x doesn't show this problem. It's really hard to reproduce for us.
In fact, in our development environment in Windows we could never achieve but in Linux production and development we do.
Really strange issue...
Any workaround at least?

5.1 does not have transactional metadata locking.

Davi,  if you look at the files i attached on March 22nd,  i uploaded the results from "show processlist" which clearly showed that NO other transaction was running.   

Is any developer planning to look into this critical bug?

Active transactions do not show up in any specific way through SHOW PROCESSLIST -- it only lists open sessions. But, In your case, SHOW ENGINE INNODB STATUS lists a active transaction that might be using the table.

I mentioned the session disconnection because when a session is disconnected, any open transactions are rolled back and metadata locks are released as if a ROLLBACK had been sent by the user. In case metadata locks are being leaked, its quite likely that a disconnect wouldn't solve it as it has a very similar effect (in relation to metadata locks) as if a COMMIT or ROLLBACK had been issued.

In "show engine innodb status" i only see one other stmt running and it is a query on a totally unrelated table.   Or do you see another transaction that i am missing?

Also, the transaction that is hung "Waiting for metadata lock" stays hung for hours, so even if some coincidental query may run, we don't see any record of a long running stmt.   

Since it seems no one will be able to provide a easy sequence of steps, since it seems to take a complex app (with probably some long and rapid sequence that leaks a lock), is there any other way to make progress on this bug?

> In "show engine innodb status" i only see one other stmt running and it
> is a query on a totally unrelated table.   Or do you see another transaction
> that i am missing?

There is a transaction that has been active for two and a half hours. Thread id 561.

Also, just to make it clear, if a statement inside a transaction ever access a table, a metadata lock is kept on the table until the transaction is either committed or rolled back. For example:

session 1>
  BEGIN;
  SELECT * FROM t1;

session 2>
  ALTER TABLE t1 ADD PARTITION ...;

Even after the SELECT has been executed, the ALTER TABLE won't succeed until session 1 either commits or rollbacks the transaction.

> Also, the transaction that is hung "Waiting for metadata lock" stays hung for
> hours, so even if some coincidental query may run, we don't see any record
> of a long running stmt. 

The statement might have been run inside a transaction that hasn't been properly ended (commit/rollback).
 
> Since it seems no one will be able to provide a easy sequence of steps, since
> it seems to take a complex app (with probably some long and rapid sequence
> that leaks a lock), is there any other way to make progress on this bug?

Enabling the the general log might shed some light on the sequence of steps. It will also show whether its a application problem or a server one.

Boris,

thank you for the feedback.

Please also send us output of SELECT * FROM performance_schema.MUTEX_INSTANCES WHERE LOCKED_BY_THREAD_ID IS NOT NULL and SELECT * FROM performance_schema.RWLOCK_INSTANCES WHERE WRITE_LOCKED_BY_THREAD_ID IS NOT NULL OR READ_LOCKED_BY_COUNT IS NOT NULL

And just for refernece output from SHOW ENGINE INNODB STATUS sent before:

---TRANSACTION 83E32995, not started, process no 17518, OS thread id 1546074432
mysql tables in use 124, locked 256 
MySQL thread id 685, query id 49788795 localhost 127.0.0.1 controller Waiting for table metadata lock
/*  */ ALTER TABLE metricdata_min REORGANIZE PARTITION PARTMAX INTO (   PARTITION PART2167973 VALUES LESS THAN (21679730),   PARTITION PARTMAX VALUES LESS THAN MAXVALUE)
...
---TRANSACTION 83EEB0BA, not started, process no 17518, OS thread id 1528768832
MySQL thread id 551, query id 52268614 localhost 127.0.0.1 controller
---TRANSACTION 83EEB06B, not started, process no 17518, OS thread id 1520515392
mysql tables in use 1, locked 0
MySQL thread id 550, query id 52268644 localhost 127.0.0.1 controller
/*  */ select MAX(applicatio0_.set_excluded_timestamp) as col_0_0_ from application_diagnostic_data applicatio0_ where (applicatio0_.entity_type='APPLICATION_COMPONENT' and applicatio0_.entity_id=100 or applicatio0_.entity_type='APPLICATION_COMPONENT_NODE' and applicatio0_.entity_id=65887) and applicatio0_.excluded=1 limit 2

There is no information about locked tables in other transactions.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

show processlist, show engine innodb status, and requested selects from performance schema

Attachment: bug-60563.txt (text/plain), 19.42 KiB.

Above attached file shows the following:
* Version 5.5.13
* Show processlist with process stuck waiting for metadata lock
* show engine innodb status showing 2 stuck transactions
* Requested selects from performance_schema both returing empty sets

I can cause this problem reliably using the unit test suite for the application I am working on.  It works fine on 5.1.X

Let me know if you need anymore information to continue debugging

Justin,

thank you for the feedback. In your case I see transactions which hold locks. So it is possible what in your case "Waiting for table metadata lock" is correct behavior.

Please send us full CREATE TABLE query (it can show output of SHOW FULL PROCESSLIST).

If you could create isolated test case it would be good too.

Here is the full create statement:
CREATE TABLE IF NOT EXISTS E_11_COMMENT_TEXT (time BIGINT(20) UNSIGNED NOT NULL, value TEXT DEFAULT NULL, status_code TINYINT(3) UNSIGNED NOT NULL DEFAULT 0, unique_key INT(10) NOT NULL, PRIMARY KEY(time, unique_key)) ENGINE=InnoDB DEFAULT CHARSET=latin1

I have isolated the cause down to a single test case, but unfortunately it involves lots of proprietary code, so I can't share it.
Basically it is an XA transaction spanning the config and storage databases that happens something like this:
1. Read data from config database
2. Run the above create statement on storage database
3. Insert data into created table
4. Write data to config database
5. Read data from config and storage databases
5. Repeat steps 1-4.

It is on the repeat that the create table statement locks up.  I originally thought that it may be a problem with XA, so I disabled XA and put both data sources within their own local transactions.  I also moved the config database to a different server so that only the storage requests would make to the server in question. Even after all of these changes it still causes a lock.

I just noticed that our test suite opens a separate connection to the database in order to create that table and the first statement holds a lock on that table preventing the second from running the create.  I'm assuming this would be expected behavior. 

Separating the first pass through steps 1-4 mentioned about from the second pass so that they run in different transactions solves this problem.

Well sorry for resurrecting an old bug when it was really my fault.  On the bright side I learned something about the inner workings of my test suite.

Justin,

thank you for the feedback. Good to know this is not MySQL bug. Closing as "Can't repeat" again as earlier cases were uncertain.