MySQL Bugs: #38661: all threads hang in "opening tables" or "waiting for table" and cpu is at 100%

Bug #38661	all threads hang in "opening tables" or "waiting for table" and cpu is at 100%
Submitted:	8 Aug 2008 8:59	Modified:	7 Mar 2010 18:21
Reporter:	Shane Bester (Platinum Quality Contributor)
Status:	Closed
Category:	MySQL Server: Locking	Severity:	S2 (Serious)
Version:	6.0.7-debug,5.4	OS:	Linux
Assigned to:	Magne Mæhre	CPU Architecture:	Any

Description:
All threads stay in "Opening tables" or "Waiting for table" and CPU usage is at 100% when FLUSH TABLES is run at a certain moment.

The State of the threads changes between "Opening tables" and "Waiting for table", but the queries never seem to finish.

processlist snippet:

+------+-------------------+-----------------------
| Time | State             | Info
+------+-------------------+-----------------------
|    0 | NULL              | show processlist
|  395 | Opening tables    | insert into t1 values
|  395 | Opening tables    | update t1 set d='Shoah
|  395 | Opening tables    | insert into t1 values
|  395 | Opening tables    | insert into t1 values
|  395 | Opening tables    | insert into t1 values
|  395 | Opening tables    | select * from t1 where
|  395 | Waiting for table | insert into t1 values
|  395 | Opening tables    | insert into t1 values
|  394 | Opening tables    | update t1 set d='befit
|  395 | Opening tables    | update t1 set d='Frost
|  395 | Opening tables    | update t1 set d='drawl
|  395 | Opening tables    | update t1 set d='banni
|  395 | Flushing tables   | flush tables

3618 sbester   30  15  161m  34m 5028 S 99.8  7.5  11:10.92 mysqld 

How to repeat:
will upload a testcase shortly.

a bunch of info from the running binary/gdb

Attachment: bug38661_thread_info.txt (text/plain), 60.28 KiB.

Testcase. I could only repeat this on 6.0.7, on a Linux server built with --with-libevent.

Attachment: bug38661.c (text/plain), 7.86 KiB.

setting as verified.  let me know if it's not repeatable.

I just rebuilt the exact same server without --with-libevent and the problem still occurred, so it's not related to that. Not sure why my 6.0.5 and Windows 6.0.7 don't see this problem. Maybe it is Linux- or build-specific?

Bug is repeatable on Solaris10/x86, _without_ libevent

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/59668

2752 Magne Mahre	2008-11-24
      Bug #38661 all threads hang in "opening tables" or 
         "waiting for table" and cpu is at 100%
      
      A race between open_tables and a "flush table" operation
      resulted in neither being able to complete.  open_tables
      was not able to open the table and initiated a recovery.
      The conditions for completing the recovery were too
      strict and couldn't be achieved while the flush was
      running.  The solution was to loosen the requirement
      that said that a share couldn't exist without a table,
      since this is actually a valid condition in certain
      cases.
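One way to picture the loosened requirement is as a predicate that decides whether an outdated share still has to be waited for. The following is a minimal sketch, not the server's code: `Share`, `used_tables`, `must_wait_strict` and `must_wait_loosened` are hypothetical names, and it assumes a share counts as old when its version is lower than the current refresh version.

```python
from dataclasses import dataclass


@dataclass
class Share:
    """Hypothetical stand-in for a cached table share (not the server's struct)."""
    version: int      # refresh version the share was created under
    used_tables: int  # TABLE instances currently in use for this share


def must_wait_strict(share: Share, refresh_version: int) -> bool:
    # Assumed old rule: a share without a table is treated as impossible,
    # so an outdated share only counts while some TABLE still uses it.
    return share.version < refresh_version and share.used_tables > 0


def must_wait_loosened(share: Share, refresh_version: int) -> bool:
    # Assumed loosened rule: an outdated share counts even with no TABLE.
    return share.version < refresh_version


if __name__ == "__main__":
    orphan = Share(version=1, used_tables=0)  # old share, no table attached
    print("strict:", must_wait_strict(orphan, refresh_version=2))
    print("loosened:", must_wait_loosened(orphan, refresh_version=2))
```

Under these assumptions the two rules only disagree for an old share with no table attached, which is the case the commit message calls valid.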

I found a very similar bug for 5.x, bug #41114. Not sure if that is a duplicate.

Hi Shane!

I doubt that. The problem described in this bug report is specific for 6.* versions (at least as we understand it now).

So IMO it is better to keep those two bugs separate.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/60743

2766 Magne Mahre	2008-12-05
      Bug #38661 'all threads hang in "opening tables" or "waiting for table"
                  and cpu is at 100%'
      
      Concurrent execution of a FLUSH TABLES statement and at least two
      statements using the same table could lead to a live-lock that caused
      all three connections to stall and hog 100% of the CPU.
      
      tdc_wait_for_old_versions() wrongly assumed that there cannot be a share
      with an old version and no used TABLE instances, and thus failed to
      wait in the situation where such an old share was cached in the MDL
      subsystem thanks to a still-active metadata lock on the table. So two or
      more connections simultaneously executing statements that involve the
      table being flushed could prevent each other from waiting in this
      function by keeping a shared metadata lock on the table constantly
      active (i.e. one of the statements took or held this lock while the
      other statements were calling tdc_wait_for_old_versions()). They thus
      forced each other to loop infinitely in the open_tables() -
      close_thread_tables_for_reopen() - tdc_wait_for_old_versions() cycle,
      causing CPU hogging.
      
      This patch fixes this problem by removing this false assumption from
      tdc_wait_for_old_versions().
      
      Note that the problem is specific to server versions >= 6.0.
      
      No test case is submitted for this bug, as the test infrastructure
      hasn't got the necessary primitives to test the behaviour.  The
      manifestation is that throughput will decrease to a low level
      (possibly 0) after some time, and stay at that level.  Several
      transactions will not complete. 
      
      Manual testing can be done by running the code submitted by Shane 
      Bester attached to the bug report.  If the bug persists, the 
      transaction throughput will almost immediately drop to near zero 
      (shown as the transaction count output from the test program staying 
      at a nearly constant value, instead of increasing rapidly).
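The manual check described above (a transaction count that stops increasing) can be automated roughly as follows. This is only a sketch: the function name `throughput_stalled`, the `window` and `tolerance` parameters and the sample numbers are all made up for illustration, and it assumes the test program's cumulative transaction count has been sampled at fixed intervals.

```python
from typing import Sequence


def throughput_stalled(counts: Sequence[int], window: int = 5, tolerance: int = 2) -> bool:
    """Report whether the cumulative count barely moved over the last `window` intervals.

    `counts` holds cumulative transaction counts sampled at a fixed interval.
    """
    if len(counts) <= window:
        return False  # not enough samples to judge yet
    progress = counts[-1] - counts[-(window + 1)]
    return progress <= tolerance


if __name__ == "__main__":
    # Illustrative numbers only, not output from the attached test program.
    healthy = [0, 950, 1900, 2875, 3800, 4760, 5700]
    stuck = [0, 950, 1900, 1901, 1901, 1901, 1902, 1902]
    print("healthy stalled?", throughput_stalled(healthy))
    print("stuck stalled?", throughput_stalled(stuck))
```

A small non-zero `tolerance` is used because the commit message speaks of a "close to constant" value rather than a strictly constant one.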

Pushed into 6.0.9-alpha  (revid:magne.mahre@sun.com-20081205141333-p37s1bj9xubkqbgd) (version source revid:magne.mahre@sun.com-20081205141333-p37s1bj9xubkqbgd) (pib:5)

Pushed into 5.4.4

Noted in 5.4.4 changelog.

Concurrent connections executing FLUSH TABLES and at least two
statements using the same table could cause all three connections to 
stall with 100% CPU utilization.
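The stall in this changelog entry can be pictured with a toy scheduler. This is a sketch under stated assumptions, not the server's logic: `simulate`, `holders`, `waiting` and the fixed round-robin schedule are hypothetical, chosen to reproduce one unlucky interleaving. The FLUSH TABLES connection is not stepped; it is represented only by the share starting out as old. Each turn, a connection releases its shared metadata lock, then either blocks until the old share is gone (the fixed behaviour) or returns at once and retakes the lock (the behaviour attributed to the bug).

```python
def simulate(wait_blocks_on_orphan_share: bool, max_turns: int = 1000) -> int:
    """Return the turn on which both statements finish, or -1 if they never do."""
    share_is_old = True      # FLUSH TABLES has marked the cached share as old
    holders = {"A", "B"}     # both statements hold a shared metadata lock
    waiting = set()          # connections blocked in the wait step
    done = set()             # connections that reopened a fresh share

    for turn in range(max_turns):
        conn = "AB"[turn % 2]            # fixed round-robin schedule (assumption)
        if conn in done:
            continue
        if conn in waiting:
            if share_is_old:
                continue                 # still blocked, burns no CPU
            waiting.remove(conn)
            done.add(conn)               # woke up and reopened the table
        else:
            holders.discard(conn)        # close tables for reopen: drop the lock
            if not holders:
                share_is_old = False     # last reference gone, old share dropped
            if not share_is_old:
                done.add(conn)           # reopen succeeds against a fresh share
            elif wait_blocks_on_orphan_share:
                waiting.add(conn)        # fixed: really wait for the old share
            else:
                holders.add(conn)        # buggy: no wait, retry and relock at once
        if len(done) == 2:
            return turn + 1
    return -1


if __name__ == "__main__":
    print("without waiting:", simulate(wait_blocks_on_orphan_share=False))
    print("with waiting:", simulate(wait_blocks_on_orphan_share=True))
```

In this model, when the wait step never blocks, one of the two connections always holds the lock, so the old share is never dropped and the turn budget runs out. When the wait step blocks, the second connection's release drops the share and both finish.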

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Ignore previous comment about 5.4.2.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/93659

3031 Konstantin Osipov	2009-12-11
      Backport of:
      -----------------------------------------------------------
      2630.28.28 Magne Mahre  2008-12-05
      Bug #38661 'all threads hang in "opening tables" or "waiting for table"
                  and cpu is at 100%'
                            

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100216101445-2ofzkh48aq2e0e8o) (version source revid:kostja@sun.com-20091211154405-c9yhiewr9o5d20rq) (merge vers: 6.0.14-alpha) (pib:16)

Pushed into mysql-next-mr (revid:alik@sun.com-20100216101208-33qkfwdr0tep3pf2) (version source revid:kostja@sun.com-20091211111859-lse5qbt8k1ar9q2p) (pib:16)

Closing this bug as it is not repeatable in publicly available trees (with versions < 6.0).

Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:vvaintroub@mysql.com-20100216221947-luyhph0txl2c5tc8) (merge vers: 5.5.99-m3) (pib:16)

No changelog entry needed.

I am having the same problem on 5.1.31-1ubuntu2, and it happens only from time to time.