MySQL Bugs: #52356: query_cache_debug fails on Linux

Bug #52356	query_cache_debug fails on Linux
Submitted:	25 Mar 2010 9:16	Modified:	7 May 2010 19:51
Reporter:	Alexander Nozdrin	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Locking	Severity:	S1 (Critical)
Version:	Celosia (M3), Dahlia (M4)	OS:	Linux
Assigned to:	Alexander Nozdrin	CPU Architecture:	Any
Tags:	experimental, pb2, test failure

Description:
First try to run query_cache_debug fails.
This happens only on Linux x86_64 gcov.
It's not a sporadic failure, as it was supposed to be initially.

Symptoms:

main.query_cache [ pass ] 17434
main.query_cache_debug [ fail ]
        Test ended at 2010-03-24 17:59:17

CURRENT_TEST: main.query_cache_debug
mysqltest: At line 230: query 'DELETE FROM t1' failed: 1689: Wait on a lock was aborted due to a pending exclusive lock

The result from queries just before the failure was:
< snip >
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_insert' ORDER BY id ASC LIMIT 1 INTO @thread_id;
KILL QUERY @thread_id;
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_insert' ORDER BY id DESC LIMIT 1 INTO @thread_id;
KILL QUERY @thread_id;
**
** Finally signal the DELETE statement on THD1 one last time.
** The stmt will complete the query cache invalidation and return 
** cache status to NO_FLUSH_IN_PROGRESS. On the status change
** One signal will be sent to the thread group waiting for executing
** invalidations and a broadcast signal will be sent to the thread 
** group holding result set writers.
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_invalidate2' LIMIT 1 INTO @flush_thread_id;
KILL QUERY @flush_thread_id;
**
*************************************************************************
** No tables should be locked
=================================== Connection thd2
a
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

More results from queries before failure can be found in /export/home2/pb2/build/sb_1-None-1269418682.66/mysql-trunk-gcov/mysql-test/var/log/query_cache_debug.log
Note: net_clear() skipped 11 bytes from file: socket (8)

 - saving '/export/home2/pb2/build/sb_1-None-1269418682.66/mysql-trunk-gcov/mysql-test/var/log/main.query_cache_debug/' to '/export/home2/pb2/build/sb_1-None-1269418682.66/mysql-trunk-gcov/mysql-test/var/log/main.query_cache_debug/'

Retrying test, attempt(2/3)...

main.query_cache_debug                   [ retry-pass ]    923

Retrying test, attempt(3/3)...

main.query_cache_debug                   [ retry-pass ]    646

How to repeat:
Check out PB pages for mysql-trunk and mysql-next-mr.
That failure started to happen after push from bugfixing.

Fails for me with recent mysql-next-mr tree on 32-bit Ubuntu 8.04 also (built using  BUILD/compile-pentium-debug-max):

openxs@ubuntu:/home2/openxs/bzr/mysql-next-mr/mysql-test$ ./mtr query_cache query_cache_debug
Logging: ./mtr  query_cache query_cache_debug
100325 12:03:57 [Note] Buffered information: Performance schema disabled (reason: start parameters).

100325 12:03:57 [Note] Plugin 'FEDERATED' is disabled.
100325 12:03:57 [Note] Plugin 'ndbcluster' is disabled.
MySQL Version 5.6.99
Checking supported features...
 - using ndbcluster when necessary, mysqld supports it
 - SSL connections supported
 - binaries are debug compiled
Collecting tests...
vardir: /home2/openxs/bzr/mysql-next-mr/mysql-test/var
Checking leftover processes...
 - found old pid 14879 in 'mysqld.2.pid', killing it...
   process did not exist!
 - found old pid 14684 in 'mysqld.1.pid', killing it...
   process did not exist!
Removing old var directory...
Creating var directory '/home2/openxs/bzr/mysql-next-mr/mysql-test/var'...
Installing system database...
Using server port 60581

==============================================================================

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
main.query_cache                         [ pass ]  26630
main.query_cache_debug                   [ fail ]
        Test ended at 2010-03-25 12:04:33

CURRENT_TEST: main.query_cache_debug
mysqltest: At line 230: query 'DELETE FROM t1' failed: 1689: Wait on a lock was aborted due to a pending exclusive lock

The result from queries just before the failure was:
< snip >
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_insert' ORDER BY id ASC LIMIT 1 INTO @thread_id;
KILL QUERY @thread_id;
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_insert' ORDER BY id DESC LIMIT 1 INTO @thread_id;
KILL QUERY @thread_id;
**
** Finally signal the DELETE statement on THD1 one last time.
** The stmt will complete the query cache invalidation and return 
** cache status to NO_FLUSH_IN_PROGRESS. On the status change
** One signal will be sent to the thread group waiting for executing
** invalidations and a broadcast signal will be sent to the thread 
** group holding result set writers.
SELECT SQL_NO_CACHE id FROM information_schema.processlist WHERE state='wait_in_query_cache_invalidate2' LIMIT 1 INTO @flush_thread_id;
KILL QUERY @flush_thread_id;
**
*************************************************************************
** No tables should be locked
=================================== Connection thd2
a
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

More results from queries before failure can be found in /home2/openxs/bzr/mysql-next-mr/mysql-test/var/log/query_cache_debug.log

 - saving '/home2/openxs/bzr/mysql-next-mr/mysql-test/var/log/main.query_cache_debug/' to '/home2/openxs/bzr/mysql-next-mr/mysql-test/var/log/main.query_cache_debug/'

Only  2  of 3 completed.
mysql-test-run: *** ERROR: Not all tests completed
openxs@ubuntu:/home2/openxs/bzr/mysql-next-mr/mysql-test$ uname -a
Linux ubuntu 2.6.24-27-generic #1 SMP Fri Mar 12 01:10:31 UTC 2010 i686 GNU/Linux

Now it fails on other Linuxes too.
Making the test case experimental on Linux due to this bug.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/106073

3015 Alexander Nozdrin	2010-04-20
      Patch for Bug#52356: query_cache_debug fails on Linux.
      
      The problem was a race condition in the test case: there is a thread,
      that is doing DELETE FROM t1 in the background. There are two KILL QUERY
      of that DELETE statement. Finally, another thread is doing another DELETE
      FROM t1. The problem was that the test case didn't wait for the statement
      being killed to die happily and tried to execute another DELETE FROM t1,
      which obviously conflicted with the dying one.
      
      The fix is to wait for the dying statement to rest in peace before
      executing another DELETE FROM t1.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/106705

3015 Alexander Nozdrin	2010-04-27
      Patch for Bug#52356: query_cache_debug fails on Linux.
      
      There were two problems here:
        1. misleading error message
        2. abusing KILL QUERY in the test case
      
      1. The server reported "'DELETE FROM t1' failed: 1689: Wait on a lock was
      aborted due to a pending exclusive lock", while the proper error message
      should be "'DELETE FROM t1' failed: 1317: Query execution was interrupted".
      
      The problem is that the server has two different flags for
      signalling that a query is being killed: THD::killed and
      mysys_var::abort. The test case triggers a race: sometimes
      mysys_var::abort is set earlier than THD::killed. That leads 
      to the following situation:
      
        - thr_lock() checks mysys_var::abort and returns error status,
          since mysys_var::abort is set;
      
        - the caller (mysql_lock_tables()) gets an error from thr_lock(),
          but THD::killed is not set, so it decides that thr_lock() couldn't
          get a lock due to a pending exclusive lock.
      
      This is a known issue with the server and it's not going to be fixed soon.
      
      5.5 differs from 5.1 here as follows: when thr_lock() returns an error:
        - 5.1 continues trying thr_lock() until success;
        - 5.5 propagates the error
      
      The fix is to check both THD::killed and mysys_var::abort when
      analyzing error returned from thr_lock().
      
      2. The test case uses KILL QUERY is a highly concurent environment.
      
      The fix is to wait for the dying statement to rest in peace before
      executing another DELETE FROM t1.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/106865

3023 Alexander Nozdrin	2010-04-28
      Patch for Bug#52356: query_cache_debug fails on Linux.
      
      There were two problems here:
        1. misleading error message
        2. abusing KILL QUERY in the test case
      
      1. The server reported "'DELETE FROM t1' failed: 1689: Wait on a lock was
      aborted due to a pending exclusive lock", while the proper error message
      should be "'DELETE FROM t1' failed: 1317: Query execution was interrupted".
      
      The problem is that the server has two different flags for
      signalling that a query is being killed: THD::killed and
      mysys_var::abort. The test case triggers a race: sometimes
      mysys_var::abort is set earlier than THD::killed. That leads 
      to the following situation:
      
        - thr_lock() checks mysys_var::abort and returns error status,
          since mysys_var::abort is set;
      
        - the caller (mysql_lock_tables()) gets an error from thr_lock(),
          but THD::killed is not set, so it decides that thr_lock() couldn't
          get a lock due to a pending exclusive lock.
      
      This is a known issue with the server and it's not going to be fixed soon.
      
      5.5 differs from 5.1 here as follows: when thr_lock() returns an error:
        - 5.1 continues trying thr_lock() until success;
        - 5.5 propagates the error
      
      The fix is to check both THD::killed and mysys_var::abort when
      analyzing error returned from thr_lock().
      
      2. The test case uses KILL QUERY is a highly concurent environment.
      
      The fix is to wait for the dying statement to rest in peace before
      executing another DELETE FROM t1.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107079

3027 Alexander Nozdrin	2010-04-30
      Patch for Bug#52356: query_cache_debug fails on Linux.
      
      There were two problems here:
        1. misleading error message
        2. abusing KILL QUERY in the test case
      
      1. The server reported "'DELETE FROM t1' failed: 1689: Wait on a lock was
      aborted due to a pending exclusive lock", while the proper error message
      should be "'DELETE FROM t1' failed: 1317: Query execution was interrupted".
      
      The problem is that the server has two different flags for
      signalling that a query is being killed: THD::killed and
      mysys_var::abort. The test case triggers a race: sometimes
      mysys_var::abort is set earlier than THD::killed. That leads
      to the following situation:
      
        - thr_lock() checks mysys_var::abort and returns error status,
          since mysys_var::abort is set;
      
        - the caller (mysql_lock_tables()) gets an error from thr_lock(),
          but THD::killed is not set, so it decides that thr_lock() couldn't
          get a lock due to a pending exclusive lock.
      
      This is a known issue with the server and it's not going to be fixed soon.
      
      5.5 differs from 5.1 here as follows: when thr_lock() returns an error:
        - 5.1 continues trying thr_lock() until success;
        - 5.5 propagates the error
      
      2. The test case uses KILL QUERY is a highly concurent environment.
      
      The fix is to wait for the dying statement to rest in peace before
      executing another DELETE FROM t1.

The patch was approved by Kostja and pushed into trunk-bugfixing.

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100507091908-vqyhpwf2km0aokno) (version source revid:alik@sun.com-20100507091737-12vceffs11elb25g) (merge vers: 6.0.14-alpha) (pib:16)

Pushed into 5.5.5-m3 (revid:alik@sun.com-20100507091655-349gwq21ursz8y4p) (version source revid:alik@sun.com-20100507091655-349gwq21ursz8y4p) (merge vers: 5.5.5-m3) (pib:16)

Pushed into mysql-next-mr (revid:alik@sun.com-20100507091823-nzao4h3qosau4tin) (version source revid:alik@sun.com-20100507091720-ib9r8uny2aeazvas) (pib:16)

Noted in 5.5.5, 6.0.14 changelogs.

There was a race condition between flags used for signaling that a
query was killed, which led to error-reporting and lock-acquisition
problems.