Bug #41755 Intermittent failures of concurrent_ddl.test
Submitted: 26 Dec 2008 13:34 Modified: 6 Feb 2009 17:38
Reporter: Davi Arnaut (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Tests Severity:S3 (Non-critical)
Version:6.0-runtime OS:Linux
Assigned to: Jørgen Austvik CPU Architecture:Any

[26 Dec 2008 13:34] Davi Arnaut
Description:
Manually running the concurrent_ddl test case yields some intermittent failures.

How to repeat:
./mtr --suite=ddl_lock concurrent_ddl:

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

ddl_lock.concurrent_ddl        [ fail ]

Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
--- /concurrent_ddl.result    2008-12-23 13:53:25.000000000 +0300
+++ /concurrent_ddl.reject    2008-12-23 14:07:31.000000000 +0300
@@ -292,7 +292,7 @@
 # Switch to connection default
 SELECT SUM(a), SUM(b) FROM t2;
 SUM(a)    SUM(b)
-15    20
+NULL    NULL
 SELECT SUM(a), SUM(b) FROM t3;
 SUM(a)    SUM(b)
 NULL    NULL

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

ddl_lock.concurrent_ddl        [ fail ]

Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
Note: net_clear() skipped 11 bytes from file: socket (5)
mysqltest: At line 411: query 'reap' failed: 1146: Table 'test.t2' doesn't exist 

Suggested fix:
The net_clear() warnings are not interesting for now. What needs to be investigated is the test failures which are probably due to some bad interaction with debug sync points.
[28 Jan 2009 13:22] Jørgen Austvik
Davi, thanks for your report.

I am not able to reproduce here.

[ja@frigg24:mysql-test] uname -a
Linux frigg24 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:46 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[ja@frigg24:mysql-test] cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.1 (Tikanga)

Branch: mysql-6.0-runtime

Could you please run:
./mysql-test-run --debug --suite=ddl_lock concurrent_ddl

...and attach or mail me
var/log/master.trace

from your run?

Also, are we running on the same branch and platform?
[28 Jan 2009 13:41] Davi Arnaut
Branch is 6.0-runtime based on revno: 2714
Linux skynet 2.6.27-9-server #1 SMP Thu Nov 20 22:56:07 UTC 2008 x86_64 GNU/Linux
Ubuntu 8.10
MTR_BUILD_THREAD="1798"
MTR_MEM="/dev/shm"
2x Dual-Core AMD Opteron(tm) Processor 2222

Trace file attached for failure:

mysqltest: At line 411: query 'reap' failed: 1146: Table 'test.t2' doesn't exist
[28 Jan 2009 14:40] Jørgen Austvik
This is the test code:

---------8<--------------8<--------------8<--------------8<-----
--echo # Switch to connection locker
connection locker;
SET DEBUG_SYNC= 'after_lock_tables_takes_lock
                 SIGNAL locked WAIT_FOR do_unlock';
--echo # "send" next statement
send
ALTER TABLE t1 RENAME t2;

--echo # Switch to connection waiter
connection waiter;
SET DEBUG_SYNC= 'now WAIT_FOR locked';
SET DEBUG_SYNC= 'after_start_ddl SIGNAL do_unlock';
--echo # "send" next statement
send
ALTER TABLE t2 RENAME t3;
---------8<--------------8<--------------8<--------------8<-----

In Davi's run, rename from t2 to t3 is T@6 and signals do_unlock to T@5 which is the rename t1 to t2:
---------8<--------------8<--------------8<--------------8<-----
T@6    : | | | | | | | debug_sync: sync_point: 'after_start_ddl'  activation_count: 1  hit_limit: 0  execute: 1  timeout: 0  signal: 
'do_unlock'  wait_for: ''
T@6    : | | | | | | | mutex: debug_sync_global.ds_mutex (0x1364bd0) locking
T@6    : | | | | | | | mutex: debug_sync_global.ds_mutex (0x1364bd0) locked
T@6    : | | | | | | | debug_sync_exec: signal 'do_unlock'  at: 'after_start_ddl'
---------8<--------------8<--------------8<--------------8<-----

...but then dies out when it can't find the file with the table definition:
---------8<--------------8<--------------8<--------------8<-----
T@6    : | | | | | | >my_open
T@6    : | | | | | | | my: Name: './test/t2.frm'  Flags: 0  MyFlags: 0
T@6    : | | | | | | <my_open
T@6    : | | | | | | >my_register_filename
T@6    : | | | | | | | error: Got error 2 on open
---------8<--------------8<--------------8<--------------8<-----

When I run, T@5 is the rename t2 to t3 and T@4 is rename t1 to t2, but the significant difference is that T@5 is locked on the open table, and then T@4 takes over.
---------8<--------------8<--------------8<--------------8<-----
T@5    : | | | | | >mysql_frm_type
T@5    : | | | | | | >my_open
T@5    : | | | | | | | my: Name: './test/t2.frm'  Flags: 0  MyFlags: 0
T@4    : | | | | | | | | | | debug_sync: awoke from do_unlock  global: do_unlock  error: 0
T@4    : | | | | | | | | | | debug_sync_exec: resume from 'do_unlock'  at: 'after_lock_tables_takes_lock'
T@4    : | | | | | | | | | | mutex: debug_sync_global.ds_mutex (0x1035310) unlocking
T@4    : | | | | | | | | | | mutex: mysys_var->mutex (0xa404228) locking
T@4    : | | | | | | | | | | mutex: mysys_var->mutex (0xa404228) locked
---------8<--------------8<--------------8<--------------8<-----

This is how the test is designed. When T@4 is finished with the work and unlock the MDL, T@5 wil run and rename the table.

Now I have to figure out why Davi's myopen returns to quickly:
---------8<--------------8<--------------8<--------------8<-----
T@6    : | | | | | | >my_open
T@6    : | | | | | | | my: Name: './test/t2.frm'  Flags: 0  MyFlags: 0
T@6    : | | | | | | <my_open
---------8<--------------8<--------------8<--------------8<-----
[29 Jan 2009 11:42] Jørgen Austvik
Further investigations by Davi, Trond and me makes us believe the open() system call in my_open() makes the thread being rescheduled in my environment, but not in Davi's environment. The test therefor depends on a side effect in my environment, end it is therefor a bug in the test case.

To fix it, I must stop the first rename (t1 to t2) thread while it has an exclusive MDL lock on t2, either by using another sync point, or by adding a new sync point.
[2 Feb 2009 15:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64884

2718 Jorgen Austvik	2009-02-02
      bug#41755: add sync point and use it for testing locking of table names during ALTER TABLE RENAME
[4 Feb 2009 9:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/65101

2718 Jorgen Austvik	2009-02-04
      bug#41755: add sync point and use it for testing locking of table names during ALTER TABLE RENAME
[4 Feb 2009 22:46] Bugs System
Pushed into 6.0.10-alpha (revid:kostja@sun.com-20090204224034-07j5981v33849tkw) (version source revid:jorgen.austvik@sun.com-20090204091118-y696ncr59wcyki5b) (merge vers: 6.0.10-alpha) (pib:6)
[5 Feb 2009 7:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/65287

2812 He Zhenxing	2009-02-05 [merge]
      Auto merge 6.0 -> 6.0-rpl
[6 Feb 2009 17:38] Paul DuBois
Test suite changes. No changelog entry needed.