Bug #49399 BACKUP/RESTORE thread fails to wake up on concurrent BACKUP/RESTORE
Submitted: 3 Dec 2009 13:53 Modified: 30 Dec 2009 13:44
Reporter: Philip Stoev Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Backup Severity:S2 (Serious)
Version:6.0-backup OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[3 Dec 2009 13:53] Philip Stoev
Description:
When executing a BACKUP/RESTORE workload, a BACKUP or RESTORE thread may end up with the following note in the processlist

"Backup_sx_lock: waiting for release of all Shared locks"

and will hang like that forever, even if all other clients are disconnected. The hang is not influenced by any timeout, however it is possible to KILL the hanging thread.

This behavior would be a problem if the tool that issued the BACKUP query was not prepared to handle the hang without an intervention from the operator.

The backtrace is as follows:

#0  0x000000315b00b309 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000a3348d in safe_cond_wait (cond=0x14d2500, mp=0x14d2460, file=0xc11e7f "bml.cc", line=246) at thr_mutex.c:423
#2  0x00000000008a8a20 in Backup_sx_lock::get_exclusive_lock (this=0x14d2460, thd=0x2407388) at bml.cc:246
#3  0x00000000008aa9f3 in obs::bml_get (thd=0x2407388) at si_objects.cc:3296
#4  0x0000000000ad8ace in Backup_restore_ctx::prepare (this=0x7ff4eeae41f0, backupdir=0x7ff4eeae5170, location=
      {str = 0x24008a8 "/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp", length = 56}) at kernel.cc:722
#5  0x0000000000ad8ce5 in Backup_restore_ctx::prepare_for_restore (this=0x7ff4eeae41f0, backupdir=0x7ff4eeae5170, orig_loc=
      {str = 0x24008a8 "/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp", length = 56},
    query=0x2400750 "RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE", skip_gap_event=false) at kernel.cc:932
#6  0x0000000000ad9fc1 in execute_backup_command (thd=0x2407388, lex=0x2408c60, backupdir=0x7ff4eeae5170, overwrite=true, skip_gap_event=false)
    at kernel.cc:284
#7  0x00000000006857eb in mysql_execute_command (thd=0x2407388) at sql_parse.cc:2482
#8  0x000000000068d41b in mysql_parse (thd=0x2407388, inBuf=0x2400750 "RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE",
    length=81, found_semicolon=0x7ff4eeae5f00) at sql_parse.cc:5985
#9  0x000000000068dfac in dispatch_command (command=COM_QUERY, thd=0x2407388,
    packet=0x2530ac9 " RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE", packet_length=82) at sql_parse.cc:1078
#10 0x000000000068f51e in do_command (thd=0x2407388) at sql_parse.cc:760
#11 0x000000000067c344 in handle_one_connection (arg=0x2407388) at sql_connect.cc:1164
#12 0x000000315b0073da in start_thread () from /lib64/libpthread.so.0
#13 0x000000315a4e627d in clone () from /lib64/libc.so.6

How to repeat:
A RQG test case would be uploaded shortly.
[3 Dec 2009 13:54] Philip Stoev
grammar for bug 49399

Attachment: bug49399.yy (application/octet-stream, text), 69.74 KiB.

[3 Dec 2009 13:57] Philip Stoev
To reproduce with the RQG

$ perl runall.pl \
  --grammar=conf/49399.yy \
  --basedir=/build/bzr/mysql-6.0-backup \
  --queries=100K \
  --mysqld=--mysql-backup \
  --gendata=conf/WL5004_data.zz \
  --threads=10

Please disregard all output from the test itself. Shortly after takeoff, you will observe "Backup_sx_lock: waiting for release of all Shared locks" forming in the processlist. At that point, hit Ctrl+C on the RQG to terminate all connections except the offending one.

The grammar contains a huge variety of DDL and DML and has not been simplified due to the inability of the simplification tool to handle single-connection hangs like this bug. Apologies for that and please let me know if a simplifed test case is indeed required to fix this bug.
[3 Dec 2009 14:15] Rafal Somla
Possible problem (wild guess): bml_get() relies on the fact that each call to BML_instance->get_shared_lock() is matched with a call to BML_instance->release_shared_lock(). If a DDL which takes shared lock is interrupted and because of that does not release it, then we would get such hang.

Suggested solution. Represent shared lock as an object instance whose destructor releases the lock automatically. Destructor is always called, regardless of the way statement was interrupted.

The usage could look as follows:

 /* begining of DDL execution */
 BML_ticket bml_ticket;

 if (!bml_ticker.is_valid())
 {
   /* Opps, we could not obtain shared BML lock... */
 }

 /* do other stuff */

When bml_ticket is created, the shared bml lock is taken. When it is destroyed, lock is released.
[3 Dec 2009 14:40] Chuck Bell
May be related to BUG#49398.
[3 Dec 2009 15:35] Rafal Somla
Idea for a test which could verify my hypothesis.

Run DDL and stop it after taking the shared bml lock (using a sync point). Then run BACKUP and wait until it hangs on bml_get(). Then kill connection with DDL and try to reap BACKUP command. All this can be done from a test script. If it hangs we have a problem.
[10 Dec 2009 8:26] Ritheesh Vedire
The Test as described in the Rafal's hypothesis. No hang was observed.

Attachment: bug49399.test (application/octet-stream, text), 1.16 KiB.

[10 Dec 2009 8:28] Ritheesh Vedire
The result file. Backup got clearly executed even after a different thread,which acquired a shared lock is killed.

Attachment: bug49399.result (application/octet-stream, text), 549 bytes.

[14 Dec 2009 14:22] Philip Stoev
To reproduce this bug, please apply the following diff to the RQG

=== modified file 'lib/GenTest/Executor/MySQL.pm'
--- lib/GenTest/Executor/MySQL.pm       2009-12-04 11:40:28 +0000
+++ lib/GenTest/Executor/MySQL.pm       2009-12-14 14:20:45 +0000
@@ -399,7 +399,7 @@

        ER_BACKUP_SEND_DATA1()  => STATUS_BACKUP_FAILURE,
        ER_BACKUP_SEND_DATA2()  => STATUS_BACKUP_FAILURE,
-       ER_BACKUP_PROGRESS_TABLES() => STATUS_BACKUP_FAILURE,
+#      ER_BACKUP_PROGRESS_TABLES() => STATUS_BACKUP_FAILURE,
        ER_BACKUP_RUNNING()     => STATUS_SEMANTIC_ERROR,

        ER_CANT_CREATE_THREAD() => STATUS_ENVIRONMENT_FAILURE,

This is going to make the ER_BACKUP_PROGRESS_TABLES error non-fatal, so that the test can continue and the situation described in the bug report can form.

The RQG itself will not detect the problem, please manually issue SHOW PROCESSLIST in order to view the problematic BACKUP or RESTORE operation.
[30 Dec 2009 13:44] Ritheesh Vedire
BUG#49603 fixes this too. So setting  the bug as duplicate.