MySQL Bugs: #49399: BACKUP/RESTORE thread fails to wake up on concurrent BACKUP/RESTORE

Bug #49399	BACKUP/RESTORE thread fails to wake up on concurrent BACKUP/RESTORE
Submitted:	3 Dec 2009 13:53	Modified:	30 Dec 2009 13:44
Reporter:	Philip Stoev	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Backup	Severity:	S2 (Serious)
Version:	6.0-backup	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When executing a BACKUP/RESTORE workload, a BACKUP or RESTORE thread may end up with the following note in the processlist

"Backup_sx_lock: waiting for release of all Shared locks"

and will hang like that forever, even if all other clients are disconnected. The hang is not influenced by any timeout, however it is possible to KILL the hanging thread.

This behavior would be a problem if the tool that issued the BACKUP query was not prepared to handle the hang without an intervention from the operator.

The backtrace is as follows:

#0  0x000000315b00b309 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000a3348d in safe_cond_wait (cond=0x14d2500, mp=0x14d2460, file=0xc11e7f "bml.cc", line=246) at thr_mutex.c:423
#2  0x00000000008a8a20 in Backup_sx_lock::get_exclusive_lock (this=0x14d2460, thd=0x2407388) at bml.cc:246
#3  0x00000000008aa9f3 in obs::bml_get (thd=0x2407388) at si_objects.cc:3296
#4  0x0000000000ad8ace in Backup_restore_ctx::prepare (this=0x7ff4eeae41f0, backupdir=0x7ff4eeae5170, location=
      {str = 0x24008a8 "/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp", length = 56}) at kernel.cc:722
#5  0x0000000000ad8ce5 in Backup_restore_ctx::prepare_for_restore (this=0x7ff4eeae41f0, backupdir=0x7ff4eeae5170, orig_loc=
      {str = 0x24008a8 "/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp", length = 56},
    query=0x2400750 "RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE", skip_gap_event=false) at kernel.cc:932
#6  0x0000000000ad9fc1 in execute_backup_command (thd=0x2407388, lex=0x2408c60, backupdir=0x7ff4eeae5170, overwrite=true, skip_gap_event=false)
    at kernel.cc:284
#7  0x00000000006857eb in mysql_execute_command (thd=0x2407388) at sql_parse.cc:2482
#8  0x000000000068d41b in mysql_parse (thd=0x2407388, inBuf=0x2400750 "RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE",
    length=81, found_semicolon=0x7ff4eeae5f00) at sql_parse.cc:5985
#9  0x000000000068dfac in dispatch_command (command=COM_QUERY, thd=0x2407388,
    packet=0x2530ac9 " RESTORE FROM '/tmp/gentest19393-1259848350.83141-0.607944210133617.tmp' OVERWRITE", packet_length=82) at sql_parse.cc:1078
#10 0x000000000068f51e in do_command (thd=0x2407388) at sql_parse.cc:760
#11 0x000000000067c344 in handle_one_connection (arg=0x2407388) at sql_connect.cc:1164
#12 0x000000315b0073da in start_thread () from /lib64/libpthread.so.0
#13 0x000000315a4e627d in clone () from /lib64/libc.so.6

How to repeat:
A RQG test case would be uploaded shortly.

grammar for bug 49399

Attachment: bug49399.yy (application/octet-stream, text), 69.74 KiB.

To reproduce with the RQG

$ perl runall.pl \
  --grammar=conf/49399.yy \
  --basedir=/build/bzr/mysql-6.0-backup \
  --queries=100K \
  --mysqld=--mysql-backup \
  --gendata=conf/WL5004_data.zz \
  --threads=10

Please disregard all output from the test itself. Shortly after takeoff, you will observe "Backup_sx_lock: waiting for release of all Shared locks" forming in the processlist. At that point, hit Ctrl+C on the RQG to terminate all connections except the offending one.

The grammar contains a huge variety of DDL and DML and has not been simplified due to the inability of the simplification tool to handle single-connection hangs like this bug. Apologies for that and please let me know if a simplifed test case is indeed required to fix this bug.

Possible problem (wild guess): bml_get() relies on the fact that each call to BML_instance->get_shared_lock() is matched with a call to BML_instance->release_shared_lock(). If a DDL which takes shared lock is interrupted and because of that does not release it, then we would get such hang.

Suggested solution. Represent shared lock as an object instance whose destructor releases the lock automatically. Destructor is always called, regardless of the way statement was interrupted.

The usage could look as follows:

 /* begining of DDL execution */
 BML_ticket bml_ticket;

 if (!bml_ticker.is_valid())
 {
   /* Opps, we could not obtain shared BML lock... */
 }

 /* do other stuff */

When bml_ticket is created, the shared bml lock is taken. When it is destroyed, lock is released.

May be related to BUG#49398.

Idea for a test which could verify my hypothesis.

Run DDL and stop it after taking the shared bml lock (using a sync point). Then run BACKUP and wait until it hangs on bml_get(). Then kill connection with DDL and try to reap BACKUP command. All this can be done from a test script. If it hangs we have a problem.

The Test as described in the Rafal's hypothesis. No hang was observed.

Attachment: bug49399.test (application/octet-stream, text), 1.16 KiB.

The result file. Backup got clearly executed even after a different thread,which acquired a shared lock is killed.

Attachment: bug49399.result (application/octet-stream, text), 549 bytes.

To reproduce this bug, please apply the following diff to the RQG

=== modified file 'lib/GenTest/Executor/MySQL.pm'
--- lib/GenTest/Executor/MySQL.pm       2009-12-04 11:40:28 +0000
+++ lib/GenTest/Executor/MySQL.pm       2009-12-14 14:20:45 +0000
@@ -399,7 +399,7 @@

        ER_BACKUP_SEND_DATA1()  => STATUS_BACKUP_FAILURE,
        ER_BACKUP_SEND_DATA2()  => STATUS_BACKUP_FAILURE,
-       ER_BACKUP_PROGRESS_TABLES() => STATUS_BACKUP_FAILURE,
+#      ER_BACKUP_PROGRESS_TABLES() => STATUS_BACKUP_FAILURE,
        ER_BACKUP_RUNNING()     => STATUS_SEMANTIC_ERROR,

        ER_CANT_CREATE_THREAD() => STATUS_ENVIRONMENT_FAILURE,

This is going to make the ER_BACKUP_PROGRESS_TABLES error non-fatal, so that the test can continue and the situation described in the bug report can form.

The RQG itself will not detect the problem, please manually issue SHOW PROCESSLIST in order to view the problematic BACKUP or RESTORE operation.

BUG#49603 fixes this too. So setting  the bug as duplicate.