MySQL Bugs: #45694: Deadlock in replicated statement is not retried

Bug #45694	Deadlock in replicated statement is not retried
Submitted:	24 Jun 2009 7:04	Modified:	16 Sep 2009 10:11
Reporter:	Jesper Wisborg Krogh	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.77, 5.0 5.1, azalea bzr	OS:	Linux
Assigned to:	Alfranio Tavares Correia Junior	CPU Architecture:	Any
Tags:	deadlock, innodb, replication

Description:
If a statement replicates and is rolled back due to a deadlock on the slave, it is not always retried even with slave_transaction_retries set.

With the attached example the deadlock causes replication to stop with the error message:

Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'test'. Query: 'UPDATE deadlock1 SET id = id + 1'

The output from the mysql error log on the slave:

090624 16:35:42 [ERROR] Slave: Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'test'. Query: 'UPDATE deadlock1 SET id = id + 1', Error_code: 1213
090624 16:35:42 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'bin-update-log.000032' position 72470

The issue seems to show up when the deadlock involves two different indexes on the same table. From SHOW ENGINE INNODB STATUS:

------------------------
LATEST DETECTED DEADLOCK
------------------------
090624 16:35:42
*** (1) TRANSACTION:
TRANSACTION 0 257053, ACTIVE 3 sec, process no 26673, OS thread id 1080613184 updating or deleting
mysql tables in use 2, locked 2
LOCK WAIT 5 lock struct(s), heap size 1216, undo log entries 2
MySQL thread id 120, query id 2540 Updating
UPDATE deadlock2 SET somekey = somekey + 1 WHERE id = 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4487 page no 4 n bits 72 index `somekey` of table `test/deadlock2` trx id 0 257053 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 1; hex 01; asc  ;;

*** (2) TRANSACTION:
TRANSACTION 0 257054, ACTIVE 3 sec, process no 26673, OS thread id 1170024768 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1216, undo log entries 3
MySQL thread id 113, query id 2541 localhost root statistics
SELECT * FROM deadlock2 WHERE id = 1 FOR UPDATE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4487 page no 4 n bits 72 index `somekey` of table `test/deadlock2` trx id 0 257054 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 1; hex 01; asc  ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4487 page no 3 n bits 72 index `PRIMARY` of table `test/deadlock2` trx id 0 257054 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 6; hex 00000003ec1d; asc       ;; 2: len 7; hex 00000008b61b62; asc       b;; 3: len 1; hex 02; asc  ;;

*** WE ROLL BACK TRANSACTION (1)

How to repeat:
Using the attached file, setup the tables. Then execute the statement for node 1. As soon as that finishes switch to the other node and run the statements for node 2. The deadlock should occur.

Note that some of the statements such as the INSERT INTO deadlock1 VALUE (0) for node 2 are just there to ensure the replicated statement gets rolled back.

Suggested fix:
Allow the slave to automatically retry the transaction.

Test case

Attachment: replication_deadlock.sql (text/x-sql), 1.19 KiB.

Thank you for the report.

What do you mean by node1 and node2? Are they master/slave connections or what?

Sorry, yes node1 is the master and node2 is the slave.

Many thanks for writing a bug report.

I am not able to repeat this with actual source tree (MySQL 5.0.85). I will look if it is repeatable with actual release.

I am not able to repeat this with MySQL 5.0.83. Please, try this actual version and give us feedback if all is fine with it.

I used your test for reproducing.

Hi,

I just tried to install 5.0.83 on our test server and I can still replicate the problem.

I used the "Linux (x86)" tar ball with the attached my.cnf. The following grants and replication setup was done:

Node 1:
> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'node2' IDENTIFIED BY 'some_password';
> RESET MASTER;
> CHANGE MASTER TO MASTER_HOST='node2', MASTER_PORT=3507, MASTER_USER='repl', MASTER_LOG_FILE='node2-bin.000001', MASTER_LOG_POS=98, MASTER_PASSWORD='some_password';

Node 2:
> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'node1' IDENTIFIED BY 'some_password';
> RESET MASTER;
> CHANGE MASTER TO MASTER_HOST='node1', MASTER_PORT=3507, MASTER_USER='repl', MASTER_LOG_FILE='node1-bin.000001', MASTER_LOG_POS=98, MASTER_PASSWORD='some_password';

The steps were:

1) Run the SQL from the replication_deadlock.sql file to set up the schema and initial data.
2) Run the update on the master (node 1 in the replication_deadlock.sql file)
3) Wait (important!) for the statement in 2) to finish on the master before proceeding.
4) While the replicated statement is running on the slave, run the statements for node 2 from the replication_deadlock.sql file.

This is the same procedure as original described, but I thought it would be worth restating them.

I hope that helps.

The my.cnf file used for the test setup

Attachment: my.cnf (application/octet-stream, text), 85 bytes.

An idea - could this be related to bug #24989? Before 5.0.48 a deadlock inside a trigger would cause an error rather than a deadlock warning. Could it be that for that reason the deadlock is still considered a fatal error, which will prevent the retry?

Thank you for the feedback.

This is actually not a bug: if deadlock occurs transaction should be rolled back.

Since version 5.1 you can use binary log format ROW to avoid such deadlocks. See also http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html

I don't agree that this is not a bug. According to the documentation (http://dev.mysql.com/doc/refman/5.0/en/replication-options-slave.html#sysvar_slave_transac...) when deadlock occurs in a replicated statement for an InnoDB table, then the transaction will be retried automatically when the slave_transaction_retries variable is greater than 0.

If you try a simpler test case (see attached file), then the transaction will indeed retry as expected with just an warning in the error log:

090724 16:46:49 [ERROR] Slave: Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'test'. Query: 'UPDAT
E deadlock SET value = value + 1 WHERE id = 2', Error_code: 1213

The error code is also the same as in the original test case - the difference being whether there is an automatic retry or not.

Regarding row based replication in 5.1 then that is not an option here as in the production case it is essential that the trigger actually runs on the slave.

Test case where the transaction is retried

Attachment: replication_deadlock_2.sql (text/x-sql), 877 bytes.

Thank you for the feedback.

Verified as described.

test case for the testsuite

Attachment: rpl_bug45694.test (application/octet-stream, text), 1.23 KiB.

option file

Attachment: rpl_bug45694-slave.opt (application/octet-stream, text), 40 bytes.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80391

3059 Alfranio Correia	2009-08-07
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80772

3059 Alfranio Correia	2009-08-13
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81098

3059 Alfranio Correia	2009-08-19
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {

Pushed to 5.1-bugteam.

Pushed into 5.1.39 (revid:joro@sun.com-20090902154533-8actmfcsjfqovgsb) (version source revid:alfranio.correia@sun.com-20090824103744-fk8627t4c63eqznz) (merge vers: 5.1.39) (pib:11)

Documented bugfix in the 5.1.39 changelog as follows:

        When slave_transaction_retries is set, a statement that
        replicates, but is then rolled back due to a deadlock on the
        slave, should be retried. However, in certain cases, replication
        was stopped with error 1213 (Deadlock found when trying to get
        lock; try restarting transaction) instead, even when this
        variable was set.

Set status to NDI, waiting for additional pushes.

Pushed into 5.4.4-alpha (revid:alik@sun.com-20090914155317-m1g9wodmndzdj4l1) (version source revid:alik@sun.com-20090914155317-m1g9wodmndzdj4l1) (merge vers: 5.4.4-alpha) (pib:11)

Also documented in the 5.4.4 changelog.

Closed.

Pushed into 5.1.39-ndb-6.3.28 (revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv) (version source revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv) (merge vers: 5.1.39-ndb-6.3.28) (pib:11)

Pushed into 5.1.39-ndb-7.0.9 (revid:jonas@mysql.com-20091001072547-kv17uu06hfjhgjay) (version source revid:jonas@mysql.com-20091001071652-irejtnumzbpsbgk2) (merge vers: 5.1.39-ndb-7.0.9) (pib:11)

Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0) (version source revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0) (merge vers: 5.1.39-ndb-7.1.0) (pib:11)

Pushed into 5.1.39-ndb-6.2.19 (revid:jonas@mysql.com-20091005103850-dwij2dojwpvf5hi6) (version source revid:jonas@mysql.com-20090930185117-bhud4ek1y0hsj1nv) (merge vers: 5.1.39-ndb-6.2.19) (pib:11)

The 5.4 fix has been pushed into 5.4.3.

Has a patch for this been ported into the 5.0 branch of mysql? I believe we're seeing this.

Thanks

Bug is NOT fixed in 5.0.x, even with 5.0.97. Sveta's original test case still ends up with error on slave:

[openxs@chief mysql-test]$ ./mtr rpl_bug45694
Logging: ./mtr rpl_bug45694
MySQL Version 5.0.97
Skipping ndbcluster
Setting mysqld to support SSL connections
Binaries are debug compiled
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 9306
Using MASTER_MYPORT1        = 9307
Using SLAVE_MYPORT          = 9308
Using SLAVE_MYPORT1         = 9309
Using SLAVE_MYPORT2         = 9310
Using IM_PORT               = 9313
Using IM_MYSQLD1_PORT       = 9314
Using IM_MYSQLD2_PORT       = 9315
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Installing Master Database
Installing Slave1 Database
=======================================================
Starting Tests in the 'main' suite

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

rpl_bug45694                   [ fail ]

--- /home/openxs/dbs/5.0/mysql-test/r/rpl_bug45694.result       2012-04-25 09:54:11.110985973 +0300
+++ /home/openxs/dbs/5.0/mysql-test/r/rpl_bug45694.reject       2012-04-25 09:54:34.764717800 +0300
@@ -0,0 +1,91 @@
+stop slave;
+drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
+reset master;
+reset slave;
+drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
+start slave;
+DROP TABLE IF EXISTS deadlock1;
+Warnings:
+Note   1051    Unknown table 'deadlock1'
+CREATE TABLE deadlock1 (
+id tinyint unsigned NOT NULL
+) ENGINE=InnoDB;
+DROP TABLE IF EXISTS deadlock2;
+Warnings:
+Note   1051    Unknown table 'deadlock2'
+CREATE TABLE deadlock2 (
+id tinyint unsigned NOT NULL auto_increment,
+somekey tinyint unsigned NOT NULL DEFAULT 0,
+PRIMARY KEY (id),
+INDEX (somekey)
+) ENGINE=InnoDB;
+CREATE TRIGGER aftupd_deadlock1 AFTER UPDATE ON deadlock1 FOR EACH ROW BEGIN
+DECLARE v_key, v_sleep tinyint unsigned;
+SELECT somekey INTO v_key
+FROM deadlock2
+WHERE id = 1;
+SELECT sleep(3) INTO v_sleep;
+IF (v_key = 1) THEN
+UPDATE deadlock2 SET somekey = somekey + 1 WHERE id = 1;
+END IF;
+END;
+//
+INSERT INTO deadlock1 VALUES
+(0);
+INSERT INTO deadlock2 VALUES
+(1, 1);
+UPDATE deadlock1 SET id = id + 1;
+START TRANSACTION;
+SELECT somekey
+FROM deadlock2
+WHERE somekey = 1
+LOCK IN SHARE MODE;
+somekey
+1
+INSERT INTO deadlock1 VALUES (0);
+INSERT INTO deadlock1 VALUES (0);
+INSERT INTO deadlock1 VALUES (0);
+SELECT sleep(3);
+sleep(3)
+0
+SELECT *
+FROM deadlock2
+WHERE id = 1
+FOR UPDATE;
+id     somekey
+1      1
+COMMIT;
+show slave status;
+Slave_IO_State Waiting for master to send event
+Master_Host    127.0.0.1
+Master_User    root
+Master_Port    9306
+Connect_Retry  1
+Master_Log_File        master-bin.000001
+Read_Master_Log_Pos    1597
+Relay_Log_File slave-relay-bin.000003
+Relay_Log_Pos  1545
+Relay_Master_Log_File  master-bin.000001
+Slave_IO_Running       Yes
+Slave_SQL_Running      No
+Replicate_Do_DB
+Replicate_Ignore_DB
+Replicate_Do_Table
+Replicate_Ignore_Table
+Replicate_Wild_Do_Table
+Replicate_Wild_Ignore_Table
+Last_Errno     1213
+Last_Error     Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'test'. Query: 'UPDATE deadlock1 SET id = id + 1'
+Skip_Counter   0
+Exec_Master_Log_Pos    1407
+Relay_Log_Space        1735
+Until_Condition        None
+Until_Log_File
+Until_Log_Pos  0
+Master_SSL_Allowed     No
+Master_SSL_CA_File
+Master_SSL_CA_Path
+Master_SSL_Cert
+Master_SSL_Cipher
+Master_SSL_Key
+Seconds_Behind_Master  NULL
...