Bug #45694 Deadlock in replicated statement is not retried
Submitted: 24 Jun 2009 7:04 Modified: 16 Sep 2009 10:11
Reporter: Jesper Wisborg Krogh Email Updates:
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:5.0.77, 5.0 5.1, azalea bzr OS:Linux
Assigned to: Alfranio Correia Target Version:5.1+
Tags: replication, deadlock, innodb
Triage: Triaged: D2 (Serious) / R2 (Low) / E2 (Low)

[24 Jun 2009 7:04] Jesper Wisborg Krogh
Description:
If a statement replicates and is rolled back due to a deadlock on the slave, it is not
always retried even with slave_transaction_retries set.

With the attached example the deadlock causes replication to stop with the error
message:

Error 'Deadlock found when trying to get lock; try restarting transaction' on query.
Default database: 'test'. Query: 'UPDATE deadlock1 SET id = id + 1'

The output from the mysql error log on the slave:

090624 16:35:42 [ERROR] Slave: Error 'Deadlock found when trying to get lock; try
restarting transaction' on query. Default database: 'test'. Query: 'UPDATE deadlock1 SET
id = id + 1', Error_code: 1213
090624 16:35:42 [ERROR] Error running query, slave SQL thread aborted. Fix the problem,
and restart the slave SQL thread with "SLAVE START". We stopped at log
'bin-update-log.000032' position 72470

The issue seems to show up when the deadlock involves two different indexes on the same
table. From SHOW ENGINE INNODB STATUS:

------------------------
LATEST DETECTED DEADLOCK
------------------------
090624 16:35:42
*** (1) TRANSACTION:
TRANSACTION 0 257053, ACTIVE 3 sec, process no 26673, OS thread id 1080613184 updating or
deleting
mysql tables in use 2, locked 2
LOCK WAIT 5 lock struct(s), heap size 1216, undo log entries 2
MySQL thread id 120, query id 2540 Updating
UPDATE deadlock2 SET somekey = somekey + 1 WHERE id = 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4487 page no 4 n bits 72 index `somekey` of table `test/deadlock2`
trx id 0 257053 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 1; hex 01; asc  ;;

*** (2) TRANSACTION:
TRANSACTION 0 257054, ACTIVE 3 sec, process no 26673, OS thread id 1170024768 starting
index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1216, undo log entries 3
MySQL thread id 113, query id 2541 localhost root statistics
SELECT * FROM deadlock2 WHERE id = 1 FOR UPDATE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4487 page no 4 n bits 72 index `somekey` of table `test/deadlock2`
trx id 0 257054 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 1; hex 01; asc  ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4487 page no 3 n bits 72 index `PRIMARY` of table `test/deadlock2`
trx id 0 257054 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 1; hex 01; asc  ;; 1: len 6; hex 00000003ec1d; asc       ;; 2: len 7; hex
00000008b61b62; asc       b;; 3: len 1; hex 02; asc  ;;

*** WE ROLL BACK TRANSACTION (1)

How to repeat:
Using the attached file, setup the tables. Then execute the statement for node 1. As soon
as that finishes switch to the other node and run the statements for node 2. The deadlock
should occur.

Note that some of the statements such as the INSERT INTO deadlock1 VALUE (0) for node 2
are just there to ensure the replicated statement gets rolled back.

Suggested fix:
Allow the slave to automatically retry the transaction.
[24 Jun 2009 7:05] Jesper Wisborg Krogh
Test case

Attachment: replication_deadlock.sql (text/x-sql), 1.19 KiB.

[24 Jun 2009 7:48] Sveta Smirnova
Thank you for the report.

What do you mean by node1 and node2? Are they master/slave connections or what?
[24 Jun 2009 9:24] Jesper Wisborg Krogh
Sorry, yes node1 is the master and node2 is the slave.
[13 Jul 2009 4:57] Susanne Ebrecht
Many thanks for writing a bug report.

I am not able to repeat this with actual source tree (MySQL 5.0.85). I will look if it is
repeatable with actual release.
[13 Jul 2009 7:11] Susanne Ebrecht
I am not able to repeat this with MySQL 5.0.83. Please, try this actual version and give
us feedback if all is fine with it.

I used your test for reproducing.
[22 Jul 2009 0:27] Jesper Wisborg Krogh
Hi,

I just tried to install 5.0.83 on our test server and I can still replicate the problem.

I used the "Linux (x86)" tar ball with the attached my.cnf. The following grants and
replication setup was done:

Node 1:
> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'node2' IDENTIFIED BY 'some_password';
> RESET MASTER;
> CHANGE MASTER TO MASTER_HOST='node2', MASTER_PORT=3507, MASTER_USER='repl',
MASTER_LOG_FILE='node2-bin.000001', MASTER_LOG_POS=98, MASTER_PASSWORD='some_password';

Node 2:
> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'node1' IDENTIFIED BY 'some_password';
> RESET MASTER;
> CHANGE MASTER TO MASTER_HOST='node1', MASTER_PORT=3507, MASTER_USER='repl',
MASTER_LOG_FILE='node1-bin.000001', MASTER_LOG_POS=98, MASTER_PASSWORD='some_password';

The steps were:

1) Run the SQL from the replication_deadlock.sql file to set up the schema and initial
data.
2) Run the update on the master (node 1 in the replication_deadlock.sql file)
3) Wait (important!) for the statement in 2) to finish on the master before proceeding.
4) While the replicated statement is running on the slave, run the statements for node 2
from the replication_deadlock.sql file.

This is the same procedure as original described, but I thought it would be worth
restating them.

I hope that helps.
[22 Jul 2009 0:28] Jesper Wisborg Krogh
The my.cnf file used for the test setup

Attachment: my.cnf (application/octet-stream, text), 85 bytes.

[23 Jul 2009 1:34] Jesper Wisborg Krogh
An idea - could this be related to bug #24989? Before 5.0.48 a deadlock inside a trigger
would cause an error rather than a deadlock warning. Could it be that for that reason the
deadlock is still considered a fatal error, which will prevent the retry?
[24 Jul 2009 6:14] Sveta Smirnova
Thank you for the feedback.

This is actually not a bug: if deadlock occurs transaction should be rolled back.

Since version 5.1 you can use binary log format ROW to avoid such deadlocks. See also
http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html
[24 Jul 2009 7:02] Jesper Wisborg Krogh
I don't agree that this is not a bug. According to the documentation
(http://dev.mysql.com/doc/refman/5.0/en/replication-options-slave.html#sysvar_slave_transac...)
when deadlock occurs in a replicated statement for an InnoDB table, then the transaction
will be retried automatically when the slave_transaction_retries variable is greater than
0.

If you try a simpler test case (see attached file), then the transaction will indeed
retry as expected with just an warning in the error log:

090724 16:46:49 [ERROR] Slave: Error 'Deadlock found when trying to get lock; try
restarting transaction' on query. Default database: 'test'. Query: 'UPDAT
E deadlock SET value = value + 1 WHERE id = 2', Error_code: 1213

The error code is also the same as in the original test case - the difference being
whether there is an automatic retry or not.

Regarding row based replication in 5.1 then that is not an option here as in the
production case it is essential that the trigger actually runs on the slave.
[24 Jul 2009 7:03] Jesper Wisborg Krogh
Test case where the transaction is retried

Attachment: replication_deadlock_2.sql (text/x-sql), 877 bytes.

[24 Jul 2009 7:36] Sveta Smirnova
Thank you for the feedback.

Verified as described.
[24 Jul 2009 7:37] Sveta Smirnova
test case for the testsuite

Attachment: rpl_bug45694.test (application/octet-stream, text), 1.23 KiB.

[24 Jul 2009 7:38] Sveta Smirnova
option file

Attachment: rpl_bug45694-slave.opt (application/octet-stream, text), 40 bytes.

[7 Aug 2009 18:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80391

3059 Alfranio Correia	2009-08-07
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {
[13 Aug 2009 18:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80772

3059 Alfranio Correia	2009-08-13
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {
[19 Aug 2009 15:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81098

3059 Alfranio Correia	2009-08-19
      BUG#45694 Deadlock in replicated statement is not retried
      
      If the SQL Thread fails to execute an event due to a temporary error (e.g.
      ER_LOCK_DEADLOCK) and the option "--slave_transaction_retries" is set the SQL
      Thread should not be aborted and the transaction should be restarted from the
      beginning and re-executed.
      
      Unfortunately, a wrong interpretation of the THD::is_fatal_error was preventing
      this behavior. In a nutshell, "this variable is set to TRUE if an execution of a
      compound statement cannot continue. In particular, it is used to disable access
      to the CONTINUE or EXIT handlers of stored routines. So even temporary errors
      may have this variable set.
      
      To fix the bug, we have done what follows:
      
         DBUG_ENTER("has_temporary_error");
      
      -  if (thd->is_fatal_error)
      -    DBUG_RETURN(0);
      -
         DBUG_EXECUTE_IF("all_errors_are_temporary_errors",
                         if (thd->main_da.is_error())
                         {
[24 Aug 2009 10:49] Alfranio Correia
Pushed to 5.1-bugteam.
[2 Sep 2009 16:42] Bugs System
Pushed into 5.1.39 (revid:joro@sun.com-20090902154533-8actmfcsjfqovgsb) (version source
revid:alfranio.correia@sun.com-20090824103744-fk8627t4c63eqznz) (merge vers: 5.1.39)
(pib:11)
[9 Sep 2009 8:56] Jon Stephens
Documented bugfix in the 5.1.39 changelog as follows:

        When slave_transaction_retries is set, a statement that
        replicates, but is then rolled back due to a deadlock on the
        slave, should be retried. However, in certain cases, replication
        was stopped with error 1213 (Deadlock found when trying to get
        lock; try restarting transaction) instead, even when this
        variable was set.

Set status to NDI, waiting for additional pushes.
[14 Sep 2009 16:03] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090914155317-m1g9wodmndzdj4l1) (version
source revid:alik@sun.com-20090914155317-m1g9wodmndzdj4l1) (merge vers: 5.4.4-alpha)
(pib:11)
[16 Sep 2009 10:11] Jon Stephens
Also documented in the 5.4.4 changelog.

Closed.
[1 Oct 2009 5:59] Bugs System
Pushed into 5.1.39-ndb-6.3.28 (revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv)
(version source revid:jonas@mysql.com-20091001055605-ap2kiaarr7p40mmv) (merge vers:
5.1.39-ndb-6.3.28) (pib:11)
[1 Oct 2009 7:25] Bugs System
Pushed into 5.1.39-ndb-7.0.9 (revid:jonas@mysql.com-20091001072547-kv17uu06hfjhgjay)
(version source revid:jonas@mysql.com-20091001071652-irejtnumzbpsbgk2) (merge vers:
5.1.39-ndb-7.0.9) (pib:11)
[1 Oct 2009 13:25] Bugs System
Pushed into 5.1.39-ndb-7.1.0 (revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0)
(version source revid:jonas@mysql.com-20091001123013-g9ob2tsyctpw6zs0) (merge vers:
5.1.39-ndb-7.1.0) (pib:11)
[5 Oct 2009 10:50] Bugs System
Pushed into 5.1.39-ndb-6.2.19 (revid:jonas@mysql.com-20091005103850-dwij2dojwpvf5hi6)
(version source revid:jonas@mysql.com-20090930185117-bhud4ek1y0hsj1nv) (merge vers:
5.1.39-ndb-6.2.19) (pib:11)
[7 Oct 2009 1:47] Paul DuBois
The 5.4 fix has been pushed into 5.4.3.
[14 May 2010 21:34] Kevin Fries
Has a patch for this been ported into the 5.0 branch of mysql? I believe we're seeing
this.

Thanks