MySQL Bugs: #20697: slave fails to rollback replicated transaction hang over innodb_lock_wait

Bug #20697	slave fails to rollback replicated transaction hang over innodb_lock_wait_timeou
Submitted:	26 Jun 2006 13:27	Modified:	21 Nov 2006 19:56
Reporter:	Sebastian Nohn	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.0.18-standard,5.0.22	OS:	Linux (Linux)
Assigned to:	Andrei Elkin	CPU Architecture:	Any

Description:
When something is executed on the slave that modifies data and is taking longer than lock_timeout and locks some data and another query using transactions on the master is executed, that is affected by that lock, the replication can break because it seems not the whole transaction on the slave is rolled back and restarted but the transaction is not rolled back and restarted.

How to repeat:
On the slave:

<?php
// Does not more than "BEGIN"
DB::createTransaction();
$sql = 'SELECT * from advertisement where id_advertisement = 2056551 FOR UPDATE';
// Execute and Commit.
DB::executeRawQuery($sql);
sleep(100);
?>

On the master:

BEGIN;
INSERT INTO advertisement (ID_PLACEMENT_AD, ID_RULE_SET, ID_ACTIVATION_LOG, RULE_NAME, DATE_CREATED, ID_EXTERNAL_TYPE) values (2249, 1, 1, 'Test for Lock', now(), 1);
UPDATE advertisement SET date_deactivated=now() where id_advertisement = 2056551;
COMMIT;

After that, error log says:

060626 15:10:07 [ERROR] Slave: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'dms_2_0'. Query: 'UPDATE advertisement SET date_deactivated=now() where id_advertisement = 2056551', Error_code: 1205
060626 15:10:07 [ERROR] Slave: Error 'Duplicate entry '2056553' for key 1' on query. Default database: 'dms_2_0'. Query: 'INSERT INTO advertisement (ID_PLACEMENT_AD, ID_RULE_SET, ID_ACTIVATION_LOG, RULE_NAME, DATE_CREATED, ID_EXTERNAL_TYPE) values (2249, 1, 1, 'Test for Lock', now(), 1)', Error_code: 1062
060626 15:10:07 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'bladeA8-bin.000001' position 12030780 060626 15:11:06 [Warning] Aborted connection 10645 to db: 'dms_2_0' user: 'dms_int' host: '192.168.200.129' (Got an error reading communication packets)

root@localhost > show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: 192.168.200.108
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: bladeA8-bin.000001
        Read_Master_Log_Pos: 12089132
             Relay_Log_File: bladeA9-relay-bin.000005
              Relay_Log_Pos: 9823794
      Relay_Master_Log_File: bladeA8-bin.000001
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB: 
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: dms_2_0.test_aggregation_%,dms_2_0.test_figures,dms_2_0.test_olap_%,dms_2_0.figures,dms_2_0.aggregation_%,dms_2_0.olap_%
                 Last_Errno: 1062
                 Last_Error: Error 'Duplicate entry '2056553' for key 1' on query. Default database: 'dms_2_0'. Query: 'INSERT INTO advertisement (ID_PLACEMENT_AD, ID_RULE_SET, ID_ACTIVATION_LOG, RULE_NAME, DATE_CREATED, ID_EXTERNAL_TYPE) values (2249, 1, 1, 'Test for Lock', now(), 1)'
               Skip_Counter: 0
        Exec_Master_Log_Pos: 12030780
            Relay_Log_Space: 9882146
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

The insert has been commited to the slave:

root@localhost [dms_2_0]> select date_created from advertisement where id_advertisement = 2056553;
+---------------------+
| date_created        |
+---------------------+
| 2006-06-26 15:10:04 |
+---------------------+
1 row in set (0.00 sec)

The update not:

root@localhost [dms_2_0]> select date_deactivated from advertisement where id_advertisement = 2056551;
+------------------+
| date_deactivated |
+------------------+
| NULL             |
+------------------+
1 row in set (0.00 sec)

Relay log on slave says:

/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
# at 9823794
#060626 15:10:05 server id 1101  end_log_pos 12030856   Query   thread_id=378925        exec_time=0     error_code=0
use dms_2_0;
SET TIMESTAMP=1151327405;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1, @@session.unique_checks=1;
SET @@session.sql_mode=0;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=31;
BEGIN;
# at 9823870
#060626 15:10:04 server id 1101  end_log_pos 28         Intvar
SET INSERT_ID=2056553;
# at 9823898
#060626 15:10:04 server id 1101  end_log_pos 272        Query   thread_id=378925        exec_time=0     error_code=0
SET TIMESTAMP=1151327404;
SET @@session.time_zone='SYSTEM';
INSERT INTO advertisement (ID_PLACEMENT_AD, ID_RULE_SET, ID_ACTIVATION_LOG, RULE_NAME, DATE_CREATED, ID_EXTERNAL_TYPE) values (2249, 1, 1, 'Test for Lock', now(), 1);
# at 9824142
#060626 15:10:04 server id 1101  end_log_pos 431        Query   thread_id=378925        exec_time=0     error_code=0
SET TIMESTAMP=1151327404;
UPDATE advertisement SET date_deactivated=now() where id_advertisement = 2056551;
# at 9824301
#060626 15:10:05 server id 1101  end_log_pos 12031314   Xid = 5502260
COMMIT;
# at 9824328

Suggested fix:
If a part of a transaction fails on the slave, rollback the whole transaction and repeat the whole transaction. Do not leave parts of the transaction and repeat the whole transaction.

No workaround possible.

Sebastian, thanks for your report.

I read through logs and diagnosis could be violation of transaction atomicity unless some other, maybe slave side, connection concurrently inserted a record with
id_advertisement==056553 and date_created==2006-06-26 15:10:04

so that we see the second slave's complaint
   060626 15:10:07 [ERROR] Slave: Error 'Duplicate entry '2056553'

Can you make sure that this does not happen?
While you don't exclude this possibility it makes me me wondering
if reported reported claim is correct:

   The insert has been commited to the slave:

   root@localhost [dms_2_0]> select date_created from advertisement where
    
   id_advertisement = 2056553;
   +---------------------+
   | date_created        |
   +---------------------+
   | 2006-06-26 15:10:04 |

I am sure that noone and nothing else performed an INSERT with the same primary key. The bug was reproduced on a separate machine not included into any development- or production-enviroment.

My claim about the insert is exactly the problem. There is NO other entity doing that INSERT, that INSERT happens via replication as you can see from the relay.log. It happens in a transaction that is not properly rolled back and repeated in a LOCK (and maybe other) situation.

testcase for 4.1.15->5.0.18 for csc issue. Will test 5.0-bk next.

Attachment: testcase.sql (text/x-delimtext), 1.15 KiB.

Andrei,
See attached .sql file which has repeatable testcase.  Verified against 5.0.22 master --> 5.0.22 slave.

Shane, thanks for the script which I run to confess that
obviously slave indeed does not rollback replicated transaction.

Any news on this issue?

ETA upon the end of analysis by tomorrow.

Sebastian, a bit late note, sorry. i wonder if increasing innodb_lock_wait_timeout can help for you customer.

I am encountering the same problem on ubuntu.  We have two slaves of the same master,
one on ubuntu:
    ubuntu
    2.6.15-23-amd64 x86_64
    mysql  Ver 14.12 Distrib 5.0.22, for pc-linux-gnu (x86_64) using readline 5.1
and one on redhat:
    redhat
    2.6.9-42.0.2.ELsmp x86_64
    mysql  Ver 14.12 Distrib 5.0.24a, for unknown-linux-gnu (x86_64) using readline 5.0

On the redhat machine, we have nothing else running besides replication.
On the ubuntu machine, we are running some heavy duty reports that are locking some of the innodb tables into which replication is trying to insert.  One of those reports is failing frequently with the error:
        ERROR 1213 (40001) at line 826: Deadlock found when trying to get lock; try restarting transaction
Relication on the ubuntu machine is periodically failing with:
        Duplicate entry 'some number X' for key 1
When I go to investigate, I find that 'some number X' already exists in the table, and its attribute fields are identical to what replication is trying to insert.  When I use the:
        set sql_slave_skip_counter = 1;
technique, I hit the rest of the insert statements in that transaction, and they also each error out with the duplicate key error, and their data is also identical to what is already in the table.  It looks like the entire transaction is being replayed (although I did not verify that every single statement in the
transaction hit the error -- if one or two of them at the end of the transaction did not get the error, I would not have noticed).
I have not yet tied the replication errors to the same time as the deadlock errors because I have not been looking for that, but the deadlock errors are happening more frequently (6/day) than the replication errors (1/2 days)
The same transaction gets replicated to the redhat slave with no error.
There is no process on either slave which modifies data in the tables in question.

show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50    |
+--------------------------+-------+

To what value do you suggest changing this variable?

In order to eliminate these two errors, we are currently modifying the reports to hit MyISAM copies of the InnoDB tables.  A set of triggers on the InnoDB tables will insert the primary key of the changed rows in another InnoDB table, and a periodic process will read the list of changed primary keys and apply the changes to the MyISAM table.

Craig,

I just submitted a patch for this problem. It is in that slave sql thread waits over innodb_lock_wait_timeout, and after that the replicated transaction failed to replay.

You must have hit the same problem if your slave's error log containsthe line
Slave: Error 'Lock wait timeout exceeded; try restarting transaction' on query.

It would be safe for your data to raise the timeout value meanwhile the patch is under reviewing. The increment depends on your slave's local transactions duration and their rate.

I hope my patch will be approved soon.
Thanks for your report.

I increased the innodb_lock_wait_timeout to 100 and it hasn't had a noticeable impact.  I'm going to increase it to 300 with which I think I'll have better luck.

I had previously set innodb_support_xa to 0.  Could this be related?  Would it be worth setting this back to 1?

The patch appeared to go to closed bug#16228.

Graig, i don't think this issue, requiring a record on the exceeded timeout,
has any relation to (except from 14.5.4 `InnoDB' Startup Options and System Variables)

   * `innodb_support_xa'

     When set to `ON' or 1 (the default), this variable enables
     `InnoDB' support for two-phase commit in XA transactions. Enabling
     `innodb_support_xa' causes an extra disk flush for transaction
     preparation. If you don't care about using XA, you can disable
     this variable by setting it to `OFF' or 0 to reduce the number of
     disk flushes and get better `InnoDB' performance.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/13450

ChangeSet@1.2297, 2006-10-11 11:07:20+03:00, aelkin@dsl-hkigw8-feb9fb00-191.dhcp.inet.fi +1 -0
  bug#20697 - results in sync with the test

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14566

ChangeSet@1.2323, 2006-10-30 17:09:28+02:00, aelkin@dsl-hkigw8-feaef900-46.dhcp.inet.fi +2 -0
  BUG#20697 slave fails to rollback replicated transaction hang over innodb_lock_wait_timeou
  post-merge fixes in test/result

Pushed into 5.0.32.
Pushed into 5.1.14.

Noted in 5.0.32, 5.1.14 changelogs.

On slave servers, transactions that exceeded the lock wait
timeout failed to roll back properly.