Bug #38660 Lock wait timeout exceeded while failing-over on master-master replication setup
Submitted: 8 Aug 2008 8:50 Modified: 11 Dec 2008 16:19
Reporter: Bart Verwilst Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.51a, 5.1.26 OS:Linux
Assigned to: CPU Architecture:Any
Tags: failover, lock, replication

[8 Aug 2008 8:50] Bart Verwilst
Description:
We have a master-master replication setup which replicates several databases. All tables are InnoDB. The setup is ( supposed to be ) highly available because of the use of heartbeat with mon. If we run a batch of inserts/updates on the current master/active db, and then execute /usr/lib/heartbeat/hb_standby to make the VIP switch the the other mysql server, the following can be seen in a "SHOW INNODB STATUS\G":

---TRANSACTION 0 209205, ACTIVE 42 sec, process no 27910, OS thread id 2422524848 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 320, 1 row lock(s)
MySQL thread id 618, query id 47824 Updating
update Ispserviceorder set action='CreateTerminal', ispid=12, macaddr='00:00:00:00:00:2D', numberRetries=1, operatorid='sap', orderState='FAILED', orderdate='2008-08-08 08:03:15', orderid=12, processeddate=null, scheduleddate=null, sitid=45, subaction=null, terminalType='Sat3Play' where objectID=112
------- TRX HAS BEEN WAITING 42 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 1226 n bits 208 index `PRIMARY` of table `ccmt`.`ispserviceorder` trx id 0 209205 lock_mode X locks rec but not gap waiting
Record lock, heap no 138 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
 0: len 4; hex 80000070; asc    p;; 1: len 6; hex 000000033124; asc     1$;; 2: len 7; hex 00000004ef0a57; asc       W;; 3: len 14; hex 4372656174655465726d696e616c; asc CreateTerminal;; 4: len 4; hex 8000000c; asc     ;; 5: len 17; hex 30303a30303a30303a30303a30303a3244; asc 00:00:00:00:00:2D;; 6: len 4; hex 80000001; asc     ;; 7: len 3; hex 736170; asc sap;; 8: len 10; hex 50524f43455353494e47; asc PROCESSING;; 9: len 8; hex 800012436d6eb3bb; asc    Cmn  ;; 10: len 4; hex 8000000c; asc     ;; 11: SQL NULL; 12: SQL NULL; 13: len 4; hex 8000002d; asc    -;; 14: SQL NULL; 15: len 8; hex 53617433506c6179; asc Sat3Play;;

------------------
---TRANSACTION 0 209188, ACTIVE 254 sec, process no 27910, OS thread id 2379021232
22 lock struct(s), heap size 2496, 13 row lock(s), undo log entries 10
MySQL thread id 666, query id 47752 CCMT-A 172.16.32.1 ccmt
Trx read view will not see trx with id >= 0 209189, sees < 0 209189

After 10 iterations of the lock timeout, a "SHOW SLAVE STATUS\G" shows this:

Last_Errno: 1205
Last_Error: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'ccmt'. Query: 'update Ispserviceorder set action='CreateTerminal', ispid=12, macaddr='00:00:00:00:00:2D', numberRetries=1, operatorid='sap', orderState='FAILED', orderdate='2008-08-08 08:03:15', orderid=12, processeddate=null, scheduleddate=null, sitid=45, subaction=null, terminalType='Sat3Play' where objectID=112'
                 Skip_Counter: 0

It appears that a lock has gone stale? and forces replication to break. Restarting the mysql server resumes replication just fine afaics. Stopping the slave on the lock-infested server takes a long time:

mysql> stop slave;
Query OK, 0 rows affected (31.12 sec)

After which the lock messages disappear in the innodb status output. Starting the slave again brings the lock issues back as before.

We tried with 2 versions of the mysql server, 5.0.51a and 5.1.26.

Other things we tried is fiddling with these options in my.cnf:

innodb_table_locks
innodb_rollback_on_timeout
innodb_thread_concurrency

None of those did anything to alleviate the problem.

This problem is 100% reproducible here.

How to repeat:
- Setup master-master replication with heartbeat for HA
- Run a batch provision of inserts and updates
- During the provisioning, do a failover with /usr/lib/heartbeat/hb_standby
- Watch the mysql server choke on the lock

Suggested fix:
No idea :(
[11 Nov 2008 16:19] Susanne Ebrecht
Many thanks for writing a bug report.

Both versions are very old. Please try newer version MySQL 5.0.67 or 5.1.29.
[12 Dec 2008 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".