MySQL Bugs: #77826: Transaction applied partially on a slave with relay_log_info

Bug #77826	Transaction applied partially on a slave with relay_log_info_repository=table
Submitted:	24 Jul 2015 8:19	Modified:	28 Jul 2015 12:25
Reporter:	Przemyslaw Malkowski	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.6.25, 5.6.26	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	relay_log_info_repository

Description:
A transaction may be executed partially(!) on a slave running with relay_log_info_repository=TABLE, when a slave SQL thread is blocked for more then innodb_lock_wait_timeout. Instead of re-trying just a timeouted statement it re-tries the whole transaction. As by default the innodb_rollback_on_timeout is OFF, this leads to data inconsistency. Happens for both statement and ROW format.

The workaround for that is switching innodb_rollback_on_timeout to ON, or using relay_log_info_repository=FILE.
Tested on: 5.6.25-log MySQL Community Server (GPL)

How to repeat:
In slave's my.cnf, add:
relay_log_info_repository           = TABLE

--master:
use test
create table t1 (id int auto_increment primary key, a varchar(50));
create table t2 (id int auto_increment primary key, a varchar(50));
insert into t1 values (null,"aa"),(null,"bb"),(null,"cc");
insert into t2 values (null,"AA"),(null,"BB"),(null,"CC");

-- slave:
begin; select * from t2 for update;

-- master:
begin;
insert into t1 values (10,"dd");
insert into t2 values (10,"DD");
commit;

As a result, the slave ends up with error:
2015-07-24 09:57:53 14775 [Warning] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'test'. Query: 'insert into t2 values (10,"DD")', Error_code: 1205
2015-07-24 09:57:53 14775 [ERROR] Slave SQL: Error 'Duplicate entry '10' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into t1 values (10,"dd")', Error_code: 1062
2015-07-24 09:57:53 14775 [Warning] Slave: Duplicate entry '10' for key 'PRIMARY' Error_code: 1062
2015-07-24 09:57:53 14775 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000002' position 946

And inconsistent data comparing to the master:
--master:
master [localhost] {msandbox} (test) > select * from t1; select * from t2;
+----+------+
| id | a    |
+----+------+
|  1 | aa   |
|  2 | bb   |
|  3 | cc   |
| 10 | dd   |
+----+------+
4 rows in set (0.00 sec)

+----+------+
| id | a    |
+----+------+
|  1 | AA   |
|  2 | BB   |
|  3 | CC   |
| 10 | DD   |
+----+------+
4 rows in set (0.00 sec)

--slave:
slave1 [localhost] {msandbox} (test) > select * from t1; select * from t2;
+----+------+
| id | a    |
+----+------+
|  1 | aa   |
|  2 | bb   |
|  3 | cc   |
| 10 | dd   |
+----+------+
4 rows in set (0.00 sec)

+----+------+
| id | a    |
+----+------+
|  1 | AA   |
|  2 | BB   |
|  3 | CC   |
+----+------+
3 rows in set (0.00 sec)

Suggested fix:
Fix the replication to be ACID compliant also when the relay_log_info_repository=TABLE

Hello Przemyslaw,

Thank you for the report and test case.
Observed this with 5.6.26 GA builds.

Thanks,
Umesh

5.6.26 test result

Attachment: 77826.5_6_26.results (application/octet-stream, text), 14.41 KiB.

Can we have innodb_rollback_on_timeout enabled by default in 5.6.28?