MySQL Bugs: #78389: 5.6.24: Lost data during master restart if partial transact has been downloaded

Bug #78389	5.6.24: Lost data during master restart if partial transact has been downloaded
Submitted:	10 Sep 2015 8:38	Modified:	14 Sep 2015 15:42
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6.24,5.6.25,5.6.26	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	lost data

Description:
I do not have a 100% reproducible test case I can provide yet. I'm working on providing that later.

For now I'll give a description of the problem and results seen.

I've detected a problem where I insert a large transaction into the database and then while it's being downloaded by the slaves I shutdown the master (cleanly).  That is the slave downloads part of the transaction to its relaylogs. I then restart the master. The slaves eventually reconnect and continue replicating.

Result on boxes where I've seen this problem is that the master has a count(*) which is higher than the resulting count(*) on the slaves (after replication has caught up).  Inspection of the table contents indicate that the FIRST part of the large transaction is kept and the second part which hadn't been downloaded is lost.  Subsequent inserts of data into the master are inserted into the slave and arrive safely.

How to repeat:
Follow the procedure above.

Results I have tested show:
5.6.16: PASS
5.6.21: PASS
5.6.23: PASS
5.6.24: FAIL
5.6.26: FAIL

I have not tested 5.7.

master configuration  (I'm using MySQL sandbox) master show:

[mysqld]
user               = myuser
port               = 21489
socket             = /tmp/mysql_sandbox21489.sock
basedir            = /home/myuser/Downloads/5.6.24
datadir            = /home/myuser/sandboxes/test/master/data
tmpdir             = /home/myuser/sandboxes/test/master/tmp
lower_case_table_names = 0
pid-file           = /home/myuser/sandboxes/test/master/data/mysql_sandbox21489.pid
bind-address       = 127.0.0.1
relay-log-index=mysql-relay
relay-log=mysql-relay
log-bin=mysql-bin
server-id=1
log-error=msandbox.err

and one of the sandbox slaves show:

[mysqld]
user               = myuser
port               = 21490
socket             = /tmp/mysql_sandbox21490.sock
basedir            = /home/myuser/Downloads/5.6.24
datadir            = /home/myuser/sandboxes/test/node1/data
tmpdir             = /home/myuser/sandboxes/test/node1/tmp
lower_case_table_names = 0
pid-file           = /home/myuser/sandboxes/test/node1/data/mysql_sandbox21490.pid
bind-address       = 127.0.0.1
server-id=101
report-host=SBslave1
report-port=21490
relay-log-index=mysql-relay
relay-log=mysql-relay
log-bin=mysql-bin
log-error=msandbox.err

I'm not sure if this provides enough information for you to reproduce. hopefully it does.

What's critical is that the master is stopped while the "large transaction" is being downloaded. That's easy to verify after the master has stopped by looking at SHOW SLAVE STATUS output.

Suggested fix:
Do not lose data.
Even if a big transaction is being downloaded (binlog events) then the slave should collect all the data and insert it on the slave the same way.

To clarify the partial transaction does not need to be that large. You just need to stop the master while the slave is downloading some of the events for the transaction. This process is usually quite quick which is why it's sometimes hard to reproduce. I've managed to do this with a 60MB transaction (containing 1000 rows) which is tiny. It's just a matter of getting the timing right.

Visual inspection if replicating to various slaves shows that the amount of data "kept" varies as the slaves do not normally stop replicating at exactly the same point in time.

I have also not yet had time to replicate between different minor versions to try and confirm if the issue is due to the slave or the master. However, some other testing I have done seems to indicate the problem is on the master.

Thanks for the bug report.   On the 5.6.26 slave, after reconnect to master error log contains:

[Note] Slave SQL: Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is that the master died while writing the transaction to its binary log, thus rolled back too. Error_code: 0

5.6.24 does not show the message you indicate. Perhaps the message in 5.6.26 is new? Either way I see:

2015-09-10 09:53:55 6155 [Note] Slave: received end packet from server due to dump thread being killed on master. Dump threads are killed for example during master shutdown, explicitly by a user, or when the master receives a binlog send request from a duplicate server UUID <00481559-5791-11e5-8888-e25ec4bd5160> : Error 
2015-09-10 09:53:55 6155 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000005' at position 4  <====
2015-09-10 09:53:55 6155 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2015-09-10 09:53:55 6155 [ERROR] Slave I/O: error reconnecting to master 'rsandbox@127.0.0.1:21489' - retry-time: 60  retries: 1, Error_code: 2003
2015-09-10 09:54:55 6155 [Note] Slave: connected to master 'rsandbox@127.0.0.1:21489',replication resumed in log 'mysql-bin.000005' at position 4

So at least on 5.6.24 there appears to be no clue as to the fact that something bad has happened. The logging you show at least gives a clue to a problem.

The line indicated with a <<==== looks odd as the slave has decided to jump to the next binlog file and that seems wrong. I was also doing a simple bulk insert and looking at the auto-increment id column I see:

slave1 [localhost] {msandbox} ((none)) > select id from test.big_trx;
+------+
| id   |
+------+
|    1 |
|    2 |
|    3 |
...
|   86 |
|   87 |
| 1002 |
+------+
88 rows in set (0.00 sec)

So you clearly see the first single row, plus the first few rows get inserted (of the 1,000 row insert transaction) and finally a last "post insert single row" to give a reference after the 1,000 row insert.  To slow things down I was inserting a single id column + a few fat varchar(255) columns (not shown).

Anyway I look forward to feedback once the cause of the problem has been found.

In response to the message you posted:

"[Note] Slave SQL: Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is that the master died while writing the transaction to its binary log, thus rolled back too. Error_code: 0"

I did check in my environment and the binlog on the master was complete. The shutdown on the master was started after the commit had been acknowledged and as stated this was a clean shutdown.

Documented fix in the 5.6.27 and 5.7.9 changelogs as follows:

    When the dump thread was killed while dumping an inactive binary
    log, some events in this log could be skipped and thus not
    replicated.

Closed.

See also BUG#74607.