Bug #78389 | 5.6.24: Lost data during master restart if partial transact has been downloaded | ||
---|---|---|---|
Submitted: | 10 Sep 2015 8:38 | Modified: | 14 Sep 2015 15:42 |
Reporter: | Simon Mudd (OCA) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.6.24,5.6.25,5.6.26 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | lost data |
[10 Sep 2015 8:38]
Simon Mudd
[10 Sep 2015 8:47]
Simon Mudd
To clarify the partial transaction does not need to be that large. You just need to stop the master while the slave is downloading some of the events for the transaction. This process is usually quite quick which is why it's sometimes hard to reproduce. I've managed to do this with a 60MB transaction (containing 1000 rows) which is tiny. It's just a matter of getting the timing right.
[10 Sep 2015 8:49]
Simon Mudd
Visual inspection if replicating to various slaves shows that the amount of data "kept" varies as the slaves do not normally stop replicating at exactly the same point in time.
[10 Sep 2015 9:27]
Simon Mudd
I have also not yet had time to replicate between different minor versions to try and confirm if the issue is due to the slave or the master. However, some other testing I have done seems to indicate the problem is on the master.
[10 Sep 2015 10:18]
MySQL Verification Team
Thanks for the bug report. On the 5.6.26 slave, after reconnect to master error log contains: [Note] Slave SQL: Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is that the master died while writing the transaction to its binary log, thus rolled back too. Error_code: 0
[10 Sep 2015 14:33]
Simon Mudd
5.6.24 does not show the message you indicate. Perhaps the message in 5.6.26 is new? Either way I see: 2015-09-10 09:53:55 6155 [Note] Slave: received end packet from server due to dump thread being killed on master. Dump threads are killed for example during master shutdown, explicitly by a user, or when the master receives a binlog send request from a duplicate server UUID <00481559-5791-11e5-8888-e25ec4bd5160> : Error 2015-09-10 09:53:55 6155 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000005' at position 4 <==== 2015-09-10 09:53:55 6155 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information. 2015-09-10 09:53:55 6155 [ERROR] Slave I/O: error reconnecting to master 'rsandbox@127.0.0.1:21489' - retry-time: 60 retries: 1, Error_code: 2003 2015-09-10 09:54:55 6155 [Note] Slave: connected to master 'rsandbox@127.0.0.1:21489',replication resumed in log 'mysql-bin.000005' at position 4 So at least on 5.6.24 there appears to be no clue as to the fact that something bad has happened. The logging you show at least gives a clue to a problem. The line indicated with a <<==== looks odd as the slave has decided to jump to the next binlog file and that seems wrong. I was also doing a simple bulk insert and looking at the auto-increment id column I see: slave1 [localhost] {msandbox} ((none)) > select id from test.big_trx; +------+ | id | +------+ | 1 | | 2 | | 3 | ... | 86 | | 87 | | 1002 | +------+ 88 rows in set (0.00 sec) So you clearly see the first single row, plus the first few rows get inserted (of the 1,000 row insert transaction) and finally a last "post insert single row" to give a reference after the 1,000 row insert. To slow things down I was inserting a single id column + a few fat varchar(255) columns (not shown). Anyway I look forward to feedback once the cause of the problem has been found.
[10 Sep 2015 14:35]
Simon Mudd
In response to the message you posted: "[Note] Slave SQL: Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is that the master died while writing the transaction to its binary log, thus rolled back too. Error_code: 0" I did check in my environment and the binlog on the master was complete. The shutdown on the master was started after the commit had been acknowledged and as stated this was a clean shutdown.
[14 Sep 2015 15:42]
Jon Stephens
Documented fix in the 5.6.27 and 5.7.9 changelogs as follows: When the dump thread was killed while dumping an inactive binary log, some events in this log could be skipped and thus not replicated. Closed.
[14 Sep 2015 15:43]
Jon Stephens
See also BUG#74607.