Description:
Hi,
I'm experimenting with a new infrastructure for our MySQL topology: one master with multiple slaves. The master is made redundant with HAST/DRBD in synchronous mode. Here is the topology:
M1 | M2 (standby)
|
============= Replication (GTID)
| | |
S1 S2 S3
When the master crashes (I unplug its power supply), M2 takes over the IP address (via heartbeat), mounts the HAST volume (after a short fsck), and then starts the MySQL service. On M2 the database is fully consistent; every record is there.
On the slaves, the records were replicated; after 60 seconds they reconnect to the master (now M2) and resume replication.
The problem is simple: the slaves resume replication and catch up to the last GTID, but some records are missing. My test is simple: I insert 100k records into the table
CREATE TABLE `inctest6` (
`id` int(100) NOT NULL AUTO_INCREMENT,
`toto` varchar(2) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
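For reference, the injection can be reproduced with a simple stored-procedure loop (the procedure name is mine, not part of the original test; each INSERT runs as its own autocommitted transaction, which matches the workload that was interrupted by the crash):

```sql
-- Hypothetical reproduction of the 100k-row injection.
DELIMITER //
CREATE PROCEDURE inject100k()
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < 100000 DO
    -- one autocommitted transaction per row, as in the test
    INSERT INTO inctest6 (toto) VALUES ('aa');
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;
CALL inject100k();
```

The master is crashed while this loop is still running.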
When I run SELECT MAX(id), COUNT(id) FROM inctest6, I get this on the master:
+---------+-----------+
| max(id) | count(id) |
+---------+-----------+
|    2828 |      2828 |
+---------+-----------+
and on the slaves (every slave has this state):
+---------+-----------+
| max(id) | count(id) |
+---------+-----------+
|    2828 |      2714 |
+---------+-----------+
So 114 records (2828 - 2714) are missing, even though the last records were replicated. On another run I ended up with more records on a slave than on the master.
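To see exactly which rows were lost, a self-join gap check can be run on a slave. This is my own diagnostic query, not part of the original test; it assumes the master's ids are contiguous (AUTO_INCREMENT with no deletes, as here), so any gap in a slave's id sequence is a lost row:

```sql
-- Report the first id of each gap in the slave's id sequence.
SELECT t1.id + 1 AS gap_starts_at
FROM   inctest6 t1
LEFT JOIN inctest6 t2 ON t2.id = t1.id + 1
WHERE  t2.id IS NULL
  AND  t1.id < (SELECT MAX(id) FROM inctest6);
```

The GTIDs of the missing rows can then be located in the master's binary log with mysqlbinlog to see whether they were ever written there.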
Here is my.cnf on the masters:
[mysqld]
port = 3306
socket = /tmp/mysql.sock
skip-external-locking
key_buffer_size = 16K
max_allowed_packet = 1M
table_open_cache = 4
sort_buffer_size = 64K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
net_buffer_length = 2K
thread_stack = 128K
server-id = 1
log-bin=mysql-bin
slave-net-timeout=5
binlog_format=mixed
gtid_mode = ON
log-slave-updates = ON
enforce-gtid-consistency = true
sync_master_info = 1
master_info_repository=TABLE
relay_log_info_repository=TABLE
innodb_flush_log_at_trx_commit = 1
And on the slaves:
[mysqld]
port = 3306
socket = /tmp/mysql.sock
skip-external-locking
key_buffer_size = 16K
max_allowed_packet = 1M
table_open_cache = 4
sort_buffer_size = 64K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
net_buffer_length = 2K
thread_stack = 128K
server-id = 3
log-bin=mysql-bin
slave-net-timeout=5
gtid_mode = ON
log-slave-updates = ON
enforce-gtid-consistency = true
sync_master_info=1
master_info_repository=TABLE
sync_relay_log_info=1
relay_log_info_repository=TABLE
replicate_ignore_db=mysql
On the HAST volumes, we have these mount options:
/dev/hast/dbtest on /mnt/dbtest (ufs, local, noatime, synchronous, soft-updates)
HAST is in memsync mode (writes are acknowledged once in the remote node's memory; buffers are fsynced).
Please note that the bug is also present without HAST replication. My slave test servers are very slow (old machines used for testing), but replication should still work correctly on them in this case.
Am I doing something wrong, or is this a bug?
Thanks in advance
How to repeat:
Use the configuration above and crash the master while a massive insert injection is running.
Then compare the results of the test query on the master and the slaves.