Bug #67360 MySQL 5.6, GTID and ib_log_file resizing, and losing data
Submitted: 24 Oct 2012 14:37 Modified: 4 Dec 2012 18:11
Reporter: Simon Mudd (OCA)
Status: Can't repeat
Category: MySQL Server: Replication    Severity: S3 (Non-critical)
Version: 5.6.7-rc    OS: Linux (CentOS 6.3)
Assigned to:    CPU Architecture: Any
Tags: windmill

[24 Oct 2012 14:37] Simon Mudd
Description:
I built a master + 2 slaves using MySQL 5.6.7.
However, the ib_logfile size had by mistake been set too small and needed increasing from 2 x 512M to 16 x 1 GB for performance reasons.  Replication was working fine.

To change this, the only way I know of at the moment is to do: /etc/init.d/mysql stop; rm /path/to/datadir/ib_logfile*; /etc/init.d/mysql start
I did this on the boxes (roughly the steps sketched below) and replication seemed to be ok.
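For reference, a sketch of that procedure (the datadir path is a placeholder; the my.cnf values show my actual size change):

  /etc/init.d/mysql stop
  rm /path/to/datadir/ib_logfile*        # remove the old redo logs
  # in my.cnf:
  #   innodb_log_file_size      = 1G     # was 512M
  #   innodb_log_files_in_group = 16     # was 2
  /etc/init.d/mysql start                # server recreates the redo logs at the new size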

Of course this removes the GTID replication information, and I'm using the settings:

gtid_mode = ON
disable-gtid-unsafe-statements

How to repeat:
Later, however, I had a failure:

121023  9:15:10 [Note] Slave: connected to master 'repl_fav@fv204fav3mdb-01.example.com:3306',replication resumed in log 'binlog.000006' at position 8445026
121023 20:10:19 [ERROR] Slave SQL: Could not execute Update_rows event on table fav.My_Table; Can't find record in 'My_Table', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000009, end_log_pos 442755453, Error_code: 1032
121023 20:10:19 [Warning] Slave: Can't find record in 'My_Table' Error_code: 1032

I therefore think that somehow the GTID information got out of sync between master and slave and some "inserts" were silently ignored. Later the intended updates could not be executed as the rows were missing.  There appears to be no logging of this "issue".  The divergence could at least have been seen by comparing the executed GTID sets, as sketched below.
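A sketch of that check (host names are placeholders; GTID_SUBTRACT shows what the master has executed that the slave has not):

  mysql -h master-host -e "SELECT @@GLOBAL.gtid_executed"
  mysql -h slave-host  -e "SELECT @@GLOBAL.gtid_executed"
  # GTIDs executed on the master but missing on the slave:
  mysql -h slave-host  -e "SELECT GTID_SUBTRACT('<master gtid_executed>', @@GLOBAL.gtid_executed)"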

The end result has been a need for me to rebuild this master + slaves replication chain.

Suggested fix:
Questions:

1. How do I resize the ib_logfiles without losing the GTID data that's stored there, and so potentially avoid this issue?
   - This seems to justify the approach of moving the GTID information into an InnoDB table, like you've done with the replication position info.  Then the information cannot be lost.
2. How do I see if statements have been skipped because of the GTID checking?
   - Since you don't expect statements to be skipped like this (except perhaps on first connect to a master), please log something in the error log to indicate that something has happened, and ideally provide some sort of counter indicating how many statements have been "dropped".
3. The logging above gives information on the execution position on the master. That's not helpful if you want to check what failed on the slave.
   - Please add logging to also indicate the position in the local relay log where the failure occurred.
4. Previously, IIRC, the error gave the start position of the problem event; now it shows the end position. So
   how can I correctly identify the "RBR" statement or event that actually failed? (See the sketch after this list.)
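The only workaround I can see (a sketch; the file name and position are taken from the error above) is to decode the master's binlog up to the reported end_log_pos and look at the last event printed:

  mysqlbinlog --base64-output=decode-rows --verbose \
      --stop-position=442755453 binlog.000009 | tail -n 200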
[24 Oct 2012 16:20] Simon Mudd
The "build the slaves process", usually consists of build an empty master, stop mysql, rsync /path/to/datadir to the new server, remove auto.cnf (so it will get regenerated), then then check the binlog position of the master and do a CHANGE MASTER TO master_host = <master_server>, master_user = ...,  master_log_file = latest_binlog_file, master_log_pos = size of latest_binlog_file.  start slave.

That seems to work, or at least has worked fine for non-GTID setups. I may have done something wrong, but certainly SHOW SLAVE STATUS after doing this on both slaves showed the SQL and I/O threads up, with no errors and no replication delay.

So while the cause of this problem may have been user error, the fact that I managed to silently shoot myself in the foot is what concerns me.

Perhaps the resizing of the ib_logfiles was not related and something else was done incorrectly, but given that the log shows _no_ errors between the time replication was restarted (121023  9:15:10) and later in the evening (121023 20:10:19), I'm trying to figure out how this error suddenly happened, and it's currently not clear to me how to diagnose it and catch it before it becomes a real problem.
The same issue happened identically on both slaves.
[4 Dec 2012 18:11] Gillian Gunson
Simon has agreed to close this bug, as the original problem was discussed in an Oracle SR, and other bugs and feature requests were created from that. The missing slave data cannot be verified due to the lack of original evidence or a test case at this time.