MySQL Bugs: #83000: Mysql 5.7.13 replication cannot keep up. 5.7.12 works fine.

Bug #83000	Mysql 5.7.13 replication cannot keep up. 5.7.12 works fine.
Submitted:	14 Sep 2016 19:31	Modified:	21 Dec 2016 11:20
Reporter:	Pete French	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Packaging	Severity:	S2 (Serious)
Version:	5.7.13 (and 5.7.15)	OS:	FreeBSD
Assigned to:		CPU Architecture:	Any

Description:
Upgrading our systems to 5.7.13 leads to a situation where slaves cannot keep up with replication even under moderate load. The seconds behind master keeps growing. It appears that it is thus the SQL thread which is causing the issue, not the Io thread. A 'show process list' doesn't how anything happening on the slave, even when it is several hours behind. We are replicating using GTIDs and the underlying filesystem is ZFS.

How to repeat:
Upgrade mastr/slave setup from 5.7.12 to 5.7.13

Suggested fix:
Rolling back to 5.7.12 fixes the issue, the lost hours are caught up in a matter of minutes.

Thank you for the bug report. Are you able to test the lastest 5.7.15 release presents the same issue than 5.7.13 on your environment?. Thanks.

I just tries this with 5.7.15 and it lags just the same - currently at 13 minutes behind and gettling slower and slower. Have updated the version accordingly

I can provide a machine exhibiting the problem for you to login and example if required.

I just rolled back to 5.7.12 - it went from being seventeen minutes behind to zero in about three to four seconds.

Is there any progress on this ? I can still easily reproduce the issue, and the offer of setting up a system for you to login to, with the problem in lace, is still open if required. I was going to try a trace to see if I can see any difference in the system calls that the two versions are making, would that help >

More tetsing on this - it is a FreeBSD specific problem. It occurs on both 10.3 and 11.0, and is independent of the underlying filesystem as I tried both ZFS and UFS.

I also tried Open Indiana (Solariis) with the same dataset and that works fine, hence I conclude its FreeBSD only.

I managed to isolate this behaviour on FreeBSD to a change in the mode of the installed example my.cnf file from 0640 to 0644. This makes no sense to me (it is not even the file I am using) but it is completely reproducible. FreeBSD bug is here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=214566

So can you show us a diff of the output of SHOW GLOBAL VARIABLES between
the two versions,  or just the contents of that my.cnf..

Then we should close this bug?

Woah! Loads fo différences! But how ? Thats not my my,cnf files ? Is it reading that one and then applying my one over the top ? The differences are below for the record...

diff vars.fast vars.slow
11c11
< |binlog_cache_size|32768|
---
> |binlog_cache_size|16777216|
62c62
< |explicit_defaults_for_timestamp|OFF|
---
> |explicit_defaults_for_timestamp|ON|
76c76
< 449a194c-4623-11e3-8c37-001b783c1faf:1-25009041|
---
> 449a194c-4623-11e3-8c37-001b783c1faf:1-25009046|
80,81c80,82
< |gtid_purged|3c6b6d8b-7065-11e6-95fb-001d9263ab24:1-5,
< 449a194c-4623-11e3-8c37-001b783c1faf:1-23324369|
---
> |gtid_purged|2672cfab-b568-11e6-bf72-d0bf9c45a37c:1-5,
> 3c6b6d8b-7065-11e6-95fb-001d9263ab24:1-5,
> 449a194c-4623-11e3-8c37-001b783c1faf:1-24640111|
92c93
< |have_symlink|YES|
---
> |have_symlink|DISABLED|
111c112
< |innodb_autoinc_lock_mode|1|
---
> |innodb_autoinc_lock_mode|2|
132,133c133,134
< |innodb_data_file_path|ibdata1:12M:autoextend|
< |innodb_data_home_dir||
---
> |innodb_data_file_path|ibdata1:128M:autoextend|
> |innodb_data_home_dir|/var/db/mysql|
189c190
< |innodb_open_files|2000|
---
> |innodb_open_files|16384|
199c200
< |innodb_read_io_threads|4|
---
> |innodb_read_io_threads|8|
220c221
< |innodb_temp_data_file_path|ibtmp1:12M:autoextend|
---
> |innodb_temp_data_file_path|ibtmp1:128M:autoextend|
230c231
< |innodb_write_io_threads|4|
---
> |innodb_write_io_threads|8|
257c258
< |log_output|FILE|
---
> |log_output|TABLE|
270c271
< |long_query_time|10.000000|
---
> |long_query_time|0.500000|
306c307
< |myisam_recover_options|OFF|
---
> |myisam_recover_options|BACKUP,FORCE|
314c315
< |net_retry_count|10|
---
> |net_retry_count|16384|
433c434
< |slave_load_tmpdir|/usr/home/mysql/tmp|
---
> |slave_load_tmpdir|/var/db/mysql_tmpdir|
446c447
< |slow_query_log|OFF|
---
> |slow_query_log|ON|
473c474
< |sync_relay_log|10000|
---
> |sync_relay_log|1|
476,477c477,478
< |table_definition_cache|1400|
< |table_open_cache|2000|
---
> |table_definition_cache|8192|
> |table_open_cache|16384|

Different config will lead to different performance. hence, not a bug here.
Not sure what's to blame, but large binlog_cache_size isn't doing you any favors if most trx are small.

http://dev.mysql.com/doc/refman/5.7/en/option-files.html

Indeed. I didn't realise it was reading multiple config files, that does explain everything. Time to overhaul where I keep files and look again at some of the options. Sorry for the noise.