MySQL Bugs: #78352: Slow startup of 5.7.x slave with relay_log

Bug #78352	Slow startup of 5.7.x slave with relay_log_recovery = ON and many relay logs
Submitted:	7 Sep 2015 8:26	Modified:	29 Mar 2016 19:26
Reporter:	Valeriy Kravchuk	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.8	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	missing manual, regression, relay_log_recovery, replication

Description:
Manual for MySQL 5.7 (https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_relay_log_re...) explains how the relay log recovery feature works in the same way as for 5.6:

"The recovery process creates a new relay log file, initializes the SQL thread position to this new relay log, and initializes the I/O thread to the SQL thread position. Reading of the relay log from the master then continues. In MySQL 5.7, this global variable is read-only; its value can be changed by starting the slave with the --relay-log-recovery option, which should be used following a crash on the replication slave to ensure that no possibly corrupted relay logs are processed, and must be used in order to guarantee a crash-proof slave."

In reality, MySQL 5.7.8 seems to scan all the relay logs collected in this case and doing something slow with them. At least time spent is proportional to the size of the logs and can be huge (this is for some 30G of relay logs collected):

2015-09-04T13:27:47.877883Z 0 [Note] Server socket created on IP: '::'.
2015-09-04T13:35:27.679160Z 0 [Warning] Recovery from master pos 579120932 and file mysql-bin.006245 for channel ''. Previous relay log pos and relay log file had been set to 579121095, /mysqldata/mysql_relay_logs/relay-bin.000033 respectively.
2015-09-04T13:35:27.690236Z 0 [Note] Event Scheduler: Loaded 0 events

More than 7 minutes were spent on this, and it seems that 5.6.x was able to do this faster, so we can speak about a regression here.

How to repeat:
Set up replication (using make_replication_sandbox script of MySQL sandbox, for example) with 5.6.25 and 5.7.8. Make sure that binlog_format is ROW on master. Make sure that slave (node1) has the following settings in my.cnf:

master_info_repository              = TABLE
relay_log_info_repository           = TABLE
relay_log_recovery                  = ON
master_verify_checksum              = ON

so, it's a crash safe slave. Now, start master and slave ans stop SQL thread on slave:

stop slave sql_thread;

Now, connect to master and generate a log of events to the binary log, like this:

master [localhost] {msandbox} ((none)) > use test
Database changed
master [localhost] {msandbox} (test) > create table tbig(id int auto_increment primary key, data char(200));
Query OK, 0 rows affected (0.31 sec)

master [localhost] {msandbox} (test) > insert into tbig(data) values(repeat('a', 100));
Query OK, 1 row affected (0.06 sec)

master [localhost] {msandbox} (test) > insert into tbig(data) select repeat('b',200) from tbig;
Query OK, 1 row affected (0.07 sec)
Records: 1  Duplicates: 0  Warnings: 0

...

master [localhost] {msandbox} (test) > insert into tbig(data) select repeat('b',200) from tbig;
Query OK, 1048576 rows affected (54.27 sec)
Records: 1048576  Duplicates: 0  Warnings: 0

...

Check that notable size of relay logs is collected on slave:

master [localhost] {msandbox} (test) > \! ls -l ../node1/data
total 546704
-rw-r----- 1 openxs openxs        56 Sep  7 10:45 auto.cnf
-rw-r----- 1 openxs openxs       355 Sep  7 10:47 ib_buffer_pool
-rw-r----- 1 openxs openxs  12582912 Sep  7 10:53 ibdata1
-rw-r----- 1 openxs openxs  50331648 Sep  7 10:53 ib_logfile0
-rw-r----- 1 openxs openxs  50331648 Sep  7 10:46 ib_logfile1
...
-rw-r----- 1 openxs openxs       203 Sep  7 10:48 mysql-relay.000003
-rw-r----- 1 openxs openxs 433901115 Sep  7 10:53 mysql-relay.000004
-rw-r----- 1 openxs openxs        42 Sep  7 10:48 mysql-relay.index
-rw-r----- 1 openxs openxs         5 Sep  7 10:48 mysql_sandbox13254.pid
drwxr-x--- 2 openxs openxs      4096 Sep  7 10:45 performance_schema
drwxr-x--- 2 openxs openxs     12288 Sep  7 10:45 sys
drwxr-x--- 2 openxs openxs      4096 Sep  7 10:46 test

400M or so in my case for 1 million of rows inserted by the last INSERT ... SELECT. Now, restart slave and check error logs for these messages at startup:

2015-09-07T07:58:13.529178Z 0 [Note] Server socket created on IP: '127.0.0.1'.
2015-09-07T07:58:15.649109Z 0 [Warning] Recovery from master pos 5026 and file mysql-bin.000001 for channel ''. Previous relay log pos and relay log file had been set to 320, ./mysql-relay.000004 respectively.
...
2015-09-07T07:58:15.683334Z 1 [Note] Slave I/O thread for channel '': connected to master 'rsandbox@127.0.0.1:13253',replication started in log 'mysql-bin.000001' at position 5026

As you can see, we have 2+ seconds delay in this simple case. In 5.6.25 the same recovery process is almost instant (less than a second) for the same amount of redo logs collected:

2015-09-07 10:19:17 4296 [Note] Server socket created on IP: '127.0.0.1'.
2015-09-07 10:19:17 4296 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysql_sandbox21691-relay-bin' to avoid this problem.
2015-09-07 10:19:17 4296 [Warning] Recovery from master pos 5403 and file mysql-bin.000003.
2015-09-07 10:19:17 4296 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2015-09-07 10:19:17 4296 [Note] Slave I/O thread: connected to master 'rsandbox@127.0.0.1:21690',replication started in log 'mysql-bin.000003' at position 5403

Suggested fix:
Check if startup can be made faster in 5.7, so we do not hit this performance regression while upgrading slaves. 

In the meantime, explain in details in the manual what really happens when relay_log_recovery = ON in MySQL 5.7 (comparing to 5.6), so that we can set our expectations properly.

Hello Valeriy,

Thank you for the report.
Observed this latency in 5.7.8 slave instance.

Thanks,
Umesh

test results

Attachment: 78352.results (application/octet-stream, text), 11.69 KiB.

I had exactly same problem on 5.7.9.

Added the following entry to the MySQL 5.7.12 and 5.8.0 changelog:

'When relay_log-recovery was set to “1,” a slave server, at its initialization, still tried to scan the relay log files in order to update Retrieved_Gtid_Set and the transaction parser state, which was an unnecessary waste of resources because the slave I/O thread would just be initialized to the SQL thread position in the new relay log file. With this fix, the slave server skips scanning the relay log files when relay_log-recovery=1.'

Posted by developer: