Description:
Manual for MySQL 5.7 (https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_relay_log_re...) explains how the relay log recovery feature works in the same way as for 5.6:
"The recovery process creates a new relay log file, initializes the SQL thread position to this new relay log, and initializes the I/O thread to the SQL thread position. Reading of the relay log from the master then continues. In MySQL 5.7, this global variable is read-only; its value can be changed by starting the slave with the --relay-log-recovery option, which should be used following a crash on the replication slave to ensure that no possibly corrupted relay logs are processed, and must be used in order to guarantee a crash-proof slave."
In reality, MySQL 5.7.8 seems to scan all the relay logs collected in this case and doing something slow with them. At least time spent is proportional to the size of the logs and can be huge (this is for some 30G of relay logs collected):
2015-09-04T13:27:47.877883Z 0 [Note] Server socket created on IP: '::'.
2015-09-04T13:35:27.679160Z 0 [Warning] Recovery from master pos 579120932 and file mysql-bin.006245 for channel ''. Previous relay log pos and relay log file had been set to 579121095, /mysqldata/mysql_relay_logs/relay-bin.000033 respectively.
2015-09-04T13:35:27.690236Z 0 [Note] Event Scheduler: Loaded 0 events
More than 7 minutes were spent on this, and it seems that 5.6.x was able to do this faster, so we can speak about a regression here.
How to repeat:
Set up replication (using make_replication_sandbox script of MySQL sandbox, for example) with 5.6.25 and 5.7.8. Make sure that binlog_format is ROW on master. Make sure that slave (node1) has the following settings in my.cnf:
master_info_repository = TABLE
relay_log_info_repository = TABLE
relay_log_recovery = ON
master_verify_checksum = ON
so, it's a crash safe slave. Now, start master and slave ans stop SQL thread on slave:
stop slave sql_thread;
Now, connect to master and generate a log of events to the binary log, like this:
master [localhost] {msandbox} ((none)) > use test
Database changed
master [localhost] {msandbox} (test) > create table tbig(id int auto_increment primary key, data char(200));
Query OK, 0 rows affected (0.31 sec)
master [localhost] {msandbox} (test) > insert into tbig(data) values(repeat('a', 100));
Query OK, 1 row affected (0.06 sec)
master [localhost] {msandbox} (test) > insert into tbig(data) select repeat('b',200) from tbig;
Query OK, 1 row affected (0.07 sec)
Records: 1 Duplicates: 0 Warnings: 0
...
master [localhost] {msandbox} (test) > insert into tbig(data) select repeat('b',200) from tbig;
Query OK, 1048576 rows affected (54.27 sec)
Records: 1048576 Duplicates: 0 Warnings: 0
...
Check that notable size of relay logs is collected on slave:
master [localhost] {msandbox} (test) > \! ls -l ../node1/data
total 546704
-rw-r----- 1 openxs openxs 56 Sep 7 10:45 auto.cnf
-rw-r----- 1 openxs openxs 355 Sep 7 10:47 ib_buffer_pool
-rw-r----- 1 openxs openxs 12582912 Sep 7 10:53 ibdata1
-rw-r----- 1 openxs openxs 50331648 Sep 7 10:53 ib_logfile0
-rw-r----- 1 openxs openxs 50331648 Sep 7 10:46 ib_logfile1
...
-rw-r----- 1 openxs openxs 203 Sep 7 10:48 mysql-relay.000003
-rw-r----- 1 openxs openxs 433901115 Sep 7 10:53 mysql-relay.000004
-rw-r----- 1 openxs openxs 42 Sep 7 10:48 mysql-relay.index
-rw-r----- 1 openxs openxs 5 Sep 7 10:48 mysql_sandbox13254.pid
drwxr-x--- 2 openxs openxs 4096 Sep 7 10:45 performance_schema
drwxr-x--- 2 openxs openxs 12288 Sep 7 10:45 sys
drwxr-x--- 2 openxs openxs 4096 Sep 7 10:46 test
400M or so in my case for 1 million of rows inserted by the last INSERT ... SELECT. Now, restart slave and check error logs for these messages at startup:
2015-09-07T07:58:13.529178Z 0 [Note] Server socket created on IP: '127.0.0.1'.
2015-09-07T07:58:15.649109Z 0 [Warning] Recovery from master pos 5026 and file mysql-bin.000001 for channel ''. Previous relay log pos and relay log file had been set to 320, ./mysql-relay.000004 respectively.
...
2015-09-07T07:58:15.683334Z 1 [Note] Slave I/O thread for channel '': connected to master 'rsandbox@127.0.0.1:13253',replication started in log 'mysql-bin.000001' at position 5026
As you can see, we have 2+ seconds delay in this simple case. In 5.6.25 the same recovery process is almost instant (less than a second) for the same amount of redo logs collected:
2015-09-07 10:19:17 4296 [Note] Server socket created on IP: '127.0.0.1'.
2015-09-07 10:19:17 4296 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysql_sandbox21691-relay-bin' to avoid this problem.
2015-09-07 10:19:17 4296 [Warning] Recovery from master pos 5403 and file mysql-bin.000003.
2015-09-07 10:19:17 4296 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2015-09-07 10:19:17 4296 [Note] Slave I/O thread: connected to master 'rsandbox@127.0.0.1:21690',replication started in log 'mysql-bin.000003' at position 5403
Suggested fix:
Check if startup can be made faster in 5.7, so we do not hit this performance regression while upgrading slaves.
In the meantime, explain in details in the manual what really happens when relay_log_recovery = ON in MySQL 5.7 (comparing to 5.6), so that we can set our expectations properly.