MySQL Bugs: #100723: slow MTS recovery in startup

Bug #100723	slow MTS recovery in startup
Submitted:	3 Sep 2020 7:30	Modified:	7 Sep 2020 9:03
Reporter:	Qilu Wei	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
MTS group recovery in startup after a crash is very slow when there are lots of relay log to read. 

There are 14301 error reports of ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO like below: 
2020-09-01T11:36:11.916382+08:00 0 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 7997726.

It took over 5 min to finish the part of MTS recovery (function mts_recovery_groups()).

If there is something wrong that causes the MTS recovery is not complete and events in relay log are not applied, like check point not in the relay log or the events cannot be applied, slave SQL thread is stopped. 
Execute "start slave;" now. 
The whole execution which includes the same function mts_recovery_groups() took only about 0.5s, printing the same 14301 error reports of ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO as startup.

The error logging has different behavior before and after startup finish. And this may causes the slow MTS recovery in startup.
Try to prove it:
1) comment out the code of error log of ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO in mts_recovery_groups().
2) compile.
3) replace the mysqld of the slave and start it.
The MTS group recovery in startup only took about 0.3s.
The error logging causes the slow MTS recovery in startup is proved.

PS.
I have encountered this situation twice. 
One time it took 10 minutes to finish MTS recovery successfully.
Another time it took 2 hours to fail to finish MTS recovery 
(64417 error reports of ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO, 
finish with "Error looking for file after ../relaylog/relay-bin.000003", 
VM, 
"start slave" took about 1 min).

How to repeat:
The purpose is to create a situation to scan enough relay log in startup MTS recovery.

slave configuration:
gtid_mode = on
enforce_gtid_consistency=on
master-info-repository=TABLE
relay-log-info-repository=TABLE
slave_parallel_workers = 32
slave-parallel-type = LOGICAL_CLOCK
slave_preserve_commit_order = 1
relay_log_recovery = 0

1. Create replication between master and slave.

2. Master: 1000 concurrences to do small dml transactions.

3. kill -9 the pid of slave

4. Because of the crash, there is some problem with relay log.
2020-09-01T11:36:44.356869+08:00 6 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Worker 1 failed executing transaction 'b4e9ffb6-d3a6-11ea-9ae0-801844e6dd8c:5166' at master log mysql-bin.000001, end_log_pos 9356762; Error executing row event: 'When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@SESSION.GTID_NEXT is 'b4e9ffb6-d3a6-11ea-9ae0-801844e6dd8c:5166'.', Error_code: MY-001837
2020-09-01T11:36:44.357037+08:00 5 [Warning] [MY-010584] [Repl] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756

check relay log:
# at 9356826
#200901 11:34:58 server id 1  end_log_pos 9356696 CRC32 0x048d90c2      Table_map: `test_sj`.`user_tradingwater_info_hash` mapped to number 105
# at 9356910
#200901 11:34:58 server id 1  end_log_pos 9356762 CRC32 0x1b76018b      Write_rows: table id 105 flags: STMT_END_F

5. not to resolve it.
Set @@SESSION.GTID_NEXT='b4e9ffb6-d3a6-11ea-9ae0-801844e6dd8c:9356696';

6. now you can start slave or restart the mysqld many times.
The mts_recovery_groups() is done but the events apply is always stopped.

2020-09-01T11:58:04.716843+08:00 45 [ERROR] [MY-010584] [Repl] Slave SQL for channel '': Error executing row event: '@@SESSION.GTID_NEXT cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.', Error_code: MY-001782
2020-09-01T11:58:04.716890+08:00 45 [Warning] [MY-010584] [Repl] Slave: @@SESSION.GTID_NEXT cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON. Error_code: MY-001782
2020-09-01T11:58:04.724727+08:00 45 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 9356696

Suggested fix:
Error logging in startup needs buffered logging (use function log_sink_buffer() in log_line_submit()). Please check why it takes so much time and whether there is something to improve.

Hi,

> slow when there are lots of relay log to read.

I'm not sure I understand what you consider to be a bug here.
Yes, it is slow when it needs to read huge log. We are working on making everything faster and more optimized but I don't see how is this a bug?

thanks

ERROR LOGGING causes the slow startup, NOT READING LOG.

Please pay attention to the difference of the situation of startup and after-startup.
It is NOT slow to read huge log because the "start slave" reading the SAME huge log took very little time. 
And I have PROVED that the huge ERROR LOGGING in mts recovery made it slow by changing the code and restarting it:
The code I changed is:
(Line 6083, rpl_slave.cc)
//          LogErr(INFORMATION_LEVEL, ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO,
//                 rli->get_group_master_log_name(), ev->common_header->log_pos);
Without this part of error logging and reading the same huge log, the mts recovery of startup became as fast as "start slave".

And please be aware, error logging in startup and after-startup have different procedure.

Amend previous comment：
Without this part of error logging and reading the same huge log, the mts recovery of startup became as fast as "start slave".

to:
Reading the same huge log when there is no huge error logging, the mts recovery in startup became as fast as "start slave".

I've once found error logging to be very slow,  but it was on Windows,  can you take a look if this is similar ?

https://bugs.mysql.com/bug.php?id=91487

The affect of Bug #91487 is similar to this case but the cause is not the same.
Bug #91487 is the result of writing 1 byte at a time to disk.
As you proved in Bug #91487, on Linux it makes a single write() per error log line.

Using strace, I found there are much time used in futex in startup.
The following is strace in startup. There are futex between the writes and the time interval between two write is about 0.02s:
23545 10:27:32.287594 write(2, "2020-09-07T10:27:32.262444+08:00 0 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9487925.\n", 172) = 172 <0.000030>
23590 10:27:32.289766 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.010084>
23590 10:27:32.289839 futex(0x16d72c00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000039>
23590 10:27:32.289943 futex(0x16d72c2c, FUTEX_WAIT_BITSET_PRIVATE, 20135, {3775589, 478319277}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.010123>
23590 10:27:32.300140 futex(0x16d72c00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000020>
23590 10:27:32.300222 futex(0x16d72c2c, FUTEX_WAIT_BITSET_PRIVATE, 20137, {3775589, 488595702}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.010113>
23590 10:27:32.310422 futex(0x16d72c00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000020>
23590 10:27:32.310503 futex(0x16d72c2c, FUTEX_WAIT_BITSET_PRIVATE, 20139, {3775589, 498878847}, ffffffff <unfinished ...>
23545 10:27:32.315253 write(2, "2020-09-07T10:27:32.287756+08:00 0 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9488754.\n", 172) = 172 <0.000029>
23590 10:27:32.320652 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.010124>
23590 10:27:32.320743 futex(0x16d72c00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000023>
23590 10:27:32.320816 futex(0x16d72c2c, FUTEX_WAIT_BITSET_PRIVATE, 20141, {3775589, 509190687}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.010082>
23590 10:27:32.330971 futex(0x16d72c00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000020>
23590 10:27:32.331052 futex(0x16d72c2c, FUTEX_WAIT_BITSET_PRIVATE, 20143, {3775589, 519426527}, ffffffff <unfinished ...>
23545 10:27:32.340663 write(2, "2020-09-07T10:27:32.315383+08:00 0 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9489583.\n", 172) = 172 <0.000040>
23590 10:27:32.341161 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.010084>

The following is strace in "start slave". There is no futex between the writes and the time interval between two write is about 0.0001s:
23858 10:34:14.188853 write(2, "2020-09-07T10:34:14.188823+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9356762.\n", 173) = 173 <0.000028>
23858 10:34:14.188945 write(2, "2020-09-07T10:34:14.188922+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9356846.\n", 173) = 173 <0.000022>
23858 10:34:14.189020 write(2, "2020-09-07T10:34:14.188999+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9356912.\n", 173) = 173 <0.000022>
23858 10:34:14.189093 write(2, "2020-09-07T10:34:14.189072+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9356943.\n", 173) = 173 <0.000022>
23858 10:34:14.189200 write(2, "2020-09-07T10:34:14.189178+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9357772.\n", 173) = 173 <0.000023>
23858 10:34:14.189291 write(2, "2020-09-07T10:34:14.189269+08:00 42 [Note] [MY-010574] [Repl] Slave: MTS group recovery relay log info group_master_log_name mysql-bin.000001, event_master_log_pos 9358601.\n", 173) = 173 <0.000024>

Before startup, error logging using log_sink_buffer() takes more locks than after-startup. It is probably the reason why the startup is slow when huge amounts of log lines getting written to the error log.

Hi,

I still believe this is not a bug but since I can reproduce that 

> The code I changed is:
> (Line 6083, rpl_slave.cc)
> //          LogErr(INFORMATION_LEVEL, ER_RPL_MTS_GROUP_RECOVERY_RELAY_LOG_INFO,
> //                 rli->get_group_master_log_name(), ev->common_header->log_pos);

does take a lot of time I'm verifying the bug but I'm dropping severity down to S5-Performance

thanks for the report and the analysis
Bogdan