Bug #101990 MySQL replica fails to recover after an unsafe shutdown
Submitted: 14 Dec 2020 21:40 Modified: 28 Dec 2020 11:47
Reporter: Sergiu Hlihor Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:8.0.22 OS:Ubuntu (20.04 LTS)
Assigned to: MySQL Verification Team CPU Architecture:x86 (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz)
Tags: crash, replica, replication, unsafe shutdown

[14 Dec 2020 21:40] Sergiu Hlihor
Description:
I have a master database with over 200K tables in one schema and the following configuration:
log-bin			= mysql-bin
binlog-format		= row
binlog_cache_size	= 65536
binlog_transaction_dependency_tracking  = WRITESET
binlog_transaction_compression  = ON
binlog-transaction-dependency-history-size      = 1000000
sync_binlog		= 1
binlog_expire_logs_seconds	= 2592000
max_binlog_size		= 1000M
gtid_mode		= ON
enforce_gtid_consistency	= ON
binlog-ignore-db	= information_schema
binlog-ignore-db	= performance_schema
binlog-ignore-db	= mysql
log-slave-updates	= 1

And slave: 
relay_log_info_repository	= TABLE
relay-log		= mysql-relay-bin
log-slave-updates	= 0
replicate-ignore-db	= information_schema
replicate-ignore-db	= performance_schema
replicate-ignore-db	= mysql
slave-parallel-type	= LOGICAL_CLOCK
slave-parallel-workers	= 24
slave_pending_jobs_size_max	= 4096M

The replica was catching up, and after about two days' worth of replication it had caught up. I then executed a restart cycle during which it froze; I had to kill it, and afterwards the database was never able to start again, having fallen victim to these bugs:
https://bugs.mysql.com/bug.php?id=100118
https://bugs.mysql.com/bug.php?id=101717

I have installed a version maintained by Percona which already contains the recommended patch; however, the database startup takes more than 24 hours, never finishes, and just crashes. Upon investigation with perf top, I found that it stays stuck in Gtid_set::add_gno_interval for 70-80% of the time.
It should be mentioned that about 4-5 billion rows had been added, in at least 50-100 million transactions.
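The perf top investigation described above can be reproduced roughly as follows (a hedged sketch, not the reporter's exact commands; the PID lookup and the 30-second sampling window are illustrative):

```shell
# Watch the hottest symbols in the running mysqld live; during the stuck
# startup, Gtid_set::add_gno_interval dominated the profile.
sudo perf top -p "$(pidof mysqld)"

# Alternatively, record a short sample and inspect it offline:
sudo perf record -g -p "$(pidof mysqld)" -- sleep 30
sudo perf report --sort symbol
```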

How to repeat:
Due to the large database size, and because the server was a victim of the previously mentioned bugs (both master and slave were running 8.0.21), I am not sure whether this is easy to reproduce. A suggested method would be:
 - Set up a master-replica pair using the settings above
 - Set up the master with 200K tables
 - Insert about 5 billion rows into the master, spread evenly over all 200K tables, in batch transactions of 10 rows each
 - Start the replica
 - Kill the replica after a large number of transactions (ideally in the middle of the test) and monitor the recovery
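The load-generation steps above can be sketched as a small SQL generator (a hypothetical script: the function name `gen_load` and the scaled-down table/row counts are illustrative, not from the report):

```shell
#!/bin/sh
# Hypothetical load-generation sketch for the repro steps above.
# Emits SQL on stdout; pipe it into the mysql client on the source server.
# Counts are scaled down for illustration -- the report uses 200K tables,
# ~5 billion rows, and 10-row transactions.

gen_load() {
    tables=$1          # number of tables (report: 200000)
    rows_per_table=$2  # rows per table (report: ~25000)
    batch=$3           # rows per transaction (report: 10)

    t=1
    while [ "$t" -le "$tables" ]; do
        echo "CREATE TABLE IF NOT EXISTS t${t} (id BIGINT AUTO_INCREMENT PRIMARY KEY, v INT);"
        t=$((t + 1))
    done

    r=0
    while [ "$r" -lt "$rows_per_table" ]; do
        t=1
        while [ "$t" -le "$tables" ]; do
            echo "BEGIN;"
            i=0
            while [ "$i" -lt "$batch" ]; do
                echo "INSERT INTO t${t} (v) VALUES ($((r + i)));"
                i=$((i + 1))
            done
            echo "COMMIT;"
            t=$((t + 1))
        done
        r=$((r + batch))
    done
}

# Example (scaled down): 3 tables, 20 rows each, 10-row transactions.
gen_load 3 20 10
```

Piping the output into `mysql` on the source while killing mysqld on the replica mid-stream would approximate the scenario, at whatever scale the hardware allows.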
[21 Dec 2020 19:53] MySQL Verification Team
Hi,

I cannot reproduce this. The Percona server that has the patches from .23 may or may not be OK; I cannot say, as it is not our product and I have no insight into how they fixed these bugs. You really need to set this up from scratch with our binaries once 8.0.23 is actually out.

>  - Set up a master-replica pair using the settings above
>  - Set up the master with 200K tables
>  - Insert about 5 billion rows into the master, spread evenly over all 200K tables, in batch transactions of 10 rows each
>  - Start the replica
>  - Kill the replica after a large number of transactions (ideally in the middle of the test) and monitor the recovery

Done, done, done, done, done -> multiple times; I did not manage to bring it to a state where it would not start again.

Now, you do understand that "killing the replica" can mean many things, and that there is always a possibility the replica dies in a way that corrupts the file system (e.g. you have faulty memory on your cache controller) and so cannot start. There is no way to prevent that, but a new replica should always be able to catch up. Doing a normal "kill replica" via kill -9, shutting down the VM where it runs, or stopping that VM did not manage to bring the replica to a state where it would not recover.

All best
Bogdan
[21 Dec 2020 20:08] Sergiu Hlihor
Some extra information: the master-replica pair was switched live to GTID-based replication a few days prior to the incident. Could this affect it?
[21 Dec 2020 20:12] MySQL Verification Team
Hi,

> The master-replica pair was switched live to GTID-based replication a few days prior to the incident. Could this affect it?

I need to check with my colleagues. Off the top of my head, no, but restarting the replica from scratch might raise some issues; I will have to check and get back to you.

I will also redo the test with this change at some point.

all best
Bogdan
[25 Dec 2020 8:50] MySQL Verification Team
Hi,

> The master-replica pair was switched live to GTID-based replication a few days prior to the incident. Could this affect it?

I retested with this as well, but had no luck reproducing.
[28 Dec 2020 11:47] Sergiu Hlihor
Is there any way to reset replication on the replica server and restart it without letting it go through the flow that ends in Gtid_set::add_gno_interval?

I have a large amount of data on that replica, and rebuilding it from the ground up is quite time consuming.
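The reset being asked about would presumably look something like the following sketch (hedged, not advice from the verification team; it assumes the replica's data is still consistent up to a known applied GTID set, and the host, user, password, and GTID range are placeholders):

```sql
-- Discard the replica's relay-log and GTID state, then re-point it.
STOP SLAVE;
RESET SLAVE ALL;   -- drop relay logs and replication configuration
RESET MASTER;      -- clear the replica's own binary log and gtid_executed
-- Mark the transactions already applied on this replica as purged:
SET GLOBAL gtid_purged = '<source_uuid>:1-<last_applied_gno>';
CHANGE MASTER TO
    MASTER_HOST = 'source-host',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_AUTO_POSITION = 1;
START SLAVE;
```

Whether this avoids the Gtid_set::add_gno_interval hot path on recovery is exactly the open question here; getting gtid_purged wrong would make the replica re-fetch or skip transactions.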