Bug #101990 | MySQL replica fails to recover after an unsafe shutdown | ||
---|---|---|---|
Submitted: | 14 Dec 2020 21:40 | Modified: | 28 Dec 2020 11:47 |
Reporter: | Sergiu Hlihor | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 8.0.22 | OS: | Ubuntu (20.04 LTS) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) |
Tags: | crash, replica, replication, unsafe shutdown |
[14 Dec 2020 21:40]
Sergiu Hlihor
[21 Dec 2020 19:53]
MySQL Verification Team
Hi,

I cannot reproduce this. The Percona server that has the patches from .23 may or may not be OK; I cannot say, as it is not our product and I have no clue how they fixed them. You really need to set this up from scratch with our binaries once 8.0.23 is actually out.

> - Setup master replica using the settings I have above
> - Setup master with 200K tables
> - Insert into master about 5 billion rows spread evenly over all 200K tables in batch transactions containing 10 rows per transaction
> - Start replica
> - Kill replica after a large number of transaction, recommended in the middle of the test and monitor the recovery.

Done, done, done, done, done, multiple times. I did not manage to bring it to a state where it would not start again.

Now, you do understand that "killing replica" can mean many things, and that there is always a possibility the replica will die in a way that corrupts the file system (e.g. you have faulty memory on your cache controller), which will not allow the replica to start. There is no way to prevent this. A new replica should always be able to catch up, though. Doing a normal "kill replica" (kill -9, shutting down the VM where it runs, stopping the VM where it runs) did not manage to bring the replica to a state where it would not recover.

All best
Bogdan
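For anyone trying to repeat the quoted steps, the scale of the workload can be worked out from the three figures given (200K tables, about 5 billion rows, 10 rows per transaction). The sketch below is only an illustration: the function names, the table name `t_000001`, and the `(id, payload)` column layout are hypothetical, since the report does not describe the schema, and the script only builds SQL strings rather than talking to a server.

```python
from typing import Iterator

# Figures taken from the quoted reproduction steps.
TABLES = 200_000
TOTAL_ROWS = 5_000_000_000
ROWS_PER_TXN = 10


def workload_plan(tables: int, total_rows: int, rows_per_txn: int) -> dict:
    """Derive per-table row and transaction counts for an even spread."""
    rows_per_table = total_rows // tables
    return {
        "rows_per_table": rows_per_table,
        "txns_per_table": rows_per_table // rows_per_txn,
        "total_txns": total_rows // rows_per_txn,
    }


def batch_statements(table: str, first_id: int, rows_per_txn: int) -> Iterator[str]:
    """Yield the SQL for one transaction that inserts rows_per_txn rows.

    The (id, payload) layout is an assumption made for illustration only.
    """
    values = ", ".join(
        f"({first_id + i}, 'row-{first_id + i}')" for i in range(rows_per_txn)
    )
    yield "START TRANSACTION"
    yield f"INSERT INTO {table} (id, payload) VALUES {values}"
    yield "COMMIT"


plan = workload_plan(TABLES, TOTAL_ROWS, ROWS_PER_TXN)
print(plan)

# Scaled-down example: one transaction of 3 rows against a hypothetical table.
for stmt in batch_statements("t_000001", 1, 3):
    print(stmt)
```

With the figures from the report this works out to 25,000 rows and 2,500 transactions per table, or 500 million transactions in total, which gives a sense of how many transactions the replica has to apply before and after it is killed.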
[21 Dec 2020 20:08]
Sergiu Hlihor
Some extra information. The master - replica was switched live to GTID based replication with a few days prior to the incident. Could this affect it?
[21 Dec 2020 20:12]
MySQL Verification Team
Hi,

> The master - replica was switched live to GTID based replication with a few days prior to the incident. Could this affect it?

I need to check with my colleagues. Off the top of my head, no, but again, with restarting the replica from scratch there might be some issues. I will have to check and get back to you. I will also redo the test with this change at some point.

All best
Bogdan
[25 Dec 2020 8:50]
MySQL Verification Team
Hi,

> The master - replica was switched live to GTID based replication with a few days prior to the incident. Could this affect it?

I retested with this too, but no luck reproducing.
[28 Dec 2020 11:47]
Sergiu Hlihor
Is there any way to reset the replication on the replica server and start it without letting it go through the flow that ends in "Gtid_set::add_gno_interval"? I have a large amount of data on that replica, and rebuilding from the ground up is quite time consuming.