Bug #102742 | Distributed recovery fails if performed while many rows are being inserted
---|---|---|---
Submitted: | 25 Feb 2021 18:27 | Modified: | 25 May 2021 16:05
Reporter: | Keith Lammers | Email Updates: |
Status: | Closed | Impact on me: |
Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical)
Version: | 8.0.23 | OS: | Windows (Windows Server 2019)
Assigned to: | | CPU Architecture: | x86
[25 Feb 2021 18:27]
Keith Lammers
[26 Feb 2021 16:58]
MySQL Verification Team
Hi,

Can you share your config? I'm having issues reproducing this:

> * Wait some time (I waited 5 or 6 minutes)

I tried with up to 30 minutes and I never reproduced the problem.

> Last resort if the workaround fails is to perform removeInstance and then addInstance('INSTANCENAME',{recoveryMethod:'clone'})

addInstance with recoveryMethod:'clone' is a proper way to recover from this situation. I'm not sure how you get into this situation in 5-6 minutes; this is something I'm failing to reproduce.

all best
Bogdan
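For reference, the last-resort workaround mentioned above can be sketched in MySQL Shell (JS mode). The instance address used here is a placeholder, not one from this report:

```javascript
// MySQL Shell (JS mode) sketch of the clone-recovery workaround.
// "icadmin@node3:3306" is a placeholder address.
var cluster = dba.getCluster();

// Remove the instance that failed distributed recovery...
cluster.removeInstance("icadmin@node3:3306", {force: true});

// ...then re-add it, forcing clone-based recovery instead of
// incremental (binlog-based) distributed recovery.
cluster.addInstance("icadmin@node3:3306", {recoveryMethod: "clone"});
```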
[26 Feb 2021 17:00]
Keith Lammers
Sure thing! You want the my.ini from each instance?
[26 Feb 2021 17:17]
MySQL Verification Team
Hi,

Yes, please. Also, how did you create the cluster?

all best
Bogdan
[26 Feb 2021 17:51]
MySQL Verification Team
Hi,

Important values to check:
- How is group_replication_consistency configured (is it AFTER?)
- Which node is shut down? The primary or one of the other nodes?
- How is the cluster configured? RW/RO/RO or something else? Is the RO node shut down?

all best
Bogdan
[26 Feb 2021 18:09]
Keith Lammers
Ok, config files attached. To answer your other questions:

* The cluster was created using the MySQL Shell AdminAPI
* group_replication_consistency is set to AFTER
* The node that is shut down during this test is an R/O node
* The cluster is 3 instances, single-primary
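The consistency level described above corresponds to a setting like the following in each instance's option file (a minimal fragment for illustration, not the reporter's full my.ini):

```ini
[mysqld]
# Group Replication waits for each transaction to be applied on all
# members before returning to the client (read-your-writes everywhere).
group_replication_consistency = AFTER
```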
[27 Mar 2021 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[27 Mar 2021 15:15]
Keith Lammers
I had replied with the answers to the questions and sent in the config files as a private attachment, so I'm not sure why this is being marked as no feedback. If there's anything else you need, please let me know. Thanks!
[30 Mar 2021 1:24]
Keith Lammers
To add another update to this, I've set up a new InnoDB Cluster on Linux instead of Windows, and I can still reproduce this behaviour. Interestingly, with this cluster, instead of just automatically disabling GR after distributed recovery fails with that error, it falls back to clone recovery automatically. The instance wasn't down for more than 2 minutes when I did this test. These are the errors reported in the log:

2021-03-30T01:18:53.040387Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2021-03-30T01:18:53.044236Z 10 [ERROR] [MY-013309] [Repl] Plugin group_replication reported: 'Transaction '1:10578' does not exist on Group Replication consistency manager while receiving remote transaction prepare.'
2021-03-30T01:18:53.044295Z 10 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2021-03-30T01:18:53.044479Z 10 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2021-03-30T01:18:56.591557Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

It actually looks as though distributed recovery finished, based on that first log line, and then it hits the "Transaction '1:10578' does not exist on Group Replication consistency manager" error immediately after.
[31 Mar 2021 8:05]
Bin Wang
Don't use the AFTER consistency level for now. Lots of bugs occur with it when the view changes.
[31 Mar 2021 10:34]
MySQL Verification Team
Hi,

Thanks for the report. I verified the behavior, and we are already working on the fix, but the dev team might have additional questions, so please monitor this report.

all best
Bogdan
[31 Mar 2021 11:00]
Keith Lammers
Excellent, thanks Bogdan! I've spent the last couple of days trying to narrow down more specific, repeatable steps, and also trying to find a workaround. If you need any other info at all, please don't hesitate to ask. Bin mentioned not using the AFTER consistency level, so I will re-test with EVENTUAL and BEFORE_AND_AFTER as well.
[31 Mar 2021 11:37]
MySQL Verification Team
Hi,

Check out https://mysqlhighavailability.com/group-replication-consistent-reads/

EVENTUAL is the safest, but you can have stale reads. We do appreciate any bugs reported for AFTER, BEFORE, BEFORE_AND_AFTER... especially if we can reproduce them.

Thanks
Bogdan
[31 Mar 2021 18:18]
Keith Lammers
Can confirm that this bug does not occur with the consistency level set to EVENTUAL. I need reads to be 100% consistent, so I think a potential workaround is to use EVENTUAL consistency but have the app servers use the R/W port (6446) on MySQL Router, so that all connections go to the primary only.
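The workaround described above amounts to a one-line server-side change (the port number is the MySQL Router read-write default; whether EVENTUAL plus primary-only routing is acceptable depends on the application):

```sql
-- Drop back to eventual consistency on the group; SET PERSIST keeps the
-- setting across restarts. Reads then stay consistent only if the app
-- connects through Router's read-write port (6446 by default), so every
-- connection lands on the primary.
SET PERSIST group_replication_consistency = 'EVENTUAL';
```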
[25 May 2021 16:05]
Nuno Carvalho
Changelog entry added for MySQL 8.0.26: With the Group Replication system variable group_replication_consistency = AFTER set, if a view change event was delayed until after a locally prepared transaction was completed, a different GTID could be applied to it, causing errors in replication. The data is now processed in the same sequence it is received to avoid the situation.