MySQL Bugs: #100381: Group replication primary got stuck after attempting to rejoin a group member

Bug #100381	Group replication primary got stuck after attempting to rejoin a group member
Submitted:	30 Jul 2020 15:07	Modified:	14 Oct 2020 17:19
Reporter:	Eduardo Ortega
Status:	Can't repeat
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.21	OS:	CentOS (CentOS Linux release 7.7.1908 (Core))
Assigned to:	MySQL Verification Team	CPU Architecture:	x86 ( Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz)

Description:
The MySQL replication group primary got stuck (deadlocked?) after successively stopping and starting group replication on a secondary. The description of the actions taken and what was observed at the time is quite long, so I am attaching it instead of writing it here. The bottom line is this:

* The group primary got stuck.
* This seems to have prevented it from noticing that a host was joining back.
* The group failed to notice that the group primary was not healthy and did not elect a new primary. That happened only after I attached gdb to the mysqld process and triggered a backtrace (which took time and presumably made the group notice that the primary was gone); otherwise, it seems the group would have kept the same primary.
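
When looking at a situation like this, it can help to compare each member's view of the group as reported by performance_schema.replication_group_members. The Python sketch below is only an illustration under assumptions: the rows are assumed to have been fetched already and converted to dictionaries with the MEMBER_HOST, MEMBER_STATE and MEMBER_ROLE columns, and the function name, host names and sample data are hypothetical rather than taken from the affected cluster.

```python
from typing import Dict, List

Row = Dict[str, str]


def summarize_group_view(rows: List[Row]) -> Dict[str, List[str]]:
    """Summarize one member's view of the replication group.

    `rows` is assumed to hold the result of
        SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
          FROM performance_schema.replication_group_members
    already converted to dictionaries (hypothetical input format).
    """
    # Hosts this member believes to be primary.
    primaries = [r["MEMBER_HOST"] for r in rows if r["MEMBER_ROLE"] == "PRIMARY"]
    # Hosts in any state other than ONLINE (e.g. RECOVERING, UNREACHABLE).
    not_online = [r["MEMBER_HOST"] for r in rows if r["MEMBER_STATE"] != "ONLINE"]
    return {"primaries": primaries, "not_online": not_online}


# Hypothetical sample data, not the actual cluster from this report.
sample_view = [
    {"MEMBER_HOST": "db1.example", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "PRIMARY"},
    {"MEMBER_HOST": "db2.example", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "SECONDARY"},
    {"MEMBER_HOST": "db3.example", "MEMBER_STATE": "RECOVERING", "MEMBER_ROLE": "SECONDARY"},
]

summary = summarize_group_view(sample_view)
print(summary)
```

Running the same summary against the rows returned by every member makes it easier to see whether the members disagree about who the primary is or about which hosts are still joining.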

How to repeat:
I have not been able to reliably repeat this. However, I did capture a backtrace, which I will attach to the bug. From the backtrace, I suspect this is due to a deadlock involving the ACL cache metadata lock.

Suggested fix:
For now, my workaround if I hit this again is to force the current primary to become unresponsive (by attaching gdb, killing the process, blocking port 33061, or any other means) so that the group is forced to elect a new primary.

Hi Eduardo,

Thanks for the report.

Bogdan

Hi,

I see the backtraces and the logs. Can you also share all the config files?

thanks
Bogdan

Hi,

> I find it a bit odd that it leads to what seems to be high contention or deadlock on the ACL cache lock. Is this expected?

I'm waiting on updates from AdminAPI team.

> it is not obvious to me from the documentation that these operations should not be mixed up.

When everything is "ok", it should not be a problem, but when something is off it can lead to more problems. So it's not that "you must not", but the golden rule is that it is better not to.

I have more questions from our dev team.

With regard to the "IP Whitelist setting of a member is lost somehow when it leaves the cluster" part of the problem, we need to know more details:

- How did you create the cluster? From the bits of information so far it seems you did it by doing: "mysqlsh dba create-cluster <cluster_name> --ipWhitelist=10.0.0.0/8", which is perfectly fine. Please confirm.

- But... how did you add each instance to the cluster? Which steps were done?

- Which Shell version are you using?

- Can you share the Shell logs? It would be very helpful to understand what you are doing and how, and to assess whether there's a bug somewhere or not, as we can't reproduce or find the bug at this moment.

Thanks

Hi:

> - How did you create the cluster? From the bits of information so far it seems you did it by doing: "mysqlsh dba create-cluster <cluster_name> --ipWhitelist=10.0.0.0/8", which is perfectly fine. Please confirm.

This is accurate.

> - But... how did you add each instance to the cluster? Which steps were done?

Initial addition of members to the group was done with "mysqlsh cluster add-instance user:pass@new_member_host:port --recoveryMethod=incremental --ipWhitelist=10.0.0.0/8 --memberWeight=50". Later on, when we hit the issue, we were directly issuing 'STOP GROUP_REPLICATION' and 'START GROUP_REPLICATION' from the MySQL client, because we didn't want the host to be removed from the InnoDB cluster metadata.
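
Because part of the question is whether the whitelist survives a leave and rejoin, one quick check is whether a member's address is still covered by the value of group_replication_ip_whitelist read from each server. The following is a minimal Python sketch using only the standard ipaddress module. It assumes the variable's value has already been read from the server as a string; the function name and the sample addresses are hypothetical, and it does not try to evaluate AUTOMATIC or host-name entries.

```python
import ipaddress
from typing import Optional


def address_allowed(member_ip: str, whitelist: str) -> Optional[bool]:
    """Check member_ip against a comma-separated whitelist string.

    Returns True if an IP/CIDR entry covers the address, False if none
    does, and None when the answer cannot be decided here (the value is
    AUTOMATIC, or only host-name entries were left unmatched).
    """
    if whitelist.strip().upper() == "AUTOMATIC":
        # Depends on the host's own network interfaces; not modelled here.
        return None

    addr = ipaddress.ip_address(member_ip)
    undecided = False
    for entry in whitelist.split(","):
        entry = entry.strip()
        if not entry:
            continue
        try:
            network = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            # Not an IP address or CIDR block (for example a host name).
            undecided = True
            continue
        if addr in network:
            return True
    return None if undecided else False


# The 10.0.0.0/8 range is the one quoted in this report; the member
# addresses below are made up for illustration.
print(address_allowed("10.1.2.3", "10.0.0.0/8"))
print(address_allowed("192.168.1.5", "10.0.0.0/8"))
```

A result of None is kept separate from False on purpose, so that an AUTOMATIC setting is not mistaken for an address being rejected.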

> - Which Shell version are you using?

The group was created with MySQL Shell (and MySQL version) 8.0.18, but when we had this issue, both had been upgraded to 8.0.21.

> - Can you share the Shell logs? It would be very helpful to understand what you are doing and how, and to assess whether there's a bug somewhere or not, as we can't reproduce or find the bug at this moment.

As stated above, we used mysqlsh to create the group, but that was last year; we don't have the logs anymore.

Hi,

We can't reproduce this, and our dev team can't find enough usable info in the existing traces, so for now this is on hold. Let's see if this happens again so we can grab more logs.

all best
Bogdan