Bug #102556 Apparent deadlock involving group replication applier threads
Submitted: 10 Feb 2021 15:48 Modified: 6 Apr 2021 11:34
Reporter: Eduardo Ortega Email Updates:
Status: Verified Impact on me: None
Category: MySQL Server: Group Replication Severity: S2 (Serious)
Version: 8.0.21 OS: CentOS (Release: 7.7.1908)
Assigned to: CPU Architecture: x86 (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)

[10 Feb 2021 15:48] Eduardo Ortega
Description:
One of our replication group members just got stuck. Most of the replication applier threads are waiting for the preceding transaction to commit:

Id      User    Host    db      Command Time    State   Info
10      system user             NULL    Connect 597123  waiting for handler commit      Group replication applier module
13      system user             NULL    Query   3085    Waiting for slave workers to process their queues       NULL
14      system user             NULL    Query   3085    Waiting for commit lock NULL
15      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
16      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
17      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
18      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
19      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
20      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
21      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
22      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL
23      system user             NULL    Query   3085    Waiting for preceding transaction to commit     NULL

See the attachment for the full processlist, backtrace, my.cnf and mysqld.log. 
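
For reference, a query along these lines (just a sketch against the standard performance_schema table; 'group_replication_applier' is the default applier channel name) shows what each applier worker is applying and since when:

-- Sketch: current transaction and start-of-apply timestamp per applier worker
SELECT CHANNEL_NAME, WORKER_ID, THREAD_ID,
       APPLYING_TRANSACTION,
       APPLYING_TRANSACTION_START_APPLY_TIMESTAMP
  FROM performance_schema.replication_applier_status_by_worker
 WHERE CHANNEL_NAME = 'group_replication_applier';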

How to repeat:
No clear directions on how to repeat, but we seem to have hit this twice today on different hosts of the same replication group.
[12 Feb 2021 16:15] MySQL Verification Team
Hi Eduardo,

I went through all the data you provided, but I can't reproduce the problem, nor can I see what exactly happened, so I have to involve the GR Dev team to help me out. Can you please share the SR number?

all best
Bogdan
[12 Feb 2021 16:48] MySQL Verification Team
Hi Eduardo,

The dev team would like to know if you know the answers to these questions:

1) what are the queries causing this?

2) is the member suffering network issues when this happens?

They have some ideas about what might be causing this but need more data. Please open an SR for this so we can expedite it too.

All best
Bogdan
[12 Feb 2021 17:04] MySQL Verification Team
Hi,

We see this:

745269  root  localhost mysql Query 1635  Waiting for global read lock  INSTALL PLUGIN clone SONAME 'mysql_clone.so'
746794  mysql.session localhost NULL  Query 459 Waiting for commit lock PLUGIN: SET GLOBAL super_read_only= 1

Are you doing "install plugin" on a live server? Is this something you did on both systems that got deadlocked?
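
If this happens again, a query along these lines (only a sketch; it relies on the metadata lock instrumentation that is enabled by default in 8.0) should show who holds and who waits for the global and commit metadata locks:

-- Sketch: holders and waiters of the GLOBAL and COMMIT metadata locks
SELECT OBJECT_TYPE, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
  FROM performance_schema.metadata_locks
 WHERE OBJECT_TYPE IN ('GLOBAL', 'COMMIT');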

Thanks
Bogdan
[12 Feb 2021 17:04] MySQL Verification Team
Hi Eduardo,

If you have some automation steps you are doing here, can you share the script?

all best
Bogdan
[15 Feb 2021 10:43] Eduardo Ortega
Hi, Bogdan:

This is the SR number: SR 3-25184026951 . It was opened on the same day as the bug.

As for your questions: 

> 1) what are the queries causing this?

One of the attached files has the output of show processlist at the time of the issue.

> 2) is the member suffering network issues when this happens?

None that we are aware of. Do you see anything in the log that would suggest so? If so, and you can point me to a specific date and time, I can look at our network metrics to see whether I can identify anything.

> Are you doing "install plugin" on a live server? Is this something you did on both systems that got deadlocked?

Yes, sometimes we do INSTALL PLUGIN on live servers. This happens when we add a plugin to the configuration file. There is a script that ensures that every plugin defined in the configuration is loaded on the running instance. I don't think this is related to the issue, though. From the processlist, that query has been stuck for 1635 seconds, whereas the GR appliers have been stuck for 3085 seconds - that is, they got stuck long before INSTALL PLUGIN was issued.
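
For illustration only, the kind of check such a script might do is roughly this (a simplified, hypothetical sketch, not our actual automation; the clone plugin is just the example from the processlist above):

-- Sketch: see whether the plugin is already loaded...
SELECT PLUGIN_NAME, PLUGIN_STATUS
  FROM INFORMATION_SCHEMA.PLUGINS
 WHERE PLUGIN_NAME = 'clone';

-- ...and only install it if the row above is missing
INSTALL PLUGIN clone SONAME 'mysql_clone.so';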

> If you have some automation steps you are doing here, can you share the script?

From what I see in our logs, nothing seems to have been running at the time when the issue started (~15:41 for the 1004 host and ~15:48 for the 8001 host).

I have a core dump for one of the hosts (1004) in case it is useful.
[17 Feb 2021 13:59] MySQL Verification Team
Hi,

> None that we are aware of. Do you see anything in the log that would suggest so?
> If so, and you can point me to a specific date and time, I can look at our
> network metrics to see whether I can identify anything.

No, I don't see anything that suggests the network was down or problematic, but I have to ask, as it might be part of the problem.

> Yes, sometimes we do INSTALL PLUGIN on live servers. This happens when we 
> add a plugin to the configuration file. There is a script that ensures that
> every plugin defined in the configuration is loaded on the running instance.
> I don't think this is related to the issue, though. 
> From the processlist, that query has been stuck for 1635 seconds, whereas
> the GR appliers have been stuck for 3085 seconds - that is, they got stuck long before INSTALL PLUGIN was issued.

I see that INSTALL PLUGIN is also deadlocked here, but it is possible that it is not related.
I'm running a test now that executes INSTALL PLUGIN, to see whether it can cause this.

> From what I see in our logs, nothing seems to have been running at the time when
> the issue started (~15:41 for the 1004 host and ~15:48 for the 8001 host).

We don't see anything we can use at this point, and that worries me. I hoped you were running some script at that time that you could share, but if this happens during "normal hours" it is harder to track.

> I have a core dump for one of the hosts (1004) in case it is useful.

I'll check with the GR DEV team whether they have a specific request for that core dump; please keep it for a while.

thanks
Bogdan
[17 Feb 2021 14:51] MySQL Verification Team
Hi Eduardo,

I'm having issues reproducing this. The GR DEV team looked at the data provided and are not sure what is happening, but the code involved changed in both .22 and .23, and as I understand it you already have plans to upgrade to .23. I think that would be the best course of action, as we believe this will not happen again with .23 (but we can't be sure, since we can't reproduce it with .21).

all best
Bogdan
[3 Mar 2021 15:48] MySQL Verification Team
Verified.

Confirmed workaround - slave_parallel_workers=1
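
A minimal sketch of applying it on an affected member (the variable name is the 8.0.21 spelling; the applier picks up the change when group replication is restarted on that member):

-- Sketch: apply and persist the workaround, then restart the applier
STOP GROUP_REPLICATION;
SET PERSIST slave_parallel_workers = 1;  -- SET GLOBAL works too; PERSIST keeps it across restarts
START GROUP_REPLICATION;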

Bogdan
[5 Apr 2021 15:18] MySQL Verification Team
Hi Eduardo,

The core dump you had - do you still have access to it? The devs would like to take a look at it.

thanks
Bogdan
[6 Apr 2021 11:34] Eduardo Ortega
Hi, Bogdan:

I am afraid that, since it has been a while and there was no interest in it when it was originally mentioned, we no longer have access to it. Sorry about that :-(
[6 Apr 2021 11:54] MySQL Verification Team
Hi Eduardo,

Nothing to be sorry about. I did not expect it to still be available, but I had to ask, since I know you sometimes keep these around for a while.

At the time we did not see a need for it; now the GR team believes they might be able to extract something valuable from it, but it is too late.

Thanks,
Bogdan
[6 Apr 2021 12:18] MySQL Verification Team
Hi,

One question: you said you upgraded to .23 and that you are still seeing deadlocks. Are you noticing an impact on performance, or are you only seeing log entries? The messages you mentioned come from the earlier cases that were deadlocks on 8.0.21 and are now automatically retried and eventually succeed.

So, if I understand correctly, you believe these messages are just remnants of the problem from the old version. Is there a way to "clear them", so that everything is applied and there are no more errors/warnings in the log?

Is there a real issue now after the upgrade to .23, or is there still a problem present?
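
If useful, a quick way to check whether the applier is in fact retrying transactions (just a sketch against the standard performance_schema table):

-- Sketch: cumulative transaction retry count per replication channel
SELECT CHANNEL_NAME, SERVICE_STATE, COUNT_TRANSACTIONS_RETRIES
  FROM performance_schema.replication_applier_status;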

Thanks
Bogdan