Bug #114624 Deadlock on primary replica when adding instance to replica cluster with writes
Submitted: 11 Apr 13:37 Modified: 8 Aug 21:01
Reporter: Paulo Machado Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.39 OS:Ubuntu (22.04)
Assigned to: CPU Architecture:x86

[11 Apr 13:37] Paulo Machado
Description:
Using group replication and cluster-set, with a primary cluster A with 3 nodes and replica cluster B with 1 node (B1, primary).

Cluster A has small write load (60 records/s), through its r/w node A1 (single-primary mode)

When trying to join a new instance (B2) to cluster B, using mysql-shell Cluster.add_instance method from B1, the process hangs forever in a deadlock.

Repeating the process stopping the write load on cluster A before adding the new instance works as expected.

How to repeat:
Using mysql-shell, through python API:
1. Deploy Group Replication cluster (A)
2. Create ClusterSet on A
3. Deploy standalone instance B1
4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)
5. Add some write load to cluster A
6. Deploy standalone instance B2
7. Join instance B2 to cluster B from B1 (cluster.add_instance)

Add instance hangs at this point
[11 Apr 14:46] Paulo Machado
mysqlsh log where add instance is executed

Attachment: mysqlsh.log (text/x-log), 185.41 KiB.

[11 Apr 14:47] Paulo Machado
GDB stack trace from primary instance of replica cluster

Attachment: gdb_stack_trace.txt (text/plain), 90.56 KiB.

[11 Apr 14:47] Paulo Machado
Thread on primary instance of replica cluster

Attachment: performance_schema.threads.txt (text/plain), 52.10 KiB.

[11 Apr 15:43] MySQL Verification Team
Thanks for the report
[15 Apr 14:12] Paulo Machado
Extra context:

The issue is being consistently replicated in the context of a operator for mysql (same behavior on vms, bare metal and kubernetes).
That said, the sequence of events in the replication steps happens in quick succession, which may be playing a role in triggering the deadlock
[8 Aug 21:01] Paulo Machado
Reproducible using:

dba.deploy_sandbox_instance(3310, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3320, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3330, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 1. Deploy Group Replication cluster (A)
shell.connect("root@127.0.0.1:3310", "")
cluster = dba.create_cluster("my_cluster")
cluster.add_instance("127.0.0.1:3320", {'recoveryMethod':'incremental'});
cluster.add_instance("127.0.0.1:3330", {'recoveryMethod':'incremental'});

# 2. Create ClusterSet on A
clusterset = cluster.create_cluster_set("foobar")

# 3. Deploy standalone instance B1
dba.deploy_sandbox_instance(3340, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)
clusterset.create_replica_cluster("127.0.0.1:3340", "replica", {'recoveryMethod':'incremental'});

# 5. Add some write load to cluster A

# \sql
my_session = mysql.get_classic_session('root@localhost:3310', '');
my_session.run_sql("CREATE DATABASE test");
my_session.run_sql("USE test;");
my_session.run_sql("CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 INT) ENGINE=InnoDB;");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");

On other terminal generate load with mysqlslap:

mysqlslap -S /tmp/sandbox/3310/sandboxdata/mysqld.sock --create-schema=test --delimiter=";" --iterations=300000 --query="INSERT INTO t1 (c2) SELECT c2 FROM t1 LIMIT 100" --concurrency=100 &

And on mysqlshell execute final steps:

shell.connect("root@127.0.0.1:3340", "")
dba.deploy_sandbox_instance(3350, {'password': "", 'sandboxDir': "/tmp/sandbox"});
# First difference - I'm adding instances to the replica cluster
cluster = dba.get_cluster("replica")
cluster.add_instance("127.0.0.1:3350", {'recoveryMethod':'incremental'})

# Second difference - sometimes It was needed to add a third instance to the replica cluster
dba.deploy_sandbox_instance(3360, {'password': "", 'sandboxDir': "/tmp/sandbox"});
cluster.add_instance("127.0.0.1:3360", {'recoveryMethod':'incremental'})

session.run_sql("SELECT max(c1) FROM test.t1")
# PK stop incrementing while mysqlslap is running

# connect to `my_cluster` primary
shell.connect("root@127.0.0.1:3310", "")
# PK still incrementing while mysqlslap is running
session.run_sql("SELECT max(c1) FROM test.t1")