Description:
Issue: InnoDB Cluster primary writes are impacted by addInstance() and removeInstance().
How to repeat:
Test Case:
Started an InnoDB Cluster with one node in single-primary mode.
MySQL 127.0.0.1:30308 ssl JS > cluster.status()
{
"clusterName": "testCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:30308",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures.",
"topology": {
"127.0.0.1:30308": {
"address": "127.0.0.1:30308",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "127.0.0.1:30308"
}
### Started sysbench write-only load on the primary
$ sysbench /usr/share/sysbench/oltp_write_only.lua --mysql_storage_engine=innodb --table-size=20000 --tables=10 --mysql-host=127.0.0.1 --mysql-db=test --mysql-port=30308 --mysql-user=msandbox --mysql-password=msandbox --threads=8 --time=0 --report-interval=1 --events=0 --db-driver=mysql --db-ps-mode=disable run
###### Adding a new node to the InnoDB Cluster:
MySQL 127.0.0.1:30308 ssl JS > cluster.addInstance('icadmin@127.0.0.1:30309')
NOTE: The target instance '127.0.0.1:30309' has not been pre-provisioned (GTID set is empty). The Shell is unable to decide whether incremental state recovery can correctly provision it.
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of '127.0.0.1:30309' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Please select a recovery method [C]lone/[I]ncremental recovery/[A]bort (default Clone):
Validating instance configuration at 127.0.0.1:30309...
This instance reports its own address as 127.0.0.1:30309
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using '127.0.0.1:30309'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: 127.0.0.1:30309 is being cloned from 127.0.0.1:30308
** Stage DROP DATA: Completed
** Clone Transfer
    FILE COPY  ######################################################  100%  Completed
    PAGE COPY  ######################################################  100%  Completed
    REDO COPY  ######################################################  100%  Completed
NOTE: 127.0.0.1:30309 is shutting down...
* Waiting for server restart... ready
* 127.0.0.1:30309 has restarted, waiting for clone to finish...
** Stage RESTART: Completed
* Clone process has finished: 240.93 MB transferred in about 1 second (~240.93 MB/s)
Incremental state recovery is now in progress.
* Waiting for distributed recovery to finish...
NOTE: '127.0.0.1:30309' is being recovered from '127.0.0.1:30308'
* Distributed recovery has finished
The instance '127.0.0.1:30309' was successfully added to the cluster.
MySQL 127.0.0.1:30308 ssl JS >
As shown below, the write load on the primary was impacted during addInstance():
[ 39s ] thds: 8 tps: 284.00 qps: 1703.98 (r/w/o: 0.00/1135.99/567.99) lat (ms,95%): 38.94 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 8 tps: 255.00 qps: 1530.00 (r/w/o: 0.00/1020.00/510.00) lat (ms,95%): 41.10 err/s: 0.00 reconn/s: 0.00
[ 41s ] thds: 8 tps: 272.00 qps: 1632.03 (r/w/o: 0.00/1088.02/544.01) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 42s ] thds: 8 tps: 257.99 qps: 1547.97 (r/w/o: 0.00/1031.98/515.99) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 43s ] thds: 8 tps: 261.93 qps: 1571.57 (r/w/o: 0.00/1047.72/523.86) lat (ms,95%): 46.63 err/s: 0.00 reconn/s: 0.00
[ 44s ] thds: 8 tps: 69.84 qps: 419.03 (r/w/o: 0.00/279.36/139.68) lat (ms,95%): 66.84 err/s: 0.00 reconn/s: 0.00 ####
[ 45s ] thds: 8 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 1708.63 err/s: 0.00 reconn/s: 0.00 #### Write throughput on the primary dropped to zero during addInstance()
[ 46s ] thds: 8 tps: 264.63 qps: 1587.77 (r/w/o: 0.00/1058.52/529.26) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
[ 47s ] thds: 8 tps: 239.02 qps: 1434.11 (r/w/o: 0.00/956.08/478.04) lat (ms,95%): 47.47 err/s: 0.00 reconn/s: 0.00
[ 48s ] thds: 8 tps: 252.98 qps: 1515.87 (r/w/o: 0.00/1009.92/505.96) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 49s ] thds: 8 tps: 230.01 qps: 1382.08 (r/w/o: 0.00/922.05/460.03) lat (ms,95%): 52.89 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 8 tps: 227.00 qps: 1362.02 (r/w/o: 0.00/908.01/454.01) lat (ms,95%): 61.08 err/s: 0.00 reconn/s: 0.00
[ 51s ] thds: 8 tps: 249.99 qps: 1495.92 (r/w/o: 0.00/995.95/499.97) lat (ms,95%): 45.79 err/s: 0.00 reconn/s: 0.00
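The dip can be quantified by parsing the per-second sysbench report. A minimal Python sketch (the report lines are copied from the output above; the 50%-of-baseline threshold is an arbitrary choice for illustration):

```python
import re

# Per-second report lines around the addInstance() call (copied from above).
report = """\
[ 43s ] thds: 8 tps: 261.93 qps: 1571.57 (r/w/o: 0.00/1047.72/523.86) lat (ms,95%): 46.63 err/s: 0.00 reconn/s: 0.00
[ 44s ] thds: 8 tps: 69.84 qps: 419.03 (r/w/o: 0.00/279.36/139.68) lat (ms,95%): 66.84 err/s: 0.00 reconn/s: 0.00
[ 45s ] thds: 8 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 1708.63 err/s: 0.00 reconn/s: 0.00
[ 46s ] thds: 8 tps: 264.63 qps: 1587.77 (r/w/o: 0.00/1058.52/529.26) lat (ms,95%): 42.61 err/s: 0.00 reconn/s: 0.00
"""

# Extract (second, tps) pairs and flag seconds where throughput fell
# below half of the first (pre-dip) sample.
samples = [(int(s), float(t))
           for s, t in re.findall(r"\[ (\d+)s \] thds: \d+ tps: (\d+\.\d+)", report)]
baseline = samples[0][1]
stalled = [(s, t) for s, t in samples if t < 0.5 * baseline]
print(f"baseline {baseline:.0f} tps, stalled seconds: {stalled}")
# → baseline 262 tps, stalled seconds: [(44, 69.84), (45, 0.0)]
```

So the primary went from ~262 tps to a full one-second stall while the clone snapshot was taken.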
MySQL 127.0.0.1:30308 ssl JS > cluster.status()
{
"clusterName": "testCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:30308",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures.",
"topology": {
"127.0.0.1:30308": {
"address": "127.0.0.1:30308",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
},
"127.0.0.1:30309": {
"address": "127.0.0.1:30309",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "127.0.0.1:30308"
}
#### Added one more node:
MySQL 127.0.0.1:30308 ssl JS > cluster.addInstance('icadmin@127.0.0.1:30310')
NOTE: The target instance '127.0.0.1:30310' has not been pre-provisioned (GTID set is empty). The Shell is unable to decide whether incremental state recovery can correctly provision it.
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of '127.0.0.1:30310' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Please select a recovery method [C]lone/[I]ncremental recovery/[A]bort (default Clone):
Validating instance configuration at 127.0.0.1:30310...
This instance reports its own address as 127.0.0.1:30310
Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using '127.0.0.1:30310'. Use the localAddress option to override.
* Checking connectivity and SSL configuration...
A new instance will be added to the InnoDB Cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.
NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.
* Waiting for clone to finish...
NOTE: 127.0.0.1:30310 is being cloned from 127.0.0.1:30308
** Stage DROP DATA: Completed
** Clone Transfer
    FILE COPY  ######################################################  100%  Completed
    PAGE COPY  ######################################################  100%  Completed
    REDO COPY  ######################################################  100%  Completed
NOTE: 127.0.0.1:30310 is shutting down...
* Waiting for server restart... ready
* 127.0.0.1:30310 has restarted, waiting for clone to finish...
** Stage RESTART: Completed
* Clone process has finished: 217.01 MB transferred in about 1 second (~217.01 MB/s)
Incremental state recovery is now in progress.
* Waiting for distributed recovery to finish...
NOTE: '127.0.0.1:30310' is being recovered from '127.0.0.1:30308'
* Distributed recovery has finished
The instance '127.0.0.1:30310' was successfully added to the cluster.
MySQL 127.0.0.1:30308 ssl JS >
[ 23s ] thds: 8 tps: 247.05 qps: 1482.28 (r/w/o: 0.00/988.19/494.09) lat (ms,95%): 46.63 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 8 tps: 258.98 qps: 1553.89 (r/w/o: 0.00/1035.92/517.96) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 8 tps: 239.01 qps: 1434.08 (r/w/o: 0.00/956.05/478.03) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 8 tps: 232.01 qps: 1392.04 (r/w/o: 0.00/928.03/464.01) lat (ms,95%): 49.21 err/s: 0.00 reconn/s: 0.00
[ 27s ] thds: 8 tps: 82.96 qps: 497.78 (r/w/o: 0.00/331.85/165.93) lat (ms,95%): 484.44 err/s: 0.00 reconn/s: 0.00
[ 28s ] thds: 8 tps: 24.01 qps: 144.05 (r/w/o: 0.00/96.04/48.02) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00 #### Write throughput on the primary dropped during addInstance()
[ 29s ] thds: 8 tps: 191.00 qps: 1146.02 (r/w/o: 0.00/764.01/382.01) lat (ms,95%): 50.11 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 8 tps: 213.99 qps: 1283.93 (r/w/o: 0.00/855.96/427.98) lat (ms,95%): 53.85 err/s: 0.00 reconn/s: 0.00
[ 31s ] thds: 8 tps: 219.01 qps: 1314.04 (r/w/o: 0.00/876.03/438.01) lat (ms,95%): 53.85 err/s: 0.00 reconn/s: 0.00
[ 32s ] thds: 8 tps: 222.98 qps: 1337.91 (r/w/o: 0.00/891.94/445.97) lat (ms,95%): 49.21 err/s: 0.00 reconn/s: 0.00
[ 33s ] thds: 8 tps: 224.01 qps: 1344.08 (r/w/o: 0.00/896.05/448.03) lat (ms,95%): 50.11 err/s: 0.00 reconn/s: 0.00
[ 34s ] thds: 8 tps: 189.91 qps: 1139.44 (r/w/o: 0.00/759.62/379.81) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 8 tps: 217.01 qps: 1302.08 (r/w/o: 0.00/868.05/434.03) lat (ms,95%): 51.94 err/s: 0.00 reconn/s: 0.00
MySQL 127.0.0.1:30308 ssl JS > cluster.status()
{
"clusterName": "testCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:30308",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:30308": {
"address": "127.0.0.1:30308",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
},
"127.0.0.1:30309": {
"address": "127.0.0.1:30309",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
},
"127.0.0.1:30310": {
"address": "127.0.0.1:30310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.0.33"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "127.0.0.1:30308"
}
###########
removeInstance() also impacts write performance on the primary:
MySQL 127.0.0.1:30308 ssl JS > cluster.removeInstance('icadmin@127.0.0.1:30310')
The instance will be removed from the InnoDB Cluster.
* Waiting for instance '127.0.0.1:30310' to synchronize with the primary...
** Transactions replicated ############################################### 100%
* Instance '127.0.0.1:30310' is attempting to leave the cluster...
The instance '127.0.0.1:30310' was successfully removed from the cluster.
[ 34s ] thds: 8 tps: 222.01 qps: 1332.05 (r/w/o: 0.00/888.04/444.02) lat (ms,95%): 53.85 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 8 tps: 213.98 qps: 1283.87 (r/w/o: 0.00/855.91/427.96) lat (ms,95%): 53.85 err/s: 0.00 reconn/s: 0.00
[ 36s ] thds: 8 tps: 233.01 qps: 1398.05 (r/w/o: 0.00/932.03/466.02) lat (ms,95%): 47.47 err/s: 0.00 reconn/s: 0.00
[ 37s ] thds: 8 tps: 202.98 qps: 1217.88 (r/w/o: 0.00/811.92/405.96) lat (ms,95%): 54.83 err/s: 0.00 reconn/s: 0.00
[ 38s ] thds: 8 tps: 225.04 qps: 1347.24 (r/w/o: 0.00/897.16/450.08) lat (ms,95%): 51.02 err/s: 0.00 reconn/s: 0.00
[ 39s ] thds: 8 tps: 92.00 qps: 555.02 (r/w/o: 0.00/371.02/184.01) lat (ms,95%): 580.02 err/s: 0.00 reconn/s: 0.00 #### Write throughput on the primary dropped during removeInstance()
[ 40s ] thds: 8 tps: 219.00 qps: 1313.98 (r/w/o: 0.00/875.98/437.99) lat (ms,95%): 46.63 err/s: 0.00 reconn/s: 0.00
[ 41s ] thds: 8 tps: 216.98 qps: 1301.90 (r/w/o: 0.00/867.93/433.97) lat (ms,95%): 63.32 err/s: 0.00 reconn/s: 0.00
[ 42s ] thds: 8 tps: 214.03 qps: 1284.21 (r/w/o: 0.00/856.14/428.07) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 43s ] thds: 8 tps: 221.95 qps: 1331.70 (r/w/o: 0.00/887.80/443.90) lat (ms,95%): 51.02 err/s: 0.00 reconn/s: 0.00
[ 44s ] thds: 8 tps: 222.01 qps: 1332.05 (r/w/o: 0.00/888.03/444.02) lat (ms,95%): 55.82 err/s: 0.00 reconn/s: 0.00
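The removeInstance() dip can be put in perspective with quick arithmetic (the values are taken from the 38s and 39s report lines above; this is illustrative math, not part of the test case):

```python
# tps and p95 latency immediately before and during the removeInstance() dip,
# taken from the 38s and 39s sysbench report lines above.
before_tps, dip_tps = 225.04, 92.00
before_lat_ms, dip_lat_ms = 51.02, 580.02

tps_drop_pct = 100 * (1 - dip_tps / before_tps)
lat_ratio = dip_lat_ms / before_lat_ms
print(f"tps drop: {tps_drop_pct:.0f}%, p95 latency: {lat_ratio:.1f}x")
# → tps drop: 59%, p95 latency: 11.4x
```

A roughly 59% throughput drop and an ~11x p95 latency spike for one second, even though the node being removed is a secondary.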
Suggested fix:
Adding or removing a secondary instance in an InnoDB Cluster should not impact write throughput on the primary.