Bug #103421  DDL in Group Replication cluster, even if online, causes a full stop of write operations
Submitted: 22 Apr 2021 8:22     Modified: 15 Jul 2021 8:46
Reporter: Marco Tusa
Status: Duplicate               Impact on me: None
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version:                        OS: CentOS
Assigned to:                    CPU Architecture: Any

[22 Apr 2021 8:22] Marco Tusa
Description:
When using online DDL with Group Replication, the online DDL concept is only partially applicable: all write operations are put on hold while the change replicates. Also, in case of a Primary crash, the cluster remains without a Primary for the duration of the ALTER.

How to repeat:
Folks, this is a somewhat composite issue.
I don't know if we can consider this a bug, but something is definitely acting weird.

Scenario:
    3 MySQL 8.0.23 nodes in a cluster using Group Replication
    
   DC1-3(root@localhost) [windmills_s]>SELECT * FROM performance_schema.replication_group_members order by 3;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | ebff1ab8-9793-11eb-ba5f-08002734ed50 | gr1         |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | f47df54e-9793-11eb-a60b-08002734ed50 | gr2         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | e891d1b4-9793-11eb-92ac-08002734ed50 | gr3         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

Populate a table and make it large enough that an ALTER takes some time to perform.
For example:
DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8195530 DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)
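If you need to bulk-populate the table first, a minimal sketch is to seed one row and then repeatedly self-insert to double the row count (the reporter's actual data comes from a pre-existing windmills7 table, so the literal values below are placeholders):

    -- Seed one row (id and time take their defaults)
    INSERT INTO windmills_test
        (uuid, millid, kwatts_s, date, location, active, strrecordtype)
    VALUES (UUID(), 1, 100, CURDATE(), 'test-location', 1, 'abc');

    -- Each execution doubles the row count; ~23 runs gives ~8M rows
    INSERT INTO windmills_test
        (uuid, millid, kwatts_s, date, location, active, strrecordtype)
    SELECT UUID(), millid, kwatts_s, date, location, active, strrecordtype
      FROM windmills_test;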

From another terminal, run an INSERT loop on the PRIMARY such as:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done 

Then from another terminal, insert using different tables:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills6 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done 

On the SECONDARY nodes, run this to check when the ALTER starts executing:
while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show processlist;"|grep -i "alter";sleep 1;done
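If you prefer not to grep the processlist output, an equivalent check can be done in SQL (a sketch, assuming 8.0.22+ where performance_schema.processlist is available):

    SELECT id, db, time, state, info
      FROM performance_schema.processlist
     WHERE info LIKE 'ALTER%';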

You can try to use:
set global  group_replication_consistency= 'EVENTUAL';
set global  group_replication_consistency= 'AFTER';
or other values; nothing will change.
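You can confirm which level is in effect with:

    SELECT @@global.group_replication_consistency;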

At this point you can run the ALTER command:
    ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;

During the ALTER, INSERTs on that table will continue...
As soon as the ALTER completes on the PRIMARY and starts on the SECONDARY nodes, a metadata lock and a "waiting for handler commit" state are raised on the PRIMARY, and INSERTs stop:
| 1030 | root            | localhost           | windmills_s | Query   |   25 | waiting for handler commit                             | insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype fro |
| 1031 | root            | localhost           | windmills_s | Query   |   24 | Waiting for table metadata lock                        | insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype |
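The blocking metadata lock can be inspected on the PRIMARY while the stall lasts; a sketch using performance_schema (assuming the metadata_locks instrumentation is enabled, as it is by default in 8.0):

    SELECT object_type, object_schema, object_name,
           lock_type, lock_status, owner_thread_id
      FROM performance_schema.metadata_locks
     WHERE object_name = 'windmills_test';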

Once the SLOWEST node has completed the execution of the ALTER, the activity restarts.

When the alter is executed:
DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;
Query OK, 0 rows affected (3 min 22.85 sec)
Records: 0  Duplicates: 0  Warnings: 0

We get no warnings or errors about the fact that the ALTER is not fully supported as an online operation.

At the same time, IF the PRIMARY crashes after the operation completes there (while the SECONDARY nodes are still applying the ALTER), the cluster will remain without a PRIMARY until the fastest node is able to complete it.
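One way to watch that reported window is to poll the group membership from a surviving member and note when a usable PRIMARY reappears:

    SELECT member_host, member_state, member_role
      FROM performance_schema.replication_group_members;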

Also, if I use ALGORITHM=COPY, first, as expected, I cannot insert data into the table being altered, but I can into the others.
Then I have a full stop again!
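For reference, a COPY-based variant of the same kind of ALTER would look like the sketch below (idx_2 and its columns are hypothetical; ALGORITHM=COPY does not support LOCK=NONE, so LOCK=SHARED is the least restrictive option):

    ALTER TABLE windmills_test ADD INDEX idx_2 (millid, date),
          ALGORITHM=COPY, LOCK=SHARED;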

My variables related to GR and replication:
DC1-1(root@localhost) [(none)]>show global variables like '%replica%';
+-----------------------------------------------------+----------------------------------------------------------+
| Variable_name                                       | Value                                                    |
+-----------------------------------------------------+----------------------------------------------------------+
| group_replication_advertise_recovery_endpoints      | DEFAULT                                                  |
| group_replication_allow_local_lower_version_join    | OFF                                                      |
| group_replication_auto_increment_increment          | 7                                                        |
| group_replication_autorejoin_tries                  | 3                                                        |
| group_replication_bootstrap_group                   | OFF                                                      |
| group_replication_clone_threshold                   | 9223372036854775807                                      |
| group_replication_communication_debug_options       | GCS_DEBUG_NONE                                           |
| group_replication_communication_max_message_size    | 10485760                                                 |
| group_replication_components_stop_timeout           | 31536000                                                 |
| group_replication_compression_threshold             | 1000000                                                  |
| group_replication_consistency                       | AFTER                                                    |
| group_replication_enforce_update_everywhere_checks  | OFF                                                      |
| group_replication_exit_state_action                 | READ_ONLY                                                |
| group_replication_flow_control_applier_threshold    | 25000                                                    |
| group_replication_flow_control_certifier_threshold  | 25000                                                    |
| group_replication_flow_control_hold_percent         | 10                                                       |
| group_replication_flow_control_max_quota            | 0                                                        |
| group_replication_flow_control_member_quota_percent | 0                                                        |
| group_replication_flow_control_min_quota            | 0                                                        |
| group_replication_flow_control_min_recovery_quota   | 0                                                        |
| group_replication_flow_control_mode                 | QUOTA                                                    |
| group_replication_flow_control_period               | 1                                                        |
| group_replication_flow_control_release_percent      | 50                                                       |
| group_replication_force_members                     |                                                          |
| group_replication_group_name                        | DC1aaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa                     |
| group_replication_group_seeds                       | 192.168.4.81:33061,192.168.4.82:33061,192.168.4.83:33061 |
| group_replication_gtid_assignment_block_size        | 1000000                                                  |
| group_replication_ip_allowlist                      | AUTOMATIC                                                |
| group_replication_ip_whitelist                      | AUTOMATIC                                                |
| group_replication_local_address                     | 192.168.4.81:33061                                       |
| group_replication_member_expel_timeout              | 60                                                       |
| group_replication_member_weight                     | 50                                                       |
| group_replication_message_cache_size                | 315360004294967295                                       |
| group_replication_poll_spin_loops                   | 0                                                        |
| group_replication_recovery_complete_at              | TRANSACTIONS_APPLIED                                     |
| group_replication_recovery_compression_algorithms   | uncompressed                                             |
| group_replication_recovery_get_public_key           | OFF                                                      |
| group_replication_recovery_public_key_path          |                                                          |
| group_replication_recovery_reconnect_interval       | 60                                                       |
| group_replication_recovery_retry_count              | 10                                                       |
| group_replication_recovery_ssl_ca                   |                                                          |
| group_replication_recovery_ssl_capath               |                                                          |
| group_replication_recovery_ssl_cert                 |                                                          |
| group_replication_recovery_ssl_cipher               |                                                          |
| group_replication_recovery_ssl_crl                  |                                                          |
| group_replication_recovery_ssl_crlpath              |                                                          |
| group_replication_recovery_ssl_key                  |                                                          |
| group_replication_recovery_ssl_verify_server_cert   | OFF                                                      |
| group_replication_recovery_tls_ciphersuites         |                                                          |
| group_replication_recovery_tls_version              | TLSv1,TLSv1.1,TLSv1.2,TLSv1.3                            |
| group_replication_recovery_use_ssl                  | OFF                                                      |
| group_replication_recovery_zstd_compression_level   | 3                                                        |
| group_replication_single_primary_mode               | ON                                                       |
| group_replication_ssl_mode                          | DISABLED                                                 |
| group_replication_start_on_boot                     | OFF                                                      |
| group_replication_tls_source                        | MYSQL_MAIN                                               |
| group_replication_transaction_size_limit            | 2147483647                                               |
| group_replication_unreachable_majority_timeout      | 0                                                        |
| innodb_replication_delay                            | 0                                                        |
| replication_optimize_for_static_plugin_config       | OFF                                                      |
| replication_sender_observe_commit_only              | OFF                                                      |
+-----------------------------------------------------+----------------------------------------------------------+
61 rows in set (0.01 sec)

DC1-1(root@localhost) [(none)]>show global variables like '%slave%';
+------------------------------+----------------------------------+
| Variable_name                | Value                            |
+------------------------------+----------------------------------+
| init_slave                   |                                  |
| log_slave_updates            | ON                               |
| log_slow_slave_statements    | OFF                              |
| rpl_stop_slave_timeout       | 31536000                         |
| slave_allow_batching         | OFF                              |
| slave_checkpoint_group       | 512                              |
| slave_checkpoint_period      | 300                              |
| slave_compressed_protocol    | OFF                              |
| slave_exec_mode              | STRICT                           |
| slave_load_tmpdir            | /opt/mysql_instances/grtest/temp |
| slave_max_allowed_packet     | 1073741824                       |
| slave_net_timeout            | 60                               |
| slave_parallel_type          | LOGICAL_CLOCK                    |
| slave_parallel_workers       | 4                                |
| slave_pending_jobs_size_max  | 134217728                        |
| slave_preserve_commit_order  | ON                               |
| slave_rows_search_algorithms | INDEX_SCAN,HASH_SCAN             |
| slave_skip_errors            | OFF                              |
| slave_sql_verify_checksum    | ON                               |
| slave_transaction_retries    | 10                               |
| slave_type_conversions       |                                  |
| sql_slave_skip_counter       | 0                                |
+------------------------------+----------------------------------+
22 rows in set (0.00 sec)

Summarizing:
- Using ONLINE DDL I am able to insert into the ALTERED table and others while the ALTER is executed on the PRIMARY node
- As soon as the ALTER reaches the SECONDARY nodes, all my inserts are put on hold
- The group_replication_consistency level used makes no difference
- If the PRIMARY crashes while the secondaries are executing the ALTER, the cluster remains without a PRIMARY for the duration of the ALTER on the fastest node

Let me know if you need more details; I can provide a full scenario/video if you are not able to reproduce.
[7 May 2021 12:53] MySQL Verification Team
Hi Marco,

Thanks for the report and the reproduction process. 
I went through it all and reproduced it enough to conclude that you are on to something. To me, at first look, this did not look like a bug. On a second pass through it I'm not so sure anymore, so I'm verifying this for the GR team to check in depth.

all best
Bogdan
[25 May 2021 17:23] Marco Tusa
Hi Bogdan,
any news from the replication team? I am quite surprised to find the described behaviour in the server, and to be honest I was expecting the replication team to be able to explain whether this is on purpose (and if so, why) or not.
As indicated, asking customers to accept twice the time for an ALTER is difficult to propose, especially when competitors do not have that limitation.

Also, potentially suffering the absence of a primary for a long time is, to be nice, "weird".

Please let us know, because this is impacting the solution's credibility.
thanks
[9 Jun 2021 16:42] Nuno Carvalho
Hi Marco,

Thank you for your bug report.

You present the scenario:
  * 3 members on a Single-Primary Group Replication
  * create a table and populate with data enough that an ALTER
    TABLE take some time
  * Execute
      ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`),
      ALGORITHM=INPLACE, LOCK=NONE;
    concurrently with INSERTs on the same table.

When this scenario is run with default configuration, that is with
`group_replication_consistency = EVENTUAL`, the observed behavior is:
  1) the primary is capable of running the ONLINE DDL and DML
     concurrently.
  2) the secondaries are not capable of running the ONLINE DDL and
     DML concurrently, but that does not have any impact on the
     primary, neither on primary elections.

This behavior on the secondaries is common on all replication
replicas, that is, on Asynchronous and Group Replication.
https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-limitations.html
```
Long running online DDL operations can cause replication lag. An
online DDL operation must finish running on the source before it is
run on the replica. Also, DML that was processed concurrently on the
source is only processed on the replica after the DDL operation on
the replica is completed.
```
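One way to watch this lag materialize on a secondary (a sketch using standard performance_schema replication tables) is to compare commit and apply timestamps of the last applied transaction:

    SELECT worker_id,
           last_applied_transaction,
           last_applied_transaction_original_commit_timestamp,
           last_applied_transaction_end_apply_timestamp
      FROM performance_schema.replication_applier_status_by_worker;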

There is a feature request to extend the ONLINE DDLs to replicas:
BUG#73196: Allow ALTER TABLE to run concurrently on master and slave

When we run the above scenario with
  `group_replication_consistency = AFTER`
we are instructing the primary to wait until each transaction is
applied on the secondaries before replying back to the client.
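The same consistency level can also be set just for the session issuing the DDL (a sketch; group_replication_consistency has both GLOBAL and SESSION scope):

    SET SESSION group_replication_consistency = 'AFTER';
    ALTER TABLE windmills_test ADD INDEX idx_1 (uuid, active),
          ALGORITHM=INPLACE, LOCK=NONE;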

Hence we observe the behavior:
  1) on the primary, although the DDL is ONLINE, once it reaches
     the commit stage, all transactions that will start after this
     point will wait for the DDL to complete, which is waiting for
     its apply on the secondaries.
  2) the secondaries are not capable of running the ONLINE DDL and
     DML concurrently, which means that the primary will wait for
     this sequential apply on the secondaries.

This means that an ONLINE DDL on a primary will be affected by the
secondaries, more precisely:
 a) the ONLINE DDL will allow concurrent DMLs until it reaches the
    commit stage.
 b) once the ONLINE DDL reaches the commit stage, DMLs will wait
    until the ONLINE DDL is complete before proceeding.

Since `group_replication_consistency = AFTER` includes the
behavior of `BEFORE_ON_PRIMARY_FAILOVER`, if the primary changes,
the new primary will only allow new transactions after all the
ongoing transactions are complete.
https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-consistency-guarante...

`group_replication_consistency = AFTER` is doing the needed work to
ensure that no session reads outdated data on the secondaries.

It is possible to get read consistency from the group, if you use
`group_replication_consistency = BEFORE`. This will not block the
primary.
The primary will externalize the effects of the DDL (or any
transaction for that matter) as soon as it is replicated
successfully (but not applied).
However, if you go and read from the secondary, you will still
observe latency, as the DDL is only replicated once the primary
finished its execution.
https://dev.mysql.com/doc/refman/8.0/en/group-replication-configuring-consistency-guarante...
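A sketch of that trade-off from a session on a secondary:

    -- On a secondary: subsequent reads wait until all preceding
    -- group transactions (including the DDL) are applied locally
    SET SESSION group_replication_consistency = 'BEFORE';
    SELECT COUNT(*) FROM windmills_s.windmills_test;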

Summary
-------

Long running online DDL operations can cause replication lag. An
online DDL operation must finish running on the primary before it is
run on the secondaries. Also, DML that was processed concurrently on
the primary is only processed on the secondaries after the DDL
operation on the secondaries is completed.
https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-limitations.html
and
BUG#73196: Allow ALTER TABLE to run concurrently on master and slave

`group_replication_consistency = AFTER` is doing the needed work to
ensure that no session reads outdated data on the secondaries.
`group_replication_consistency = EVENTUAL` does not prevent DML on
primary during online DDL.

Please let us know if everything is clarified.

Best regards,
Nuno Carvalho
[15 Jul 2021 8:46] Nuno Carvalho
Duplicate of BUG#73196: Allow ALTER TABLE to run concurrently on master and slave