Bug #105436 | GRepl APPLIER_QUEUE=1 after 8.0.27 upgrade, even on primary node | ||
---|---|---|---|
Submitted: | 3 Nov 2021 4:06 | Modified: | 17 Feb 2022 20:46 |
Reporter: | NOC NOC | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical) |
Version: | 8.0.28 | OS: | Debian |
Assigned to: | | CPU Architecture: | Any
[3 Nov 2021 4:06]
NOC NOC
group repl vars

| Variable_name | Value |
|---|---|
| group_concat_max_len | 1024 |
| group_replication_advertise_recovery_endpoints | DEFAULT |
| group_replication_allow_local_lower_version_join | OFF |
| group_replication_auto_increment_increment | 7 |
| group_replication_autorejoin_tries | 3 |
| group_replication_bootstrap_group | OFF |
| group_replication_clone_threshold | 9223372036854775807 |
| group_replication_communication_debug_options | GCS_DEBUG_NONE |
| group_replication_communication_max_message_size | 10485760 |
| group_replication_communication_stack | XCOM |
| group_replication_components_stop_timeout | 300 |
| group_replication_compression_threshold | 1000000 |
| group_replication_consistency | BEFORE_ON_PRIMARY_FAILOVER |
| group_replication_enforce_update_everywhere_checks | OFF |
| group_replication_exit_state_action | READ_ONLY |
| group_replication_flow_control_applier_threshold | 25000 |
| group_replication_flow_control_certifier_threshold | 25000 |
| group_replication_flow_control_hold_percent | 10 |
| group_replication_flow_control_max_quota | 0 |
| group_replication_flow_control_member_quota_percent | 0 |
| group_replication_flow_control_min_quota | 0 |
| group_replication_flow_control_min_recovery_quota | 0 |
| group_replication_flow_control_mode | QUOTA |
| group_replication_flow_control_period | 1 |
| group_replication_flow_control_release_percent | 50 |
| group_replication_force_members | |
| group_replication_group_name | bfd5c347-c26e-42a1-af4b-fed4aabd117e |
| group_replication_group_seeds | 10.5.2.50:33061,10.5.2.60:33061,10.5.2.70:33061 |
| group_replication_gtid_assignment_block_size | 1000000 |
| group_replication_ip_allowlist | AUTOMATIC |
| group_replication_ip_whitelist | AUTOMATIC |
| group_replication_local_address | 10.5.2.50:33061 |
| group_replication_member_expel_timeout | 5 |
| group_replication_member_weight | 50 |
| group_replication_message_cache_size | 1073741824 |
| group_replication_paxos_single_leader | OFF |
| group_replication_poll_spin_loops | 0 |
| group_replication_recovery_complete_at | TRANSACTIONS_APPLIED |
| group_replication_recovery_compression_algorithms | uncompressed |
| group_replication_recovery_get_public_key | ON |
| group_replication_recovery_public_key_path | |
| group_replication_recovery_reconnect_interval | 60 |
| group_replication_recovery_retry_count | 10 |
| group_replication_recovery_tls_ciphersuites | |
| group_replication_recovery_tls_version | TLSv1,TLSv1.1,TLSv1.2,TLSv1.3 |
| group_replication_recovery_use_ssl | OFF |
| group_replication_recovery_zstd_compression_level | 3 |
| group_replication_single_primary_mode | ON |
| group_replication_ssl_mode | DISABLED |
| group_replication_start_on_boot | OFF |
| group_replication_tls_source | MYSQL_MAIN |
| group_replication_transaction_size_limit | 150000000 |
| group_replication_unreachable_majority_timeout | 0 |
| group_replication_view_change_uuid | AUTOMATIC |
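(For reference, a dump like the one above can be produced with standard MySQL 8.0 queries; the LIKE pattern below is only an assumption about how the list was generated, which would also explain group_concat_max_len appearing in it.)

```sql
-- Dump the group replication configuration of the local member.
SHOW GLOBAL VARIABLES LIKE 'group%';

-- Equivalent query against performance_schema, easier to filter:
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM performance_schema.global_variables
WHERE VARIABLE_NAME LIKE 'group_replication%'
ORDER BY VARIABLE_NAME;
```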
[15 Nov 2021 14:25]
MySQL Verification Team
Hi, thanks for the report. I'm trying to reproduce this but have not managed to so far. Some info that might be relevant:

- How did you originally deploy the cluster?
- How exactly did you perform the upgrade of the cluster?
- How much load did you have on the cluster while upgrading?
- Can you share the full ProxySQL config? (Setting 'max_transactions_behind' to '0' is not going to work until the system is back at full capacity, so that part is expected.)
- Was COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE ever higher than 1?
- Did you try stopping traffic completely for a minute to see if the system would "settle" (COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE becomes 0, max_transactions_behind goes to 0, ...)?

Thanks
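(For readers following along, the counters referenced in these questions come from standard performance_schema tables; a minimal check on each member might look like this:)

```sql
-- Membership and state of each node (ONLINE / RECOVERING / ...).
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;

-- Per-member applier/certifier statistics, including the queue counter
-- discussed in this report.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
FROM performance_schema.replication_group_member_stats;
```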
[29 Nov 2021 2:54]
NOC NOC
The cluster was deployed manually (i.e. not using InnoDB Cluster): bootstrapping one node, then joining the others. We're using GTIDs, by the way.

The upgrade was performed by shutting down mysql on a node (non-primaries first), upgrading it (Debian package upgrade), then starting mysql, rejoining the GR cluster, waiting until it was out of 'recovery', and then moving on to the next node. I didn't notice the applier queue value at the time, so I'm not sure when it started occurring.

The cluster was only lightly loaded while upgrading; it doesn't get a lot of use yet. Our ProxySQL config is not running any clever query rules, so only the primary mysql server is utilised anyway. max_transactions_behind is really only of diagnostic interest to us, rather than something we rely on for performance.

I never saw COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE higher than 1. Traffic was not entirely stopped to see if it would catch up, but there would certainly have been periods where no queries were being replicated, given the general load on the cluster.

Hope that helps...
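(A rough sketch of the per-node rejoin step described above, assuming group_replication_start_on_boot = OFF as shown in the variable dump, so the plugin has to be started manually after the restart:)

```sql
-- After the package upgrade and mysqld restart on the node being upgraded:
START GROUP_REPLICATION;

-- Wait until the member has left RECOVERING and is ONLINE again
-- before moving on to the next node.
SELECT MEMBER_HOST, MEMBER_STATE
FROM performance_schema.replication_group_members;
```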
[17 Feb 2022 19:57]
NOC NOC
I've just run into this problem again, this time without an upgrade being part of the equation (the instance was already on 8.0.28). Same cluster (3-node, single primary). An unrelated package update required each instance to be restarted at the OS level. The first two nodes went perfectly fine. The last node was the primary, so I manually moved the primary to another node using 'group_replication_set_as_primary'. Then, on the node to be restarted, I:

- set innodb_fast_shutdown = 0
- stopped mysql
- restarted the node
- rejoined the instance into GR using START GROUP_REPLICATION

The instance successfully joined and became ONLINE/SECONDARY. However, 'select * from performance_schema.replication_group_member_stats \G' again always shows a query in the applier queue:

*************************** 2. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 16041527307045398:122
MEMBER_ID: cc0105fa-1f98-11eb-bdef-525400149847
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 31718
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 24
TRANSACTIONS_COMMITTED_ALL_MEMBERS: bfd5c347-c26e-42a1-af4b-fed4aabd117e:1-12956641:13827893-13828502:14828082-14828097
LAST_CONFLICT_FREE_TRANSACTION: bfd5c347-c26e-42a1-af4b-fed4aabd117e:12956655
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 1
COUNT_TRANSACTIONS_REMOTE_APPLIED: 31718
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0

(I *have* once seen the value jump to 2, but it's always at least 1.)

This time I've tried what worked previously - promote the instance, demote the instance, restart mysql - but this hasn't helped. I can't get this instance back to a 0 applier queue, so ProxySQL won't consider it a valid backend server. https://lefred.be/content/mysql-innodb-cluster-how-to-handle-fc/ suggests setting group_replication_flow_control_mode to DISABLED; that did not help either.

I can't find any way of inspecting the applier queue to learn more about the query that is in it - is it always the same query, or does it change?
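(One way to get a little more visibility into what the group_replication_applier channel is doing is via the standard replication applier tables in performance_schema; whether they reveal anything useful for this particular stuck counter is untested:)

```sql
-- What each applier worker last applied and is currently applying
-- on the group_replication_applier channel.
SELECT WORKER_ID,
       LAST_APPLIED_TRANSACTION,
       APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'group_replication_applier';

-- Overall applier status for the same channel.
SELECT CHANNEL_NAME, SERVICE_STATE, REMAINING_DELAY
FROM performance_schema.replication_applier_status
WHERE CHANNEL_NAME = 'group_replication_applier';
```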
[17 Feb 2022 20:46]
NOC NOC
OK, I may have found at least a tangential correlation. We had:

replica_parallel_workers = 3
replica_preserve_commit_order = 1
replica_parallel_type = LOGICAL_CLOCK

8.0.27 changed the default replica_parallel_workers to '4', which had me thinking. I set replica_parallel_workers = 0, restarted mysql, and my applier queue went down to 0. Reverted it to 3, restarted mysql, and the applier queue jumped back up to 1.

replica_parallel_workers is actually a dynamic variable, so I set it to 0 and stopped/started group replication, and now the applier queue is back to 0. So at least my current woe is 'sorted', at the expense of parallel replica workers...
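(The dynamic version of that workaround, roughly as described; the usual caveat applies in that it trades away parallel apply on the member:)

```sql
-- Switch the applier back to a single-threaded worker.
SET GLOBAL replica_parallel_workers = 0;

-- Restart Group Replication so the applier channel picks up the change.
STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

-- Confirm the queue has drained.
SELECT MEMBER_ID, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
FROM performance_schema.replication_group_member_stats;
```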
[31 May 2022 14:11]
Keith Lammers
I may have found an easier workaround for this. The last couple of times that I have run into it, restarting the mysql service with systemctl doesn't resolve it. However, if I connect with MySQL Shell and issue a "restart;", it seems to come back and properly clear out the applier queue.
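(For reference, that is the server-side RESTART statement available in MySQL 8.0 when the server runs under a supervisor such as systemd or mysqld_safe; it is not specific to MySQL Shell, which just happens to be the client used above.)

```sql
-- Restart the server in-place; requires the SHUTDOWN privilege.
RESTART;
```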