Bug #97293: group replication plugin memory leak
Submitted: 18 Oct 2019 16:51    Modified: 20 Nov 2019 16:46
Reporter: Benoît Guyard
Status: Not a Bug    Impact on me: None
Category: MySQL Server: Group Replication    Severity: S2 (Serious)
Version: at least 8.0.16, 8.0.17 & 8.0.18    OS: Ubuntu (Ubuntu 18.04.2 LTS)
Assigned to: MySQL Verification Team    CPU Architecture: x86 (GNU/Linux 4.15.0 x86_64)
Tags: group replication, Leak, Memory

[18 Oct 2019 16:51] Benoît Guyard
Description:
- 3-node multi-primary group replication cluster
- mysql seems to be eating memory indefinitely (the same happens on all 3 nodes in the cluster)
- the behavior started when this cluster was upgraded from 5.7 to 8.0.16
- upgrading to 8.0.17 and 8.0.18 did not stop or change this behavior
- the only cure found so far is to restart the server (merely restarting group replication does not release the memory)

- the main reason for suspecting the group replication plugin so far is the steady growth shown by this monitoring loop (a per-instrument drill-down is sketched just after its output):

~# while true ; do echo "$(date) - memory/group_rpl current_alloc:" $(mysql -BNe "SELECT sys.format_bytes(SUM(current_alloc)) AS current_alloc FROM sys.x\$memory_global_by_current_bytes WHERE event_name like 'memory/group_rpl%' GROUP BY SUBSTRING_INDEX(event_name,'/',2)") "- total alloc:" $(mysql -BNe "select * from sys.memory_global_total") ; sleep 600 ; done
Wed Oct 16 21:16:35 UTC 2019 - memory/group_rpl current_alloc: 181.22 MiB - total alloc: 1.64 GiB
Wed Oct 16 21:26:35 UTC 2019 - memory/group_rpl current_alloc: 182.26 MiB - total alloc: 1.64 GiB
[..]
Thu Oct 17 21:26:38 UTC 2019 - memory/group_rpl current_alloc: 336.36 MiB - total alloc: 1.79 GiB
Thu Oct 17 21:36:38 UTC 2019 - memory/group_rpl current_alloc: 337.41 MiB - total alloc: 1.80 GiB
[..]
Fri Oct 18 15:46:44 UTC 2019 - memory/group_rpl current_alloc: 459.77 MiB - total alloc: 1.91 GiB
Fri Oct 18 15:56:44 UTC 2019 - memory/group_rpl current_alloc: 460.87 MiB - total alloc: 1.91 GiB
etc.
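
To drill down into which instrument under memory/group_rpl is actually growing, a variation of the same query can be used (only a sketch along the lines of the loop above; the exact instrument names under memory/group_rpl/ differ between versions):

# per-instrument view of what memory/group_rpl currently holds, largest allocations first
~# mysql -e "SELECT event_name, sys.format_bytes(current_alloc) AS alloc FROM sys.x\$memory_global_by_current_bytes WHERE event_name LIKE 'memory/group_rpl/%' ORDER BY current_alloc DESC"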

- the group replication config currently in use is the following (a quick membership sanity check is sketched right after it):

#
# * Group Replication Requirements
#
gtid_mode                                           = ON
enforce_gtid_consistency                            = ON
master_info_repository                              = TABLE
relay_log_info_repository                           = TABLE
binlog_checksum                                     = NONE
log_slave_updates                                   = ON
binlog_format                                       = ROW
# prevent use of non-transactional storage engines
disabled_storage_engines                            = 'MyISAM,BLACKHOLE,FEDERATED,ARCHIVE'
# InnoDB gap locks are problematic for multi-primary conflict detection; none are taken with READ-COMMITTED.
# So if you don't rely on REPEATABLE-READ semantics and/or want to use multi-primary mode,
# this isolation level is recommended.
transaction-isolation                               = 'READ-COMMITTED'
#
# * Group Replication Settings
#
plugin-load                                         = group_replication.so
transaction_write_set_extraction                    = XXHASH64
group_replication_group_name                        = '5e340231-4b60-11e9-be11-005056b8127c'
group_replication_start_on_boot                     = ON
group_replication_bootstrap_group                   = OFF
group_replication_ssl_mode                          = REQUIRED
group_replication_recovery_use_ssl                  = 1
group_replication_local_address                     = '<NODE_1_IP>:33061'
group_replication_group_seeds                       = '<NODE_1_IP>:33061,<NODE_2_IP>:33061,<NODE_3_IP>:33061'
group_replication_ip_whitelist                      = '<NODE_1_IP>,<NODE_2_IP>,<NODE_3_IP>'
group_replication_single_primary_mode               = OFF
group_replication_enforce_update_everywhere_checks  = ON
group_replication_unreachable_majority_timeout      = 30
group_replication_exit_state_action                 = 'READ_ONLY'
group_replication_autorejoin_tries                  = 3
report_host                                         = '<NODE_1_IP>'
super_read_only                                     = ON
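
For reference, a quick way to sanity-check that all three members are ONLINE and that the group really runs multi-primary (every member should report MEMBER_ROLE = PRIMARY) is a standard performance_schema query along these lines:

# membership overview; in multi-primary mode all members show MEMBER_ROLE = PRIMARY
~# mysql -e "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION FROM performance_schema.replication_group_members"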

- lastly, there is very little activity in this cluster (a very low WRITE rate): I mostly mention this to point out that the memory-leak behavior does not seem to be tied to any real workload whatsoever

How to repeat:
No specific scenario: just start the group replication process and eat popcorn while memory usage goes up and up until exhaustion / the oom-killer creeps out from its cave.
[21 Oct 2019 17:26] MySQL Verification Team
Thanks for the report. I can't find anything with valgrind, but I'll leave it running now for a few days with your script & config to see if I can reproduce this behavior.
[25 Oct 2019 10:16] MySQL Verification Team
Hi,
Four days later and I still don't see any RAM issues with 8.0.18.
[30 Oct 2019 17:02] Benoît Guyard
Hello Bogdan,

Indeed, it seems I jumped to conclusions a bit hastily: looking more closely at the monitoring metrics, the memory-leak behavior actually seems to have started after an upgrade of the libssl & openssl packages, and not when this group replication cluster was upgraded from 5.7 to 8.0.16 (which happened earlier).

The libssl/openssl package upgrades I'm referring to are:
> Upgrade: libssl1.1:amd64 (1.1.1-1ubuntu2.1~18.04.3, 1.1.1-1ubuntu2.1~18.04.4)
> Upgrade: openssl:amd64 (1.1.1-1ubuntu2.1~18.04.3, 1.1.1-1ubuntu2.1~18.04.4)

The cluster was then upgraded to 8.0.17 and the memory-leak issue was still happening, but I had not started digging deeper at that stage and was just watching mysql eat RAM until exhaustion.

The cluster was then upgraded to 8.0.18 and, upon still witnessing the same behavior, I started digging... but I did not wait long enough before filing this bug report. It turns out that with v8.0.18, mysql memory usage actually becomes stable at some point, and the group replication XCom cache only grows until it reaches 1 GiB, exactly as documented here: https://dev.mysql.com/doc/refman/8.0/en/group-replication-performance-xcom-cache.html

The script I was using to monitor this shows it very clearly:

Tue Oct 29 16:07:22 UTC 2019 - memory/group_rpl current_alloc: 968.37 MiB - total alloc: 2.40 GiB
[..]
Tue Oct 29 20:37:24 UTC 2019 - memory/group_rpl current_alloc: 1000.91 MiB - total alloc: 2.43 GiB
[..]
Tue Oct 29 23:57:24 UTC 2019 - memory/group_rpl current_alloc: 1024.00 MiB - total alloc: 2.46 GiB
[..]
Wed Oct 30 16:07:27 UTC 2019 - memory/group_rpl current_alloc: 1024.00 MiB - total alloc: 2.46 GiB
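
That 1024.00 MiB plateau matches the default XCom message cache limit. A quick way to double-check what the limit is set to on a node (via group_replication_message_cache_size, available from 8.0.16 on; shown here only as a sketch) is:

# the XCom message cache limit in bytes; the default 1073741824 (1 GiB) matches the plateau above
~# mysql -e "SHOW GLOBAL VARIABLES LIKE 'group_replication_message_cache_size'"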

Bottom line: I still don't really understand what was making mysql so unhappy before, but it seems to have been related to libssl/openssl & mysql versions < 8.0.18.
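
For anyone wanting to check the same on their side, which OpenSSL build mysqld actually links against can be verified with standard Ubuntu tooling (a sketch; the /usr/sbin/mysqld path assumes the stock packages):

# which libssl/libcrypto shared objects the server binary resolves to
~# ldd /usr/sbin/mysqld | grep -E 'libssl|libcrypto'
# the package versions currently installed (compare against the upgrade log above)
~# dpkg -l libssl1.1 openssl
~# openssl version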
 
Sorry for the noise and thank you for your time!
Benoît
[4 Nov 2019 14:46] MySQL Verification Team
Thanks for the update!