Bug #114286 Using Clone Recovery for addInstance on MySQL InnoDB Cluster uses 100% Memory
Submitted: 8 Mar 2024 18:15  Modified: 9 Apr 2024 18:49
Reporter: Marvin Warble
Status: Can't repeat
Category: MySQL Server  Severity: S1 (Critical)
Version: 8.0.36  OS: Debian (12)
Assigned to: MySQL Verification Team  CPU Architecture: x86
Tags: InnoDB Cluster

[8 Mar 2024 18:15] Marvin Warble
Description:
When trying to add a new node to a MySQL InnoDB Cluster using Cluster.addInstance() with the Clone recovery method, the donor node begins consuming memory at a rate of about 2GB per minute, eventually crashing the donor node.

Stopping the recipient node has no effect on the memory consumption once it begins.

I have not encountered this problem when setting up a test cluster with a small amount of data, which I have done several times to ensure I'm doing everything correctly. However, on my production server with about 125GB of data, this problem occurs consistently.

The test cluster and production server are configured identically. The only difference is the number of databases and the amount of data present.

It should be mentioned that the production server is set up as a single-tenant server with about 9,000 databases, and innodb_file_per_table is set to OFF.

The production server has 36GB of memory allocated to it. When the clone process starts, the donor node is using only about 12GB of the 36GB available. Within 7 minutes the donor consumes all available memory and crashes.

I am currently stuck using Incremental as the recovery method, which takes several days to bring one node online.

How to repeat:
I don't know what the data requirements would be to repeat this problem. I'm assuming the memory consumption is somehow related to the number of databases that need to be cloned.
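For reference, the failing operation was of the form below. This is a generic MySQL Shell sketch; the account and host names are placeholders, not details from the report:

```js
// MySQL Shell (JS mode); addresses and credentials are hypothetical
\connect clusteradmin@primary-node:3306
var cluster = dba.getCluster();
// Clone recovery provisions the recipient by streaming a full
// physical snapshot from a donor member of the cluster
cluster.addInstance('clusteradmin@new-node:3306', {recoveryMethod: 'clone'});
```

With recoveryMethod: 'incremental', the same call instead replays binary logs on the recipient, which is the slow workaround described above.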
[11 Mar 2024 11:26] MySQL Verification Team
Hi Mr. Warble,

This is a forum for reports with fully repeatable test cases.

Until we get a fully repeatable test case from you, we cannot start processing your report.

Thanks in advance ......
[12 Mar 2024 3:38] MySQL Verification Team
Are all MySQL Servers in question 8.0.36?

I just recently tested something else and cloned 300GB of data without a problem. I do not generally recommend cloning myself, but it did work for me with 300GB spread across some 5,000 databases. It was a test setup, so not real data, but it was the best I could recreate for this test.

Can you share a full configuration?
[12 Mar 2024 11:58] Marvin Warble
Thanks. Due to some serious time constraints at the moment, I'm unable to set up a full test environment myself.

Yes, all servers are 8.0.36.

If you don't mind, could you test one other thing? I believe I first noticed the problem when I switched innodb_file_per_table = OFF.
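For anyone trying to reproduce this, the setting in effect on the donor can be confirmed from MySQL Shell. This is a generic check, not part of the original report:

```js
// MySQL Shell (JS mode), connected to the donor
var res = session.runSql("SELECT @@global.innodb_file_per_table");
// 0 means OFF: all tables live in the shared system tablespace
print(res.fetchOne()[0]);
```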
[12 Mar 2024 13:52] MySQL Verification Team
I'll test, but with that many databases I would not recommend that.

Since this is not a 10-minute test, can you share a full config so we have a higher chance of reproduction?
[12 Mar 2024 17:17] Marvin Warble
All servers are configured the same, except for the server ID.

[mysql]
default-character-set = utf8mb4

[mysqld]
server-id=1
datadir=/data/mysql
socket=/var/run/mysqld/mysqld.sock
log-error=/var/log/mysql/error.log
pid-file=/var/run/mysqld/mysqld.pid
early-plugin-load=keyring_file.so
keyring_file_data=/var/lib/mysql/keyring-data/keyring
log-bin
skip-name-resolve
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
federated
wait_timeout = 120
lock_wait_timeout = 60
ssl-ca=/data/mysql/ca.pem
ssl-cert=/data/mysql/server-cert.pem
ssl-key=/data/mysql/server-key.pem
log-error-verbosity=1

# Changed for Innodb Cluster
binlog_transaction_dependency_tracking = WRITESET
enforce_gtid_consistency = ON
gtid_mode = ON
#group_replication_single_primary_mode = OFF

#
# * Fine Tuning
#
key_buffer_size         = 500M
max_allowed_packet      = 500M

# CACHES AND LIMITS #
tmp_table_size                 = 1G
max_heap_table_size            = 1G
max_connections                = 20000
thread_cache_size              = 2500
open_files_limit               = 1024000
table_definition_cache         = 1024
table_open_cache               = 50000      
table_open_cache_instances     = 16         

# INNODB #
default_storage_engine=InnoDB

innodb_flush_method            = O_DIRECT
innodb_redo_log_capacity       = 10737418240
innodb_flush_log_at_trx_commit = 1          
innodb_log_buffer_size         = 10G        
innodb_log_file_size           = 1G         
innodb_file_per_table          = OFF          
innodb_buffer_pool_size        = 16G          
innodb_buffer_pool_instances   = 15           
innodb_read_io_threads         = 12           
innodb_write_io_threads        = 12
innodb_commit_concurrency      = 0
innodb_thread_concurrency      = 0
innodb_doublewrite             = ON          
innodb_lru_scan_depth          = 256
innodb_autoinc_lock_mode       = 2
[12 Mar 2024 17:19] Marvin Warble
The reason I have innodb_file_per_table turned off is that server initialization was taking an extremely long time with it turned on: 1 to 2 hours.

With it turned off, it takes less than 5 minutes.
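One way to narrow down where the donor's memory goes is to watch the clone and memory instrumentation while a clone is in flight. This is a generic monitoring sketch, not something from the report; clone_progress is populated on the recipient, while the memory view is most useful on the donor:

```js
// MySQL Shell (JS mode); generic monitoring queries
// On the recipient: current clone stage and bytes transferred
var progress = session.runSql(
    "SELECT stage, state, estimate, data " +
    "FROM performance_schema.clone_progress");
shell.dumpRows(progress);

// On the donor: top allocators by currently allocated bytes
// (requires the sys schema, installed by default in 8.0)
var mem = session.runSql(
    "SELECT event_name, current_alloc " +
    "FROM sys.memory_global_by_current_bytes LIMIT 10");
shell.dumpRows(mem);
```

If the runaway allocation shows up under a single event_name, that would make the report much easier to act on.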
[9 Apr 2024 18:49] MySQL Verification Team
I did not manage to reproduce this. It might be more a combination of the filesystem and the huge number of files than a bug in MySQL.

Testing with filesystems that handle large numbers of files better, like reiserfs/xfs/zfs, might solve the problem.