MySQL Bugs: #106537: mysql innodb deadlock memory leak

Bug #106537	mysql innodb deadlock memory leak
Submitted:	22 Feb 2022 13:56	Modified:	18 Oct 2022 11:44
Reporter:	pavel rekun	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S1 (Critical)
Version:	5.7.40	OS:	Windows
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
We have an innodb cluster consisting of 4 nodes. single primary . the issue that we experience for the last probably 2 years is that over time the master is slowed down to the point that a simple select of 1000 records can take over a minute. tried checking innodb engine status, full process list and can't find any issue.

hardware:
virtual machine with 16 cores (xeon 8270@2.70) 
ram 48gb. 
windows server 2016 standard

mysql configured with 21gb for innodb buffer pool.

When the issue occurs the following happens:
db1 (master) - select * from users limit 1000 takes more than 1 minutes.
db2/3/4 (slave) same select returns instantly.

Whole cluster is still in sync. the only solution is to kill db1, db2 becomes master, then starting db1 again which instantly syncs.
When the issue happens the server still has plenty of resouces (low cpu and probably 40% ram free)

Tried to update to the latest mysql 5.7.36 and still the same. looks like this issue happens more often with time. in the past it was happening maybe once in 6 months. now it happens every week.

I can also see small jump in master's memory usage when the issue happens - hits 27 gb while slaves all on 24gb.

db takes around 1.5tb. 

now way of reproducing.

Is this some obvious behavior and I'm missing something? 
Only strange thing i see in the slaves logs (not sure it's related to the issue) 
2022-02-22T08:35:19.635351Z 42 [Note] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 120; events ssigned = 342669313; worker queues filled over overrun level = 1844; waited due a Worker queue full = 41; waited due the total size = 0; waited t lock conflicts = 21398150794200 waited (count) when Workers occupied = 438439 waited when Workers occupied = 1001930351500

attached config.

port=3306
no-beep
default-character-set=utf8
port=3306
basedir="C:/Program Files/MySQL/MySQL Server 5.7/"
datadir=d:/MySQL Server 5.7/Data
default-storage-engine=INNODB
sql-mode="STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
log-output=FILE
general-log=0
general_log_file="D:\MySQL Server 5.7\Data\db02.log"
slow-query-log=1
slow_query_log_file="D:\MySQL Server 5.7\Data\db02-slow.log"
long_query_time=10
log-error="D:\MySQL Server 5.7\Data\db02.err"
lower_case_table_names=1
secure-file-priv="D:/MySQL Server 5.7/Uploads"
max_connections=5000
table_open_cache=2000
thread_cache_size=10
myisam_max_sort_file_size=100G
myisam_sort_buffer_size=12G
key_buffer_size=10M
read_buffer_size=64K
read_rnd_buffer_size=256K
innodb_log_buffer_size=1M
innodb_thread_concurrency=0
innodb_autoextend_increment=64
innodb_buffer_pool_instances=8
innodb_concurrency_tickets=5000
innodb_old_blocks_time=1000
innodb_open_files=300
innodb_stats_on_metadata=0
innodb_file_per_table=1
innodb_checksum_algorithm=0
back_log=80
flush_time=0
join_buffer_size=256K
max_allowed_packet=128M
max_connect_errors=100
open_files_limit=4161
sort_buffer_size=256K
table_definition_cache=1400
binlog_row_event_max_size=8K
sync_master_info=10000
sync_relay_log=10000
sync_relay_log_info=10000
server-id=102
innodb_buffer_pool_size=21G
innodb_log_file_size=1G
innodb_write_io_threads=64
innodb_read_io_threads=64
bulk_insert_buffer_size=256M
tmp_table_size=2G
max_heap_table_size=2G
innodb_data_home_dir="D:\MySQL Server 5.7\Data"
innodb_log_group_home_dir="D:\MySQL Server 5.7\Data"
skip-slave-start
gtid_mode = on
enforce_gtid_consistency = on
sync_binlog=0
innodb_flush_log_at_trx_commit=1
innodb_autoinc_lock_mode=0
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=4
slave_preserve_commit_order=1
slave_pending_jobs_size_max=157286400
log-slave-updates = ON
log-bin=db02-bin
binlog_format=ROW
master_info_repository=TABLE
relay_log_info_repository=TABLE
transaction_write_set_extraction = XXHASH64
report_port = 3306
binlog_checksum = NONE
report_host=db02
relay_log=db02-relay-bin
pid-file=db02.pid
default-time-zone='+00:00'
expire_logs_days=7
auto_increment_increment = 1
auto_increment_offset = 2
group_replication_member_weight = 80
tmpdir = "D:/tmp"
super-read-only
group_replication_allow_local_disjoint_gtids_join = OFF
group_replication_allow_local_lower_version_join = OFF
group_replication_auto_increment_increment = 7
group_replication_bootstrap_group = OFF
group_replication_components_stop_timeout = 31536000
group_replication_compression_threshold = 1000000
group_replication_enforce_update_everywhere_checks = OFF
group_replication_exit_state_action = READ_ONLY
group_replication_flow_control_applier_threshold = 25000
group_replication_flow_control_certifier_threshold = 25000
group_replication_flow_control_mode = QUOTA
group_replication_force_members
group_replication_group_name = ab4c9894-7f55-11eb-b514-0050562f0c04
group_replication_group_seeds = db01:33061,db03:33061,db04:33061
group_replication_gtid_assignment_block_size = 1000000
group_replication_ip_whitelist = AUTOMATIC
group_replication_local_address = db02:33061
group_replication_poll_spin_loops = 0
group_replication_recovery_complete_at = TRANSACTIONS_APPLIED
group_replication_recovery_reconnect_interval = 60
group_replication_recovery_retry_count = 10
group_replication_recovery_ssl_ca
group_replication_recovery_ssl_capath
group_replication_recovery_ssl_cert
group_replication_recovery_ssl_cipher
group_replication_recovery_ssl_crl
group_replication_recovery_ssl_crlpath
group_replication_recovery_ssl_key
group_replication_recovery_ssl_verify_server_cert = OFF
group_replication_recovery_use_ssl = ON
group_replication_single_primary_mode = ON
group_replication_ssl_mode = REQUIRED
group_replication_start_on_boot = ON
group_replication_transaction_size_limit = 0
group_replication_unreachable_majority_timeout = 0

How to repeat:
not sure.

few other things i forgot to mention:
all disk are ssd.
when the issue happens all queries to master gets locked even a simple update/select.
slaves performing normally.

Hi,

I cannot reproduce this, there is no data about the problem we can even start to figure out what the problem could be. You need to setup some monitoring tools (MySQL Enterprise Monitor is great monitoring tool) to monitor status of your MySQL server and your OS (disk, ram, cpu..). Usually running MySQL Enterprise Monitor will help you solve this issue.

ok. i'm not monitoring the memory usage. the master node keeps growing in memory. 
currently memory/innodb/std keeps growing (don't see this on the slaves) is that normal?

see the output

select event_name, current_alloc, high_alloc
from sys.memory_global_by_current_bytes
where current_count > 0;

'memory/innodb/std', '163.48 MiB', '193.50 MiB'
'memory/sql/String::value', '90.67 MiB', '222.66 MiB'
'memory/performance_schema/events_statements_summary_by_thread_by_event_name', '43.35 MiB', '43.35 MiB'
'memory/performance_schema/table_handles', '36.25 MiB', '36.25 MiB'
'memory/performance_schema/memory_summary_by_thread_by_event_name', '28.12 MiB', '28.12 MiB'
'memory/mysys/IO_CACHE', '24.02 MiB', '311.56 MiB'
'memory/performance_schema/events_statements_history', '17.48 MiB', '17.48 MiB'
'memory/performance_schema/events_statements_current', '17.48 MiB', '17.48 MiB'
'memory/performance_schema/events_waits_summary_by_thread_by_event_name', '16.95 MiB', '16.95 MiB'
'memory/performance_schema/events_statements_history_long', '13.66 MiB', '13.66 MiB'
'memory/performance_schema/events_statements_history.tokens', '12.50 MiB', '12.50 MiB'
'memory/performance_schema/events_statements_history.sqltext', '12.50 MiB', '12.50 MiB'
'memory/performance_schema/events_statements_current.tokens', '12.50 MiB', '12.50 MiB'
'memory/performance_schema/events_statements_current.sqltext', '12.50 MiB', '12.50 MiB'
'memory/performance_schema/events_statements_summary_by_digest.tokens', '9.77 MiB', '9.77 MiB'
'memory/performance_schema/events_statements_history_long.tokens', '9.77 MiB', '9.77 MiB'
'memory/performance_schema/events_statements_history_long.sqltext', '9.77 MiB', '9.77 MiB'
'memory/innodb/row0sel', '8.96 MiB', '2.56 GiB'
'memory/sql/Gtid_set::Interval_chunk', '6.53 MiB', '22.57 MiB'
'memory/performance_schema/events_stages_summary_by_thread_by_event_name', '5.86 MiB', '5.86 MiB'

ok going over the "SHOW ENGINE INNODB STATUS;" output i attached before - i noticed that we have dead locks happening - these deadlocks happen only on the master (this explains why the memory leak happens only on the primary node and never on the slaves).
So i think as time goes by, as more deadlocks happen on the master the memory leaks.
As of now - server runs for 7 days and after 613 deadlock (SELECT `count` FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME="lock_deadlocks";) the server leaked about 1.2gb compared to the slaves.
Is this making sense to you? are you aware of such scenario?

Hi Pavel,

I still can not reproduce. Even forcing deadlocks (a lot of them) I do not see the leaks nor I see the slowdown (I cannot get "select * from users limit 1000" to be slow whatever I do).

Three things
1. I think opening a ticket with MySQL Support would be beneficial to you
2. If you can test this with 8.0 would be interesting
3. Gathering more data - namely profile the mysql query ( https://dev.mysql.com/doc/refman/5.7/en/show-profile.html ) this might give us idea what is going on, as even with mem leak you say you have more than 40% free ram so it does not look like mem leak is the problem

thanks

ok few updates:

all deadlocks were basically removed (code side solution) - we'll see if it affects the reported issue.

As to the memory leak, I'll rephrase - it's not an actual memory leak but looks like a resource exhaustion within the allowed allocation of the mysql master instance.
So in our case each instance it set to 21GB innodb buffer, over time (takes about a week) the memory usage on the master node climbs to about 27.5GB (commit size) and then we start to notice the slowdowns with each queries, while on slaves it stays on 25GB .

Both slave and master have about 700 connections at any given time.
The only difference load wise is that in the master we run about 5k queries/sec while on slaves around 700 queries/sec.

Any obvious setting (settings included in the initial post) that can cause the master run out of resources? machine still plenty of memory (around 20gb free).

I also used the memory statistics of mysql - didn't see any memory leak.

Other things we changed - 

#changed from 2GB to 210MB on 25/5/22 (should be 1% of mysql memory) .per connection
tmp_table_size=210M
max_heap_table_size=210M 

also noticed strange behavior where several of our apps were constantly closing and reopening connections - also fixed (might be related?)

as to your previous points:
testing with MySQL 8 - a bit of an issue since we can't reproduce the case it will require us upgrading to mysql 8.0 (not sure if the issue will be solved or even worse) so we'll wait with this until we try all other options (but we'll need to upgrade it at some point)

As to the profiler - I'll run it when the issue happens (can takes weeks so lets see)

Appreciate the help.

Hi,

Thanks for the updates.

As I mentioned, you would really benefit from the MySQL Support team as they can help you track issue in your code and issue with your configuration and make you get max out of your system. I know MySQL Support is not free but IMHO it is best value for the money in the entire RDBMS ecosystem. Through the bugs system we can only track and fix bugs in MySQL code.

I cannot get a simple query like that to be slow no matter what I do so it is important to find out what is making it slow. Query profiler will help, also, check all the server stat (IO and CPU load, RAM status etc..) at the moment of the slow query.

kind regards

SUCCESS :)

Managed to reproduce the memory leak (yes, it's indeed a memory leak)
Tested with mysql 5.7.36 x64.
Jdk 1.7

Basiclly i've set innodb_lock_wait_timeout=1; to speed up the deadlocks and this is the code I've run. with innodb_buffer_pool_size=8M after about 10 hours, mysql instance went from 150MB to 3GB.

private static void testMysqlLeak() {
        int threads = 50;
        final CountDownLatch latch = new CountDownLatch(threads);
        final AtomicInteger activeThreads = new AtomicInteger();
        while (true) {

            for (int i = 0; i < threads; i++) {

                final int threadId = i;
                activeThreads.incrementAndGet();
                new Thread(new Runnable() {
                    public void run() {
                        try {

                            ResultSet res = null;
                            Statement stat = null;
                            Connection con = null;

                            try {
                                String dbURL = "jdbc:mysql://localhost:3307/test";
                                String username = "root";
                                String password = "123";

                                Class.forName("com.mysql.jdbc.Driver");
                                con = DriverManager.getConnection(dbURL, username, password);
                                con.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

                                int i = 0;

                                con.setAutoCommit(false);

                                stat = con.createStatement();
                                System.out.println("thread:" + threadId);
                                latch.countDown();
                                latch.await();

                                if (threadId < 5) {
                                    stat.execute("update admin_users set last_login=now() where id=1");
                                } else {
                                    stat.execute("update admin_users set last_login=now() where id=2");
                                    stat.execute("update admin_users set password=now() where id=1");
                                }

                                sleep(5000);
                                con.commit();
                                System.out.println("done");

                                i++;
                                if (i % 1000 == 0)
                                    System.out.println("count:" + i);

                            } catch (Exception e) {
                                e.printStackTrace();
                            } finally {
                                closeResultSet(res);
                                closeStatement(stat);
                                closeConnection(con);
                                activeThreads.decrementAndGet();
                            }

                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }

                }).start();
            }

            while (activeThreads.get() != 0) {
                try {
                    System.out.println("active threads:" + activeThreads.get());
                    Thread.sleep(1000);

                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

//        System.out.println("done");

    }

DROP TABLE IF EXISTS `test`.`admin_users`;
CREATE TABLE  `test`.`admin_users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_name` varchar(255) DEFAULT NULL,
  `magic` varchar(255) DEFAULT NULL,
  `password` varchar(255) DEFAULT NULL,
  `last_login` datetime DEFAULT NULL,
  `session_id` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uniqe` (`user_name`),
  KEY `user_name` (`user_name`),
  KEY `magic` (`magic`),
  KEY `password` (`password`),
  KEY `last_login` (`last_login`),
  KEY `session_id` (`session_id`)
) ENGINE=InnoDB AUTO_INCREMENT=456339 DEFAULT CHARSET=utf8;

Let me know if you need anything else on your end to run it.
P.S - it has nothing to do with innodb cluster...

tested on 8.0.29 - NOT reproducible.

I'm closing this. I cannot reproduce this with latest source/binaries. Since you cannot reproduce with 8.0.29 neither I'm considering this closed

thanks

Hi.

Still happening with latest 5.7.40.

Just noticed i forgot to add the rows you need in the table to create the actual deadlock.
Also let me remind you - testing this on windows - not sure if it happens on linux...

insert into test.admin_users(id,user_name) values(1,'user1');
insert into test.admin_users(id,user_name) values(2,'user2');

Please test it again and let me know - this is a very simple test that's easily reproducible each time and exposes a huge issue.

If anything further needed please let me know.