Bug #95115 | mysqld deadlock of all client threads originating from 3-way deadlock | ||
---|---|---|---|
Submitted: | 24 Apr 2019 17:16 | Modified: | 18 Jul 2019 13:18 |
Reporter: | Sandeep Dube | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
Version: | 5.7.24 | OS: | CentOS (CentOS Linux release 7.6.1810 (Core)) |
Assigned to: | | CPU Architecture: | x86 (x86_64, ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked (uses shared libs))
Tags: | deadlock, slave thread initialization |
[24 Apr 2019 17:16]
Sandeep Dube
[24 Apr 2019 17:18]
Sandeep Dube
Complete description with trace
Attachment: mysql_dedlock_trace.rtf (text/rtf), 20.84 KiB.
[24 Apr 2019 17:28]
Sandeep Dube
mysqld deadlock of all client threads originating from 3-way deadlock related to slave thread initialization
[24 Apr 2019 17:40]
Sandeep Dube
Complete description with trace, PDF format
Attachment: mysql_dedlock_trace_pdf.pdf (application/pdf, text), 73.89 KiB.
[24 Apr 2019 17:46]
Sandeep Dube
Complete description with trace, HTML format
Attachment: mysql_dedlock_trace_html.html (text/html), 29.87 KiB.
[24 Apr 2019 17:56]
Sandeep Dube
Complete description with trace, TXT format
Attachment: mysql_deadlock_trace_txt.txt (text/plain), 16.30 KiB.
[25 Apr 2019 13:03]
MySQL Verification Team
Hi, thank you for your bug report. I truly do not understand what the bug is that you are reporting. Deadlock detection is proof that the InnoDB transactional engine works great. Actually, your example shows more than that: InnoDB uses graph theory in order to connect as many concurrent transactions as possible. Hence, you are reporting that our server functions correctly. Let me know if I misunderstood anything.
[25 Apr 2019 21:49]
Jacek Cencek
Sinisa Milivojevic, I believe there might be some misunderstanding. Please double-check that you have viewed the correct bug report. There is no deadlock detected here by the MySQL server nor by any of its tooling; the results we have posted are from gdb and strace analysis.

Second, this is a Linux process/thread deadlock that is unrelated to MySQL table/record locking. InnoDB does not come into this equation at all, so I am not sure why it is being referenced.

Third, the MySQL server process is stuck in a perpetual deadlock that will remain unresolved regardless of how much time has gone by. There is nothing being waited on and no work being done by MySQL. All of MySQL's client threads are locked up indefinitely, resulting in all clients being rejected with the following message:

root@sevone:~ [5.7.2.12] [21:42:01] $ mysqldata
ERROR 1040 (HY000): Too many connections

In summary: we have encountered a situation where a MySQL server, under normal running circumstances, enters a state where it becomes indefinitely unresponsive and performs no work. No query is running or being waited for. A manual restart of the server process is the only remedy to this situation.

Furthermore, we have posted an analysis with source code references indicating that this situation is the result of a race condition between three threads. Please view Sandeep Dube's txt or pdf attachments for the details. Let me know if any further clarification is required. Regards,
[26 Apr 2019 0:57]
Jacek Cencek
Small typo: at the very bottom, the 3rd label states 'THREAD 9775', which should be 'THREAD 14000'. Attaching fixed version
Attachment: backtraces.txt (text/plain), 16.30 KiB.
[26 Apr 2019 6:53]
Dmitry Lenev
Hello!

So, to sum up, what happens in this case is the following:

THREAD 9775 is executing START SLAVE. In start_slave_cmd() it has acquired channel_map.wrlock() and then called start_slave_thread() to start the replication SQL thread. There it has acquired Relay_log_info::run_lock, called mysql_thread_create() (which created a new thread executing handle_slave_sql()) and then called mysql_cond_wait(), which unlocked Relay_log_info::run_lock. It has received the signal and is now trying to re-acquire Relay_log_info::run_lock before returning from mysql_cond_wait().

THREAD 16089 is the replication SQL thread, which is terminating its work. At the end of handle_slave_sql() it has acquired Relay_log_info::run_lock and has now called THD::release_resources(), which tries to acquire LOCK_status.

THREAD 14000 is executing something like SELECT * FROM INFORMATION_SCHEMA.STATUS_VARIABLES. It calls fill_status(), which acquires the LOCK_status mutex, and eventually calls show_heartbeat_period(), which tries to do channel_map.rdlock().

So THREAD 9775 waits for Relay_log_info::run_lock, which is owned by THREAD 16089, which waits for LOCK_status, which is owned by THREAD 14000, which waits for channel_map.rdlock(), which is write-locked by THREAD 9775. This looks like a classical deadlock situation to me.

I have inspected the code in 5.7.27-git to confirm that such a situation is possible, so I am verifying this bug "by code inspection".
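Read as lock-ordering pseudocode, the cycle is simply three threads each holding one lock and requesting the next lock in the ring. The following standalone sketch is not server code: the mutexes merely stand in for channel_map, Relay_log_info::run_lock and LOCK_status, and the re-acquisition inside mysql_cond_wait() is simplified to a plain lock request. Built with g++ -std=c++17 -pthread, it hangs forever by design; attaching gdb to the hung process shows three threads blocked in lock acquisitions owned by one another, mirroring the attached backtraces.

// Standalone illustration of the reported cycle; lock names are stand-ins only.
#include <atomic>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex channel_map_lock;  // stands in for the channel_map rwlock
std::mutex run_lock;                 // stands in for Relay_log_info::run_lock
std::mutex LOCK_status;              // stands in for the global LOCK_status mutex

std::atomic<int> ready{0};
void wait_for_all() {                // crude barrier: each thread holds its first
  ++ready;                           // lock before anyone requests a second one
  while (ready.load() < 3) std::this_thread::yield();
}

int main() {
  // "THREAD 9775": START SLAVE -- holds channel_map.wrlock(), wants run_lock.
  std::thread t1([] {
    std::unique_lock<std::shared_mutex> wr(channel_map_lock);
    wait_for_all();
    std::lock_guard<std::mutex> rl(run_lock);                  // blocks: held by t2
  });
  // "THREAD 16089": terminating SQL thread -- holds run_lock, wants LOCK_status.
  std::thread t2([] {
    std::lock_guard<std::mutex> rl(run_lock);
    wait_for_all();
    std::lock_guard<std::mutex> st(LOCK_status);               // blocks: held by t3
  });
  // "THREAD 14000": status query -- holds LOCK_status, wants channel_map.rdlock().
  std::thread t3([] {
    std::lock_guard<std::mutex> st(LOCK_status);
    wait_for_all();
    std::shared_lock<std::shared_mutex> rd(channel_map_lock);  // blocks: t1 has wrlock
  });
  t1.join(); t2.join(); t3.join();   // never reached: three-way deadlock
}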
[26 Apr 2019 17:09]
Robert Albrecht
I have been following this from the start. 1) Jacek has identified 40 lines of code that are vulnerable to this timing-related deadlock. It is triggered by errors in the replication log. 2) The severity of this issue needs to be raised above S3 because the result is that the MySQL server is completely inaccessible.
[7 May 2019 20:31]
Leo Lin
Hey folks, we ran into this issue recently at our firm as well. We are running Percona Server 5.7.24-27 on CentOS 6.9. We took a core dump and found the following three-way deadlock:

Thread 55 <github orchestrator running "show variables like 'maxscale%'">
  owns &LOCK_global_system_variables
  waits PolyLock_lock_log::rdlock (this is actually the &LOCK_log according to binlog.h)

Thread 60 <mysql collectd plugin running 'SHOW BINARY LOGS'>
  owns &LOCK_log
  waits &LOCK_index

Thread 94 <our backup_binlogs script running "FLUSH BINARY LOGS">
  owns &LOCK_index
  waits &thd->LOCK_thd_data

From one of the threads:
---
#6 Adjust_offset::operator() (this=0x7f47cad38f20, thd=0x7f4754000ae0) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:2908
#7 0x0000000000d1f029 in operator() (this=0x359cd00, func=0x7f47cad38f20) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/mysqld_thd_manager.cc:46
#8 for_each<THD**, Do_THD> (this=0x359cd00, func=0x7f47cad38f20) at /opt/percona-devtoolset/root/usr/include/c++/4.8.2/bits/stl_algo.h:4417
#9 Global_THD_manager::do_for_all_thd (this=0x359cd00, func=0x7f47cad38f20) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/mysqld_thd_manager.cc:273
#10 0x0000000000e6e3de in adjust_linfo_offsets (this=0x1dca5c0, log_info=0x7f47cad38fc0, need_update_threads=) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:2950
#11 MYSQL_BIN_LOG::remove_logs_from_index (this=0x1dca5c0, log_info=0x7f47cad38fc0, need_update_threads=) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:6444
#12 0x0000000000e77ab2 in MYSQL_BIN_LOG::purge_logs (this=0x1dca5c0, to_log=0x7f47cad39380 "/vt/tablet/bin-logs/vt-0388620601-bin.063604", included=) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:6561
#13 0x0000000000e78ca8 in MYSQL_BIN_LOG::purge_logs_before_date (this=0x1dca5c0, purge_time=) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:7257
#14 0x0000000000e78eba in MYSQL_BIN_LOG::purge (this=0x1dca5c0) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:8060
#15 0x0000000000e78f78 in MYSQL_BIN_LOG::rotate_and_purge (this=0x1dca5c0, thd=) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/binlog.cc:8114
#16 0x0000000000c75caf in reload_acl_and_cache (thd=0x7f46b80123e0, options=1024, tables=0x0, write_to_binlog=0x7f47cad39d20) at /usr/src/debug/percona-server-5.7.24-27/percona-server-5.7.24-27/sql/sql_reload.cc:162
---
print (*(THD *)0x7f4754000ae0).LOCK_thd_data
{m_mutex = {__data = {__lock = 2, __count = 0, __owner = 1656, __nusers = 1, __kind = 3, __spins = 76, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000x\006\000\000\001\000\000\000\003\000\000\000L", '\000' <repeats 18 times>, __align = 2}, m_psi = 0x7f4893b2b4c0}
---
LWP 1656 is Thread 55
[7 May 2019 20:33]
Leo Lin
thread dump of the same deadlock we've found
Attachment: thread_dump.txt (text/plain), 690.42 KiB.
[8 May 2019 12:50]
MySQL Verification Team
Thank you for the additional feedback. Your info will be copied to our internal bugs database.
[7 Jun 2019 21:17]
Leo Lin
We've also found this in the 5.7.24 branch, not sure if this is related. https://github.com/mysql/mysql-server/blob/d2029238d6d9f648077664e4cdd611e231a6dc14/sql/sy... PolyLock_lock_log::rdlock and PolyLock_lock_log::wrlock both acquire a write lock when the interface promises a read lock.
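Since the lock behind both methods is the binary log's LOCK_log mutex (as noted in the core-dump analysis above, the rdlock "is actually the &LOCK_log"), the "read" lock is just as exclusive as the write lock, so readers of these variables serialize behind binlog writers. A self-contained sketch of that pattern, with illustrative names rather than the server's, is:

// Sketch of an rdlock()/wrlock() facade backed by a single plain mutex: the
// interface promises shared access, but "readers" exclude each other exactly
// like writers do. Names are illustrative only, not the server's.
#include <mutex>

class MutexBackedPolyLock {
 public:
  void rdlock() { m_.lock(); }   // promises a read lock, delivers exclusivity
  void wrlock() { m_.lock(); }
  void unlock() { m_.unlock(); }
 private:
  std::mutex m_;
};

int main() {
  MutexBackedPolyLock lk;
  lk.rdlock();   // a second "reader" would now block, unlike with a real rwlock
  lk.unlock();
}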
[10 Jun 2019 12:41]
MySQL Verification Team
Hi, thank you for that link too. To me it looks related; hence, developers will be notified about it.
[24 Jun 2019 12:57]
Rodrigo Alonso
I am having very similar issues with MySQL 5.7.26-29 (Percona Server). I was wondering if someone could confirm this. I am getting this error message:

ERROR 1040 (HY000): Too many connections

and only killing the process fixes the issue. I can see hundreds of connections / stuck threads trying to run SHOW GLOBAL STATUS, in the 'statistics' state. The stack trace I can see from the core file is similar:

#0 0x00007fb144d9842d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb144d93e01 in _L_lock_1022 () from /lib64/libpthread.so.0
#2 0x00007fb144d93da2 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000c291c2 in native_mutex_lock (mutex=<optimized out>) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/include/thr_mutex.h:84
#4 my_mutex_lock (mp=<optimized out>) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/include/thr_mutex.h:182
#5 inline_mysql_mutex_lock (src_file=0x137e4a8 "/mnt/workspace/percona-server-5.7-redhat-binary-rocks-new/label_exp/min-centos-7-x64/test/rpmbuild/BUILD/percona-server-5.7.26-29/percona-server-5.7.26-29/sql/sql_class.cc", src_line=2058, that=<optimized out>) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/include/mysql/psi/mysql_thread.h:715
#6 THD::release_resources (this=this@entry=0x7faf381b99e0) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/sql/sql_class.cc:2058
#7 0x0000000000d352de in handle_connection (arg=arg@entry=0x302d490) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/sql/conn_handler/connection_handler_per_thread.cc:326
#8 0x0000000000f2c6b4 in pfs_spawn_thread (arg=0x3118070) at /usr/src/debug/percona-server-5.7.26-29/percona-server-5.7.26-29/storage/perfschema/pfs.cc:2190
#9 0x00007fb144d91e25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fb142f6534d in clone () from /lib64/libc.so.6

I will attach the core file. There is no replication involved in my case. Thanks, Rodrigo.
[24 Jun 2019 13:20]
Rodrigo Alonso
I just uploaded to the sftp the following file: core.115145.gz
[25 Jun 2019 13:16]
Satya Bodapati
Just one comment (more from an improvement angle), to avoid these kinds of things in the future: how about using the InnoDB latch-order mechanism? Define latch levels for mutexes and rw-locks and add checks to detect violations. For InnoDB, there is --innodb-sync-debug=1. I know it might be a bit complicated (because of engines, etc.), but it should be possible for all Runtime + optimizer + replication defined latches.
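For illustration only (this is not InnoDB's sync-debug implementation), such a latch-level check can be as small as assigning each lock a numeric level and asserting, in debug builds, that a thread only acquires locks in strictly decreasing level order. With the hypothetical levels below, a thread that already holds LOCK_status and then asks for channel_map would trip the assertion long before a real three-way deadlock materializes.

// Illustrative latch-level checker (not InnoDB's sync-debug code). Each lock
// carries a level; acquiring a lock whose level is not strictly lower than
// every level already held by the thread asserts in debug builds.
#include <cassert>
#include <mutex>
#include <vector>

enum LatchLevel {                 // hypothetical levels for the locks in this report
  LATCH_LOCK_STATUS = 10,
  LATCH_RUN_LOCK    = 20,
  LATCH_CHANNEL_MAP = 30
};

thread_local std::vector<int> held_levels;  // levels currently held by this thread

class OrderedMutex {
 public:
  explicit OrderedMutex(int level) : level_(level) {}
  void lock() {
    for (int held : held_levels)
      assert(level_ < held && "latch order violation: potential deadlock");
    m_.lock();
    held_levels.push_back(level_);
  }
  void unlock() {                 // assumes LIFO unlock order for simplicity
    held_levels.pop_back();
    m_.unlock();
  }
 private:
  int level_;
  std::mutex m_;
};

int main() {
  OrderedMutex lock_status(LATCH_LOCK_STATUS), channel_map(LATCH_CHANNEL_MAP);
  lock_status.lock();
  channel_map.lock();             // asserts: level 30 is not lower than the held 10
  channel_map.unlock();
  lock_status.unlock();
}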
[25 Jun 2019 13:42]
MySQL Verification Team
Satya, My dear friend, thank you for your contribution ....
[26 Jun 2019 12:36]
Rodrigo Alonso
I've compiled percona-server-5.7.26-29.tar.gz with WITH_DEBUG turned on. I cannot reproduce the issue.
[26 Jun 2019 12:41]
MySQL Verification Team
Actually, the last comment reported is not related to this bug at all.
[4 Jul 2019 5:58]
Satya Bodapati
Looks like my wish comes true (8.0): https://dev.mysql.com/worklog/task/?id=3262
[4 Jul 2019 13:07]
MySQL Verification Team
Satya, my friend, You could just be right on this one !!!
[18 Jul 2019 13:18]
Margaret Fisher
Posted by developer: Changelog entry added for MySQL 8.0.18 and 5.7.28: A deadlock involving three threads could occur if a START SLAVE statement was issued to start the SQL thread on a slave while the SQL thread was still in the process of being stopped, and a request for the slave status was made at the same time. The issue has now been fixed by releasing a lock earlier while the SQL thread is being stopped.
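The changelog entry does not spell out which lock is released earlier. One way to remove an edge of the cycle analysed above, consistent with "releasing a lock earlier", is to drop Relay_log_info::run_lock before the stopping SQL thread goes on to acquire LOCK_status. The sketch below shows that reordering with stand-in mutexes; it is only a schematic of the idea, not the actual patch.

// Stand-in mutexes for the locks in the reported cycle; this is a schematic of
// the reordering idea, not the server's fix.
#include <mutex>

std::mutex run_lock;     // plays the role of Relay_log_info::run_lock
std::mutex LOCK_status;  // plays the role of the global LOCK_status mutex

// Failing shape: LOCK_status is requested while run_lock is still held, which
// creates the run_lock -> LOCK_status edge of the deadlock cycle.
void shutdown_nested() {
  std::lock_guard<std::mutex> rl(run_lock);
  // ... mark the SQL thread as stopping ...
  std::lock_guard<std::mutex> st(LOCK_status);
  // ... drop per-thread status and resources ...
}

// Reordered shape: run_lock is released before LOCK_status is taken, so a
// waiter on run_lock can no longer be stuck behind the LOCK_status holder.
void shutdown_released_early() {
  {
    std::lock_guard<std::mutex> rl(run_lock);
    // ... mark the SQL thread as stopping ...
  }
  std::lock_guard<std::mutex> st(LOCK_status);
  // ... drop per-thread status and resources ...
}

int main() { shutdown_released_early(); }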