Bug #112277 MySQL block forever
Submitted: 6 Sep 2023 15:10 Modified: 8 Sep 2023 3:11
Reporter: zetang zeng (OCA) Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S1 (Critical)
Version:5.7.43 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[6 Sep 2023 15:10] zetang zeng
Description:
After some failures of three nodes mysql cluster, one nodes block all writes forever. 

Some processlist:

21708	streambe_w	192.168.1.161:32962	streambe	Query	523146	starting	commit
21716	minibase_w	192.168.1.161:33122	minibase	Query	523890	starting	commit
21723	monitor_w	192.168.1.161:33142	monitor	Query	523859	query end	UPDATE /* psm=toutiao.unknown.unknown,ip=10.42.3.81,pid=14 */ `alert_frequency` SET `next_time` = 16
21728	monitor_w	192.168.1.166:47470	monitor	Query	523856	starting	commit
21731	cluster	192.168.1.234:41648	NULL	Query	523907	System lock	CREATE USER IF NOT EXISTS 'router_192.168.1.234'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*30

All these threads block with following stack:

Thread 3155 (Thread 0x7fa104113700 (LWP 233677)):
#0  0x00007fa1ebd24a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fa0c58a97ae in native_cond_wait (mutex=0x7f9ff85e9928, cond=0x7f9ff85e9958) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/include/thr_cond.h:147
#2  my_cond_wait (mp=0x7f9ff85e9928, cond=0x7f9ff85e9958) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/include/thr_cond.h:202
#3  inline_mysql_cond_wait (src_line=201, src_file=0x7fa0c5b07270 "/var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/rapid/plugin/group_replication/include/plugin_utils.h", mutex=0x7f9ff85e9928, that=0x7f9ff85e9958) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/include/mysql/psi/mysql_thread.h:1191
#4  wait (this=0x7f9ff85e9920) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/rapid/plugin/group_replication/include/plugin_utils.h:201
#5  Wait_ticket<unsigned int>::waitTicket (this=0x7f9fc0005b20, key=<optimized out>) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/rapid/plugin/group_replication/include/plugin_utils.h:341
#6  0x00007fa0c58aa5d2 in group_replication_trans_before_commit (param=0x7fa104110330) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/rapid/plugin/group_replication/src/observer_trans.cc:630
#7  0x0000000000cd4ec6 in Trans_delegate::before_commit (this=0x218f240 <delegates_init()::trans_mem>, thd=<optimized out>, all=<optimized out>, trx_cache_log=<optimized out>, stmt_cache_log=<optimized out>, cache_log_max_size=<optimized out>) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/rpl_handler.cc:657
#8  0x0000000000f4d9ed in MYSQL_BIN_LOG::commit (this=0x21b35c0 <mysql_bin_log>, thd=0x7f9ff82144f0, all=true) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/binlog.cc:8836
#9  0x000000000087c6e8 in ha_commit_trans (thd=0x7f9ff82144f0, all=true, ignore_global_read_lock=false) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/handler.cc:1807
#10 0x0000000000e25d09 in trans_commit (thd=0x7f9ff82144f0) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/transaction.cc:246
#11 0x0000000000d6b0d9 in mysql_execute_command (thd=0x7f9ff82144f0, first_level=true) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/sql_parse.cc:4244
#12 0x0000000000d6e12d in mysql_parse (thd=0x7f9ff82144f0, parser_state=<optimized out>) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/sql_parse.cc:5584
#13 0x0000000000d6f968 in dispatch_command (thd=0x7f9ff82144f0, com_data=0x7fa104112da0, command=COM_QUERY) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/sql_parse.cc:1492
#14 0x0000000000d70514 in do_command (thd=0x7f9ff82144f0) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/sql_parse.cc:1031
#15 0x0000000000e43a7c in handle_connection (arg=<optimized out>) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/sql/conn_handler/connection_handler_per_thread.cc:313
#16 0x000000000123a504 in pfs_spawn_thread (arg=0x9a92e20) at /var/lib/pb2/sb_1-11862956-1687353311.91/mysql-5.7.43/storage/perfschema/pfs.cc:2197
#17 0x00007fa1ebd20ea5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007fa1ea7d9b0d in clone () from /lib64/libc.so.6

I think it is a bug which mysql wait not-exist group member certification.

How to repeat:
Not cleared yet.

Suggested fix:
this wait process should have a timeout?
[7 Sep 2023 1:25] zetang zeng
all threads stack

Attachment: stacktrace.txt.zip (application/zip, text), 454.17 KiB.

[7 Sep 2023 1:58] zetang zeng
After reading some source code, I think it may be some cases, we forget to release ticket.

          certification_latch->releaseTicket(thread_id);
[7 Sep 2023 2:55] zetang zeng
(gdb) print *certification_latch
$2 = {_vptr.Wait_ticket = 0x7fa0c5e2fe30 <vtable for Wait_ticket<unsigned int>+16>, lock = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 3, __spins = 0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\003", '\000' <repeats 22 times>, __align = 0}, m_psi = 0x7fa01db905c0}, cond = {m_cond = {__data = {__lock = 0, __futex = 0, 
        __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, m_psi = 0x42504c0}, 
Python Exception <type 'exceptions.ValueError'> Cannot find type std::map<unsigned int, CountDownLatch*, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, CountDownLatch*> > >::_Rep_type: 
  map = std::map with 178 elements, blocked = false, waiting = false}
[7 Sep 2023 3:48] zetang zeng
the certification_latch->map content for one process, the count is totally wrong:

elem[4].left: $17 = 21731
elem[4].right: $18 = {
  _vptr.CountDownLatch = 0x1414d71000007fa0, 
  lock = {
    m_mutex = {
      __data = {
        __lock = 32671, 
        __count = 0, 
        __owner = 512, 
        __nusers = 0, 
        __kind = 65536, 
        __spins = 80, 
        __list = {
          __prev = 0x1b500000000, 
          __next = 0x64f077a800000000
        }
      }, 
      __size = "\237\177\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\001\000P\000\000\000\000\000\000\000\265\001\000\000\000\000\000\000\250w\360d", 
      __align = 32671
    }, 
    m_psi = 0x8900b134de24
  }, 
  cond = {
    m_cond = {
      __data = {
        __lock = 35072, 
        __futex = 604012544, 
        __total_seq = 70368744199395, 
        __wakeup_seq = 0, 
        __woken_seq = 4049354197659235840, 
        __mutex = 0x312d326563332d64, 
        __nwaiters = 761619761, 
        __broadcast_seq = 959931489
      }, 
      __size = "\000\211\000\000\000\200\000$\343T\000\000\000@", '\000' <repeats 11 times>, "\066\067\071\060\062\062\070d-3ce2-11ee-ad79", 
      __align = 2594214122853796096
    }, 
    m_psi = 0x306533363130302d
  }, 
  count = 808991033
[7 Sep 2023 3:49] zetang zeng
But even it count down to zero, this process is still blocked.

elem[12].left: $33 = 22017
elem[12].right: $34 = {
  _vptr.CountDownLatch = 0x48a4cad000007fa0, 
  lock = {
    m_mutex = {
      __data = {
        __lock = 32672, 
        __count = 64, 
        __owner = 0, 
        __nusers = 69, 
        __kind = 0, 
        __spins = 0, 
        __list = {
          __prev = 0xd000004d2, 
          __next = 0x0
        }
      }, 
      __size = "\240\177\000\000@\000\000\000\000\000\000\000E", '\000' <repeats 11 times>, "\322\004\000\000\r\000\000\000\000\000\000\000\000\000\000", 
---Type <return> to continue, or q <return> to quit---
      __align = 274877939616
    }, 
    m_psi = 0x0
  }, 
  cond = {
    m_cond = {
      __data = {
        __lock = 0, 
        __futex = 810632026, 
        __total_seq = 4426847640188039011, 
        __wakeup_seq = 140600049401856, 
        __woken_seq = 227633266688, 
        __mutex = 0x0, 
        __nwaiters = 1234, 
        __broadcast_seq = 8
      }, 
      __size = "\000\000\000\000ZCQ0c//lyRo=\000\000\000\000\340\177\000\000\000\000\000\000\065", '\000' <repeats 11 times>, "\322\004\000\000\b\000\000", 
      __align = 3481638040760221696
    }, 
    m_psi = 0x0
  }, 
  count = 0
}
[7 Sep 2023 3:54] zetang zeng
certification_latch->map content

Attachment: 1694058685253-7663-countDown.txt (text/plain), 259.52 KiB.

[7 Sep 2023 14:30] MySQL Verification Team
Hi,
Thanks for your analysis but I'm having issues reproducing this and arriving to same situation you have. Is there a way you can reproduce this?

thanks
[8 Sep 2023 2:44] zetang zeng
Last time we produced this scenario by making primary nodes network dropping 90% packet randomly (by linux cmd `tc`) for 1 hour and later it blocked all writes.
 

LSB Version:    :core-4.1-amd64:core-4.1-noarch

cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[8 Sep 2023 3:11] zetang zeng
I am wrong, not only by `tc`, but with `iptables`