Bug #112277 | MySQL block forever | ||
---|---|---|---|
Submitted: | 6 Sep 2023 15:10 | Modified: | 8 Sep 2023 3:11 |
Reporter: | zetang zeng (OCA) | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S1 (Critical) |
Version: | 5.7.43 | OS: | Any |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[6 Sep 2023 15:10]
zetang zeng
[7 Sep 2023 1:25]
zetang zeng
all threads stack
Attachment: stacktrace.txt.zip (application/zip, text), 454.17 KiB.
[7 Sep 2023 1:58]
zetang zeng
After reading some source code, I think it may be some cases, we forget to release ticket. certification_latch->releaseTicket(thread_id);
[7 Sep 2023 2:55]
zetang zeng
(gdb) print *certification_latch $2 = {_vptr.Wait_ticket = 0x7fa0c5e2fe30 <vtable for Wait_ticket<unsigned int>+16>, lock = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 3, __spins = 0, __list = { __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\003", '\000' <repeats 22 times>, __align = 0}, m_psi = 0x7fa01db905c0}, cond = {m_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, m_psi = 0x42504c0}, Python Exception <type 'exceptions.ValueError'> Cannot find type std::map<unsigned int, CountDownLatch*, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, CountDownLatch*> > >::_Rep_type: map = std::map with 178 elements, blocked = false, waiting = false}
[7 Sep 2023 3:48]
zetang zeng
the certification_latch->map content for one process, the count is totally wrong: elem[4].left: $17 = 21731 elem[4].right: $18 = { _vptr.CountDownLatch = 0x1414d71000007fa0, lock = { m_mutex = { __data = { __lock = 32671, __count = 0, __owner = 512, __nusers = 0, __kind = 65536, __spins = 80, __list = { __prev = 0x1b500000000, __next = 0x64f077a800000000 } }, __size = "\237\177\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\001\000P\000\000\000\000\000\000\000\265\001\000\000\000\000\000\000\250w\360d", __align = 32671 }, m_psi = 0x8900b134de24 }, cond = { m_cond = { __data = { __lock = 35072, __futex = 604012544, __total_seq = 70368744199395, __wakeup_seq = 0, __woken_seq = 4049354197659235840, __mutex = 0x312d326563332d64, __nwaiters = 761619761, __broadcast_seq = 959931489 }, __size = "\000\211\000\000\000\200\000$\343T\000\000\000@", '\000' <repeats 11 times>, "\066\067\071\060\062\062\070d-3ce2-11ee-ad79", __align = 2594214122853796096 }, m_psi = 0x306533363130302d }, count = 808991033
[7 Sep 2023 3:49]
zetang zeng
But even it count down to zero, this process is still blocked. elem[12].left: $33 = 22017 elem[12].right: $34 = { _vptr.CountDownLatch = 0x48a4cad000007fa0, lock = { m_mutex = { __data = { __lock = 32672, __count = 64, __owner = 0, __nusers = 69, __kind = 0, __spins = 0, __list = { __prev = 0xd000004d2, __next = 0x0 } }, __size = "\240\177\000\000@\000\000\000\000\000\000\000E", '\000' <repeats 11 times>, "\322\004\000\000\r\000\000\000\000\000\000\000\000\000\000", ---Type <return> to continue, or q <return> to quit--- __align = 274877939616 }, m_psi = 0x0 }, cond = { m_cond = { __data = { __lock = 0, __futex = 810632026, __total_seq = 4426847640188039011, __wakeup_seq = 140600049401856, __woken_seq = 227633266688, __mutex = 0x0, __nwaiters = 1234, __broadcast_seq = 8 }, __size = "\000\000\000\000ZCQ0c//lyRo=\000\000\000\000\340\177\000\000\000\000\000\000\065", '\000' <repeats 11 times>, "\322\004\000\000\b\000\000", __align = 3481638040760221696 }, m_psi = 0x0 }, count = 0 }
[7 Sep 2023 3:54]
zetang zeng
certification_latch->map content
Attachment: 1694058685253-7663-countDown.txt (text/plain), 259.52 KiB.
[7 Sep 2023 14:30]
MySQL Verification Team
Hi, Thanks for your analysis but I'm having issues reproducing this and arriving to same situation you have. Is there a way you can reproduce this? thanks
[8 Sep 2023 2:44]
zetang zeng
Last time we produced this scenario by making primary nodes network dropping 90% packet randomly (by linux cmd `tc`) for 1 hour and later it blocked all writes. LSB Version: :core-4.1-amd64:core-4.1-noarch cat /etc/os-release NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
[8 Sep 2023 3:11]
zetang zeng
I am wrong, not only by `tc`, but with `iptables`