Your file was attached to the bug successfully.
| Bug #112277 | MySQL block forever | ||
|---|---|---|---|
| Submitted: | 6 Sep 2023 15:10 | Modified: | 8 Sep 2023 3:11 |
| Reporter: | zetang zeng (OCA) | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Server: Group Replication | Severity: | S1 (Critical) |
| Version: | 5.7.43 | OS: | Any |
| Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[6 Sep 2023 15:10]
zetang zeng
[7 Sep 2023 1:25]
zetang zeng
all threads stack
Attachment: stacktrace.txt.zip (application/zip, text), 454.17 KiB.
[7 Sep 2023 1:58]
zetang zeng
After reading some source code, I think it may be some cases, we forget to release ticket.
certification_latch->releaseTicket(thread_id);
[7 Sep 2023 2:55]
zetang zeng
(gdb) print *certification_latch
$2 = {_vptr.Wait_ticket = 0x7fa0c5e2fe30 <vtable for Wait_ticket<unsigned int>+16>, lock = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 3, __spins = 0, __list = {
__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\003", '\000' <repeats 22 times>, __align = 0}, m_psi = 0x7fa01db905c0}, cond = {m_cond = {__data = {__lock = 0, __futex = 0,
__total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, m_psi = 0x42504c0},
Python Exception <type 'exceptions.ValueError'> Cannot find type std::map<unsigned int, CountDownLatch*, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, CountDownLatch*> > >::_Rep_type:
map = std::map with 178 elements, blocked = false, waiting = false}
[7 Sep 2023 3:48]
zetang zeng
the certification_latch->map content for one process, the count is totally wrong:
elem[4].left: $17 = 21731
elem[4].right: $18 = {
_vptr.CountDownLatch = 0x1414d71000007fa0,
lock = {
m_mutex = {
__data = {
__lock = 32671,
__count = 0,
__owner = 512,
__nusers = 0,
__kind = 65536,
__spins = 80,
__list = {
__prev = 0x1b500000000,
__next = 0x64f077a800000000
}
},
__size = "\237\177\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\000\001\000P\000\000\000\000\000\000\000\265\001\000\000\000\000\000\000\250w\360d",
__align = 32671
},
m_psi = 0x8900b134de24
},
cond = {
m_cond = {
__data = {
__lock = 35072,
__futex = 604012544,
__total_seq = 70368744199395,
__wakeup_seq = 0,
__woken_seq = 4049354197659235840,
__mutex = 0x312d326563332d64,
__nwaiters = 761619761,
__broadcast_seq = 959931489
},
__size = "\000\211\000\000\000\200\000$\343T\000\000\000@", '\000' <repeats 11 times>, "\066\067\071\060\062\062\070d-3ce2-11ee-ad79",
__align = 2594214122853796096
},
m_psi = 0x306533363130302d
},
count = 808991033
[7 Sep 2023 3:49]
zetang zeng
But even it count down to zero, this process is still blocked.
elem[12].left: $33 = 22017
elem[12].right: $34 = {
_vptr.CountDownLatch = 0x48a4cad000007fa0,
lock = {
m_mutex = {
__data = {
__lock = 32672,
__count = 64,
__owner = 0,
__nusers = 69,
__kind = 0,
__spins = 0,
__list = {
__prev = 0xd000004d2,
__next = 0x0
}
},
__size = "\240\177\000\000@\000\000\000\000\000\000\000E", '\000' <repeats 11 times>, "\322\004\000\000\r\000\000\000\000\000\000\000\000\000\000",
---Type <return> to continue, or q <return> to quit---
__align = 274877939616
},
m_psi = 0x0
},
cond = {
m_cond = {
__data = {
__lock = 0,
__futex = 810632026,
__total_seq = 4426847640188039011,
__wakeup_seq = 140600049401856,
__woken_seq = 227633266688,
__mutex = 0x0,
__nwaiters = 1234,
__broadcast_seq = 8
},
__size = "\000\000\000\000ZCQ0c//lyRo=\000\000\000\000\340\177\000\000\000\000\000\000\065", '\000' <repeats 11 times>, "\322\004\000\000\b\000\000",
__align = 3481638040760221696
},
m_psi = 0x0
},
count = 0
}
[7 Sep 2023 3:54]
zetang zeng
certification_latch->map content
Attachment: 1694058685253-7663-countDown.txt (text/plain), 259.52 KiB.
[7 Sep 2023 14:30]
MySQL Verification Team
Hi, Thanks for your analysis but I'm having issues reproducing this and arriving to same situation you have. Is there a way you can reproduce this? thanks
[8 Sep 2023 2:44]
zetang zeng
Last time we produced this scenario by making primary nodes network dropping 90% packet randomly (by linux cmd `tc`) for 1 hour and later it blocked all writes. LSB Version: :core-4.1-amd64:core-4.1-noarch cat /etc/os-release NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"
[8 Sep 2023 3:11]
zetang zeng
I am wrong, not only by `tc`, but with `iptables`
