Bug #90083: mysqld got signal 11, _ZN16NdbScanOperation18receiver_completedEP11NdbReceiver

| Submitted: | 15 Mar 2018 7:22 | Modified: | 18 Apr 2018 5:57 |
|---|---|---|---|
| Reporter: | Hendrik Woltersdorf | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
| Version: | 7.4.19 | OS: | CentOS (6.4) |
| Assigned to: | MySQL Verification Team | CPU Architecture: | x86 (x86_64) |
[15 Mar 2018 7:22]
Hendrik Woltersdorf
[15 Mar 2018 11:11]
MySQL Verification Team
Hi,

Please collect the logs using ndb_error_reporter and upload them; from what you have provided so far we cannot conclude anything.

thanks in advance
Bogdan
[15 Mar 2018 11:30]
Hendrik Woltersdorf
the ndb_error_reporter tarball as requested
Attachment: ndb_error_report_20180315121741.tar.bz2 (application/octet-stream, text), 948.76 KiB.
[15 Mar 2018 11:38]
MySQL Verification Team
Thanks for the logs. While I examine the cluster logs, please also upload the full mysql log from the node that crashed. (Unfortunately, ndb_error_reporter does not collect logs from SQL nodes; I apologize for forgetting to ask you for those in my previous note.)

all best
Bogdan
[15 Mar 2018 11:44]
MySQL Verification Team
So node 21 (10.40.55.131) is the one that crashed. Unfortunately, the cluster logs do not show anything useful. Let's see if we can find anything in the mysqld logs.

all best
Bogdan

2018-03-14 18:26:44 [MgmtSrvr] INFO -- Node 10: Local checkpoint 1280166 completed
2018-03-14 19:01:24 [MgmtSrvr] INFO -- Node 10: Local checkpoint 1280167 started. Keep GCI = 138823143 oldest restorable GCI = 138823193
2018-03-14 19:01:59 [MgmtSrvr] ALERT -- Node 11: Node 21 Disconnected
2018-03-14 19:01:59 [MgmtSrvr] INFO -- Node 11: Communication to Node 21 closed
2018-03-14 19:01:59 [MgmtSrvr] INFO -- Node 10: Communication to Node 21 closed
2018-03-14 19:02:00 [MgmtSrvr] ALERT -- Node 10: Node 21 Disconnected
2018-03-14 19:02:01 [MgmtSrvr] INFO -- Node 11: Node 22 Connected
2018-03-14 19:02:01 [MgmtSrvr] INFO -- Node 10: Node 22 Connected
2018-03-14 19:02:01 [MgmtSrvr] INFO -- Node 10: Node 22: API mysql-5.6.39 ndb-7.4.19
2018-03-14 19:02:01 [MgmtSrvr] INFO -- Node 11: Node 22: API mysql-5.6.39 ndb-7.4.19
2018-03-14 19:02:03 [MgmtSrvr] INFO -- Node 11: Communication to Node 21 opened
2018-03-14 19:02:03 [MgmtSrvr] INFO -- Node 10: Communication to Node 21 opened
2018-03-14 19:02:36 [MgmtSrvr] INFO -- Node 10: LDM(0): Completed LCP, #frags = 250 #records = 7866390, #bytes = 1055204136
2018-03-14 19:02:36 [MgmtSrvr] INFO -- Node 11: LDM(0): Completed LCP, #frags = 250 #records = 7866390, #bytes = 1055204136
2018-03-14 19:02:36 [MgmtSrvr] INFO -- Node 10: Local checkpoint 1280167 completed
2018-03-14 19:02:58 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=2040B(82%) alloc=2480B(0%) max=0B apply_epoch=138825334/2 latest_epoch=138825334/5
2018-03-14 19:02:58 [MgmtSrvr] INFO -- Node 22: Event buffer status: used=2492B(100%) alloc=2492B(0%) max=0B apply_epoch=138825334/3 latest_epoch=138825334/6
[15 Mar 2018 11:46]
Hendrik Woltersdorf
the complete log file of the crashed sql node
Attachment: eti-dus-ha-cl2.20180314.zip (application/x-zip-compressed, text), 1.18 MiB.
[15 Mar 2018 11:48]
Hendrik Woltersdorf
Note that this is a small test system, where we test (among other things) new versions of MySQL Cluster. The crash happened after a downtime of all applications, when we started every (micro)service that accesses the cluster. So it is hard to tell which statement caused this.
[15 Mar 2018 11:55]
MySQL Verification Team
Thanks for the log, let's see if we can figure out anything from it. It's certainly better that it's a test setup than a production one :D.

So, if I remember correctly, what you did was this: you had an "idle cluster" (it seems so from the cluster log), and then you started the application(s) and put some load on it. From the cluster log the cluster behaved properly, but let's see why the SQL node died.

all best
Bogdan
[15 Mar 2018 12:02]
Hendrik Woltersdorf
I went through all the logs of all services that were started at this time. The earliest statement that suffered from the crash is:

delete from JOB_EXEC
 where ID_JOB_CONTROL = 23
   and DATE_START < '2018-03-14 19:01:00.0'
   and COUNT_START = COUNT_END
   and ID_JOB_EXEC < 32118149

The table:

CREATE TABLE `JOB_EXEC` (
  `ID_JOB_EXEC` int(11) NOT NULL AUTO_INCREMENT,
  `ID_JOB_CONTROL` int(11) NOT NULL,
  `DATE_START` datetime DEFAULT NULL,
  `COUNT_START` int(11) DEFAULT NULL,
  `COUNT_END` int(11) DEFAULT NULL,
  `DATE_END` datetime DEFAULT NULL,
  PRIMARY KEY (`ID_JOB_EXEC`),
  KEY `REFERENCE_12_FK` (`ID_JOB_CONTROL`),
  KEY `IDX_START` (`DATE_START`),
  KEY `IDX_JC_DT` (`ID_JOB_CONTROL`,`DATE_START`)
) ENGINE=ndbcluster AUTO_INCREMENT=32130198 DEFAULT CHARSET=latin1;

But, as already said, I don't know whether this is the statement that caused the crash. Such statements run about 20 times every minute.
[17 Apr 2018 15:15]
MySQL Verification Team
Hi,

I went through and through and through this, but I can't reproduce it, nor do I see anything in the logs that would help me reproduce it. Did you by any chance experience this issue again on your test platform?

thanks
Bogdan
[18 Apr 2018 5:57]
Hendrik Woltersdorf
It happened a few times again. Last time 2018-04-04 15:20:31:

13:20:30 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=268435456
read_buffer_size=262144
max_used_connections=260
max_threads=2000
thread_count=196
connection_count=195
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1823737 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fd0a80009b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fd0c54b2e00 thread_stack 0x40000
/opt/mysql/bin/mysqld(my_print_stacktrace+0x35)[0x953a15]
/opt/mysql/bin/mysqld(handle_fatal_signal+0x494)[0x6e0034]
/lib64/libpthread.so.0[0x3c6180f500]
/opt/mysql/bin/mysqld(_ZN16NdbScanOperation18receiver_deliveredEP11NdbReceiver+0x2a)[0xa41e5a]
/opt/mysql/bin/mysqld(_ZN14NdbTransaction19receiveSCAN_TABCONFEPK12NdbApiSignalPKjj+0x16a)[0xa4bcaa]
/opt/mysql/bin/mysqld(_ZN7NdbImpl18trp_deliver_signalEPK12NdbApiSignalPK16LinearSectionPtr+0xc27)[0x9e9657]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade14deliver_signalEP12SignalHeaderhPjP16LinearSectionPtr+0xda)[0x9f785a]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry6unpackER24TransporterReceiveHandlePjjt7IOStateRb+0x6c1)[0xa6f411]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry14performReceiveER24TransporterReceiveHandle+0x2e2)[0xa6bf72]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade7do_pollEP10trp_clientjbb+0x56)[0x9f6cf6]
/opt/mysql/bin/mysqld(_ZN9PollGuard22wait_for_input_in_loopEib+0x6f)[0xa2ccff]
/opt/mysql/bin/mysqld(_ZN9PollGuard13wait_n_unlockEijjb+0x2a)[0xa2cf0a]
/opt/mysql/bin/mysqld(_ZN3Ndb11pollEvents2EiPy+0x75)[0xa4ca75]
/opt/mysql/bin/mysqld(_ZN3Ndb10pollEventsEiPy+0x6d)[0xa4cb5d]
/opt/mysql/bin/mysqld(_ZN17Ndb_binlog_thread6do_runEv+0x1afa)[0x9cb73a]
/opt/mysql/bin/mysqld(_ZN13Ndb_component8run_implEv+0x78)[0x9e0a78]
/opt/mysql/bin/mysqld(Ndb_component_run_C+0x19)[0x9e0bf9]
/lib64/libpthread.so.0[0x3c61807851]
/lib64/libc.so.6(clone+0x6d)[0x3c614e890d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

After this incident I stopped and restarted the mysqld process manually. Since then it has been running ...
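As a side note, the "could use up to ... K bytes of memory" line in the crash report is just the arithmetic of the formula it prints. A minimal sketch of that calculation, using the values from the report above; note that sort_buffer_size is not printed in the report, so the value below is only an assumed placeholder, which is why the result does not match the reported 1823737 K:

```python
# Sketch of the memory estimate printed in a mysqld crash report:
#   key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
key_buffer_size = 268435456    # from the crash report
read_buffer_size = 262144      # from the crash report
max_threads = 2000             # from the crash report
sort_buffer_size = 262144      # assumption: NOT shown in the report

estimate_bytes = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
print(estimate_bytes // 1024, "K bytes")  # prints: 1286144 K bytes
```

With the real sort_buffer_size from the server's configuration, the result would reproduce the 1823737 K figure in the report.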
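For readers following along: the frames in the backtrace are mangled C++ symbols. If GNU binutils is installed, they can be demangled with c++filt, e.g. for the NDB API frame nearest the fault:

```shell
# Demangle one mangled frame from the backtrace with c++filt (GNU binutils).
# The symbol is copied verbatim from the trace above.
echo '_ZN16NdbScanOperation18receiver_deliveredEP11NdbReceiver' | c++filt
# prints: NdbScanOperation::receiver_delivered(NdbReceiver*)
```

This is why the frame reads as NdbScanOperation::receiver_delivered, even though the bug title names receiver_completed.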