Bug #90083 mysqld got signal 11, _ZN16NdbScanOperation18receiver_completedEP11NdbReceiver
Submitted: 15 Mar 2018 7:22 Modified: 18 Apr 2018 5:57
Reporter: Hendrik Woltersdorf Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:7.4.19 OS:CentOS (6.4)
Assigned to: MySQL Verification Team CPU Architecture:x86 (x86_64)

[15 Mar 2018 7:22] Hendrik Woltersdorf
Description:
We experienced a crash with signal 11 in one of two SQL nodes of a MySQL-Cluster:

18:01:59 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=268435456
read_buffer_size=262144
max_used_connections=286
max_threads=2000
thread_count=86
connection_count=85
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1823737 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f1ed80009b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f1ef7edae00 thread_stack 0x40000
/opt/mysql/bin/mysqld(my_print_stacktrace+0x35)[0x953a15]
/opt/mysql/bin/mysqld(handle_fatal_signal+0x494)[0x6e0034]
/lib64/libpthread.so.0[0x3c6180f500]
/opt/mysql/bin/mysqld(_ZN16NdbScanOperation18receiver_completedEP11NdbReceiver+0x2a)[0xa41eba]
/opt/mysql/bin/mysqld(_ZN14NdbTransaction19receiveSCAN_TABCONFEPK12NdbApiSignalPKjj+0x133)[0xa4bc73]
/opt/mysql/bin/mysqld(_ZN7NdbImpl18trp_deliver_signalEPK12NdbApiSignalPK16LinearSectionPtr+0xc27)[0x9e9657]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade14deliver_signalEP12SignalHeaderhPjP16LinearSectionPtr+0xda)[0x9f785a]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry6unpackER24TransporterReceiveHandlePjjt7IOStateRb+0x6c1)[0xa6f411]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry14performReceiveER24TransporterReceiveHandle+0x2e2)[0xa6bf72]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade7do_pollEP10trp_clientjbb+0x56)[0x9f6cf6]
/opt/mysql/bin/mysqld(_ZN9PollGuard22wait_for_input_in_loopEib+0x6f)[0xa2ccff]
/opt/mysql/bin/mysqld(_ZN9PollGuard13wait_n_unlockEijjb+0x2a)[0xa2cf0a]
/opt/mysql/bin/mysqld(_ZN3Ndb11pollEvents2EiPy+0x75)[0xa4ca75]
/opt/mysql/bin/mysqld(_ZN3Ndb10pollEventsEiPy+0x6d)[0xa4cb5d]
/opt/mysql/bin/mysqld(_ZN17Ndb_binlog_thread6do_runEv+0x1afa)[0x9cb73a]
/opt/mysql/bin/mysqld(_ZN13Ndb_component8run_implEv+0x78)[0x9e0a78]
/opt/mysql/bin/mysqld(Ndb_component_run_C+0x19)[0x9e0bf9]
/lib64/libpthread.so.0[0x3c61807851]
/lib64/libc.so.6(clone+0x6d)[0x3c614e890d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

How to repeat:
I don't know.
[15 Mar 2018 11:11] MySQL Verification Team
Hi,

You have to collect the logs using ndb_error_reporter and upload them as from what you provided we can't conclude anything.

thanks in advance
Bogdan
[15 Mar 2018 11:30] Hendrik Woltersdorf
the ndb_error_reporter tarball as requested

Attachment: ndb_error_report_20180315121741.tar.bz2 (application/octet-stream, text), 948.76 KiB.

[15 Mar 2018 11:38] MySQL Verification Team
Thanks for the logs, please, while I examine the cluster logs, upload also the full mysql log from the node that crashed (unfortunately ndb_error_reporter does not collect logs from sql nodes, I apologize I forgot to tell you to get those in previous note).

all best
Bogdan
[15 Mar 2018 11:44] MySQL Verification Team
So the node 21 ( 10.40.55.131 ) is the one that crashed. Unfortunately cluster logs do not show anything useful. Let's see if we can find anything in the mysqld logs

all best
Bogdan

2018-03-14 18:26:44 [MgmtSrvr] INFO     -- Node 10: Local checkpoint 1280166 completed
2018-03-14 19:01:24 [MgmtSrvr] INFO     -- Node 10: Local checkpoint 1280167 started. Keep GCI = 138823143 oldest restorable GCI = 138823193
2018-03-14 19:01:59 [MgmtSrvr] ALERT    -- Node 11: Node 21 Disconnected
2018-03-14 19:01:59 [MgmtSrvr] INFO     -- Node 11: Communication to Node 21 closed
2018-03-14 19:01:59 [MgmtSrvr] INFO     -- Node 10: Communication to Node 21 closed
2018-03-14 19:02:00 [MgmtSrvr] ALERT    -- Node 10: Node 21 Disconnected
2018-03-14 19:02:01 [MgmtSrvr] INFO     -- Node 11: Node 22 Connected
2018-03-14 19:02:01 [MgmtSrvr] INFO     -- Node 10: Node 22 Connected
2018-03-14 19:02:01 [MgmtSrvr] INFO     -- Node 10: Node 22: API mysql-5.6.39 ndb-7.4.19
2018-03-14 19:02:01 [MgmtSrvr] INFO     -- Node 11: Node 22: API mysql-5.6.39 ndb-7.4.19
2018-03-14 19:02:03 [MgmtSrvr] INFO     -- Node 11: Communication to Node 21 opened
2018-03-14 19:02:03 [MgmtSrvr] INFO     -- Node 10: Communication to Node 21 opened
2018-03-14 19:02:36 [MgmtSrvr] INFO     -- Node 10: LDM(0): Completed LCP, #frags = 250 #records = 7866390, #bytes = 1055204136
2018-03-14 19:02:36 [MgmtSrvr] INFO     -- Node 11: LDM(0): Completed LCP, #frags = 250 #records = 7866390, #bytes = 1055204136
2018-03-14 19:02:36 [MgmtSrvr] INFO     -- Node 10: Local checkpoint 1280167 completed
2018-03-14 19:02:58 [MgmtSrvr] INFO     -- Node 22: Event buffer status: used=2040B(82%) alloc=2480B(0%) max=0B apply_epoch=138825334/2 latest_epoch=138825334/5
2018-03-14 19:02:58 [MgmtSrvr] INFO     -- Node 22: Event buffer status: used=2492B(100%) alloc=2492B(0%) max=0B apply_epoch=138825334/3 latest_epoch=138825334/6
[15 Mar 2018 11:46] Hendrik Woltersdorf
the complete log file of the crashed sql node

Attachment: eti-dus-ha-cl2.20180314.zip (application/x-zip-compressed, text), 1.18 MiB.

[15 Mar 2018 11:48] Hendrik Woltersdorf
Note, that this is a small test system, where we test (among other things) new versions of MySQL Cluster. The crash happened after a downtime of all applications , when we started every (micro-) service, that is accessing the cluster.
So it is hard to tell which statement did this.
[15 Mar 2018 11:55] MySQL Verification Team
Thanks for the log, let's see if we can figure out anything from it.

It's better that it's a test then production setup for sure :D. So what you do, if I remember correctly is you had "idle cluster" (seems so from the cluster log) and you then started the application(s) and put some load on to it.

From the cluster log the cluster behaved properly, but let's see why sql node died.

all best
Bogdan
[15 Mar 2018 12:02] Hendrik Woltersdorf
I roamed through all the logs of all services, that where started at this time.
The earliest statement, that suffered from the crash is:
delete from JOB_EXEC where ID_JOB_CONTROL = 23 and DATE_START < 2018-03-14 19:01:00.0 and COUNT_START = COUNT_END and ID_JOB_EXEC < 32118149

the table:
CREATE TABLE `JOB_EXEC` (
  `ID_JOB_EXEC` int(11) NOT NULL AUTO_INCREMENT,
  `ID_JOB_CONTROL` int(11) NOT NULL,
  `DATE_START` datetime DEFAULT NULL,
  `COUNT_START` int(11) DEFAULT NULL,
  `COUNT_END` int(11) DEFAULT NULL,
  `DATE_END` datetime DEFAULT NULL,
  PRIMARY KEY (`ID_JOB_EXEC`),
  KEY `REFERENCE_12_FK` (`ID_JOB_CONTROL`),
  KEY `IDX_START` (`DATE_START`),
  KEY `IDX_JC_DT` (`ID_JOB_CONTROL`,`DATE_START`)
) ENGINE=ndbcluster AUTO_INCREMENT=32130198 DEFAULT CHARSET=latin1;

But, as already said, I don't know, if this is the one causing the crash. Such statements run about 20 times every minute.
[17 Apr 2018 15:15] MySQL Verification Team
Hi,

I went trough and trough and trough this but I can't reproduce this nor I see from the log anything that's helping me reproduce this.

Did you by any chance experience this issue again on your test platform?

thanks
Bogdan
[18 Apr 2018 5:57] Hendrik Woltersdorf
It happened a few times again. Last time 2018-04-04 15:20:31:

13:20:30 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=268435456
read_buffer_size=262144
max_used_connections=260
max_threads=2000
thread_count=196
connection_count=195
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1823737 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fd0a80009b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fd0c54b2e00 thread_stack 0x40000
/opt/mysql/bin/mysqld(my_print_stacktrace+0x35)[0x953a15]
/opt/mysql/bin/mysqld(handle_fatal_signal+0x494)[0x6e0034]
/lib64/libpthread.so.0[0x3c6180f500]
/opt/mysql/bin/mysqld(_ZN16NdbScanOperation18receiver_deliveredEP11NdbReceiver+0x2a)[0xa41e5a]
/opt/mysql/bin/mysqld(_ZN14NdbTransaction19receiveSCAN_TABCONFEPK12NdbApiSignalPKjj+0x16a)[0xa4bcaa]
/opt/mysql/bin/mysqld(_ZN7NdbImpl18trp_deliver_signalEPK12NdbApiSignalPK16LinearSectionPtr+0xc27)[0x9e9657]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade14deliver_signalEP12SignalHeaderhPjP16LinearSectionPtr+0xda)[0x9f785a]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry6unpackER24TransporterReceiveHandlePjjt7IOStateRb+0x6c1)[0xa6f411]
/opt/mysql/bin/mysqld(_ZN19TransporterRegistry14performReceiveER24TransporterReceiveHandle+0x2e2)[0xa6bf72]
/opt/mysql/bin/mysqld(_ZN17TransporterFacade7do_pollEP10trp_clientjbb+0x56)[0x9f6cf6]
/opt/mysql/bin/mysqld(_ZN9PollGuard22wait_for_input_in_loopEib+0x6f)[0xa2ccff]
/opt/mysql/bin/mysqld(_ZN9PollGuard13wait_n_unlockEijjb+0x2a)[0xa2cf0a]
/opt/mysql/bin/mysqld(_ZN3Ndb11pollEvents2EiPy+0x75)[0xa4ca75]
/opt/mysql/bin/mysqld(_ZN3Ndb10pollEventsEiPy+0x6d)[0xa4cb5d]
/opt/mysql/bin/mysqld(_ZN17Ndb_binlog_thread6do_runEv+0x1afa)[0x9cb73a]
/opt/mysql/bin/mysqld(_ZN13Ndb_component8run_implEv+0x78)[0x9e0a78]
/opt/mysql/bin/mysqld(Ndb_component_run_C+0x19)[0x9e0bf9]
/lib64/libpthread.so.0[0x3c61807851]
/lib64/libc.so.6(clone+0x6d)[0x3c614e890d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

After this incident I stopped and restarted the mysqld process manually. Since than it is running ...