Bug #60382 All data nodes of MySQL Cluster went down
Submitted: 8 Mar 2011 1:28 Modified: 21 Mar 2016 21:55
Reporter: ws lee
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: mysql-5.1.32 ndb-6.3.24 OS: Solaris (10)
Assigned to: MySQL Verification Team CPU Architecture:Any

[8 Mar 2011 1:28] ws lee
Description:
I am using MySQL Cluster 6.3.24
with 2 data nodes and 2 SQL nodes.
Yesterday, both data nodes went down.
The error messages are below.

Node 4: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 5: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
I don't know why the data nodes went down.
[8 Mar 2011 1:34] ws lee
ndb_4_error.log 

Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation Fault
Error object: main.cpp
Program: /usr/local/mysql5.1.32-ndb6.3.24/bin/ndbd
Pid: 3682
Trace: /var/lib/mysql-cluster5.1.32-ndb6.3.24/ndb_4_trace.log.1
Version: mysql-5.1.32 ndb-6.3.24-GA
***EOM***

ndb_5_error.log 
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation Fault
Error object: main.cpp
Program: /usr/local/mysql5.1.32-ndb6.3.24/bin/ndbd
Pid: 7548
Trace: /var/lib/mysql-cluster5.1.32-ndb6.3.24/ndb_5_trace.log.3
Version: mysql-5.1.32 ndb-6.3.24-GA
***EOM***
[21 Mar 2016 21:54] MySQL Verification Team
The trace file shows:
....
DBLQH   002693 
DBTC    004152 
DBTUP   010029 

--------------- Signal ----------------
r.bn: 263 "API", r.proc: 4, r.sigId: 1997609488 gsn: 41 "Unknown" prio: 1
s.bn: 32774 "API", s.proc: 10, s.sigId: 0 length: 12 trace: 1 #sec: 0 fragInf: 0
 H'00000037 H'00000000 H'00000000 H'8006000a H'00000000 H'ffffffff H'ffffffff
 H'00000000 H'00000100 H'00000000 H'00000000 H'00000000
.....

Here you can see that the issue relates to global signal number (gsn) 41, which is reported as being of unknown type.

This means either that the gsn does not have an associated function (if it was a data node) or, since it was an API node, more likely that the signal was corrupted in some way.
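For anyone scanning a long trace file for the same symptom, the receiver line of each signal header can be picked apart with a regular expression. The following is a minimal sketch in Python, assuming the header layout is exactly as in the excerpt above. The function names (`parse_receiver_line`, `unknown_signals`) and the field names are made up for illustration and are not part of any NDB tool.

```python
import re

# Matches the receiver line of a signal header as printed in the trace
# excerpt above. The layout is assumed from that one sample line.
RECEIVER_RE = re.compile(
    r'r\.bn:\s*(?P<block_no>\d+)\s+"(?P<block_name>[^"]*)",\s*'
    r'r\.proc:\s*(?P<node>\d+),\s*'
    r'r\.sigId:\s*(?P<sig_id>\d+)\s+'
    r'gsn:\s*(?P<gsn>\d+)\s+"(?P<gsn_name>[^"]*)"\s+'
    r'prio:\s*(?P<prio>\d+)'
)


def parse_receiver_line(line):
    """Return the header fields as a dict, or None if the line does not match."""
    m = RECEIVER_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    for key in ("block_no", "node", "sig_id", "gsn", "prio"):
        info[key] = int(info[key])
    return info


def unknown_signals(lines):
    """Yield parsed headers whose signal name is printed as "Unknown"."""
    for line in lines:
        info = parse_receiver_line(line)
        if info and info["gsn_name"] == "Unknown":
            yield info


# The receiver line from the trace in this report.
sample = 'r.bn: 263 "API", r.proc: 4, r.sigId: 1997609488 gsn: 41 "Unknown" prio: 1'
for sig in unknown_signals([sample]):
    print(sig["gsn"], sig["node"])  # gsn 41, received on node 4
```

To run it over a whole trace file, pass an open file object as `lines`; lines that are not receiver headers are skipped.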

Both situations are addressed by upgrading to at least version 7.0, where gsn 41 is implemented (it does not exist in earlier versions) and checksumming can be used to check whether corruption is happening on the network.
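To illustrate why checksumming helps here, the sketch below computes a simple XOR over the twelve 32-bit words of the signal dumped above and shows that a single flipped bit changes the result. This is only a generic illustration of the idea: the XOR scheme and the helper names (`parse_words`, `xor_checksum`) are assumptions for the example, and nothing in this report says it is how the cluster transporter computes its checksum.

```python
from functools import reduce
from operator import xor


def parse_words(dump):
    """Turn H'xxxxxxxx tokens from a trace dump into a list of integers."""
    return [int(tok[2:], 16) for tok in dump.split() if tok.startswith("H'")]


def xor_checksum(words):
    """XOR all 32-bit words together (illustrative scheme, an assumption)."""
    return reduce(xor, words, 0) & 0xFFFFFFFF


# Signal data words copied from the trace in this report (length: 12).
dump = """
 H'00000037 H'00000000 H'00000000 H'8006000a H'00000000 H'ffffffff H'ffffffff
 H'00000000 H'00000100 H'00000000 H'00000000 H'00000000
"""
words = parse_words(dump)
sent = xor_checksum(words)

# Simulate corruption on the network by flipping one bit in the first word.
corrupted = list(words)
corrupted[0] ^= 0x00000008

# The receiver recomputes the checksum and compares it with the one sent.
print(len(words), xor_checksum(corrupted) != sent)
```

With a check like this on the receiving side, a damaged signal is rejected at the transport layer instead of being dispatched with a signal number the node cannot handle.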