MySQL Bugs: #73339: NDB data node crashes after upgrade from 7.2.13 to 7.3.5

Bug #73339	NDB data node crashes after upgrade from 7.2.13 to 7.3.5 - illegal signal
Submitted:	21 Jul 2014 9:54	Modified:	23 Mar 2016 14:05
Reporter:	Christian Ehmig	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.3.6	OS:	Linux (2.6.32)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Two weeks after uprading our 4 node NDB cluster from version 7.2.13 to 7.3.5, it seems like that each data node crashes in turn.
The first data node crashed on 18/07/2014 and had successfuly restarted.
Today, the second node died and right after that, the remaining node of the same nodegroup which resulted in a crash of the whole cluster.

Hardware specs for NDB nodes:
- 256 GB RAM
- Intel E6540 2.0GHz with HT enabled (12 cores)
- Running stable for two weeks after upgrade

Standard Log (node 4)

2014-07-21 08:29:12 [ndbd] ERROR    -- c_gcp_list.seize() failed: gci: 27847945177006088 nodes: 0000000000000000000000000000000000002000000180080140082e062a0000
2014-07-21 08:29:12 [ndbd] WARNING  -- ACK wo/ gcp record (gci: 6483855/8) ref: 0fa20050 from: 0fa20050
2014-07-21 08:29:12 [ndbd] WARNING  -- ACK wo/ gcp record (gci: 6483855/8) ref: 0fa20015 from: 0fa20015
2014-07-21 08:29:12 [ndbd] WARNING  -- ACK wo/ gcp record (gci: 6483855/8) ref: 0fa2002b from: 0fa2002b
...
2014-07-21 08:31:30 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=100
2014-07-21 08:31:30 [ndbd] INFO     -- Watchdog: User time: 13659651297  System time: 912081437014
2014-07-21 08:31:30 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=200
2014-07-21 08:31:30 [ndbd] INFO     -- Watchdog: User time: 13659651297  System time: 912081437014
2014-07-21 08:31:30 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=300
2014-07-21 08:31:30 [ndbd] INFO     -- Watchdog: User time: 13659651297  System time: 912081437014
2014-07-21 08:31:30 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=400
2014-07-21 08:31:30 [ndbd] INFO     -- Watchdog: User time: 13659651297  System time: 912081437014
...
2014-07-21 08:31:48 [ndbd] INFO     -- Watchdog: User time: 13659651297  System time: 912081437014
Warning: 1 thread(s) did not stop before starting crash dump.
2014-07-21 08:31:51 [ndbd] INFO     -- Job Handling
2014-07-21 08:31:51 [ndbd] INFO     -- /export/home/pb2/build/sb_0-11918329-1396368634.9/mysql-cluster-gpl-7.3.5/storage/ndb/src/kernel/vm/WatchDog.cpp
2014-07-21 08:31:51 [ndbd] INFO     -- Watchdog shutting down system
2014-07-21 08:31:51 [ndbd] INFO     -- Watchdog shutdown completed - exiting
2014-07-21 08:32:43 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Cluster MGM log

2014-07-21 08:32:00 [MgmtSrvr] WARNING  -- Node 5: Node 4 missed heartbeat 2
2014-07-21 08:32:05 [MgmtSrvr] INFO     -- Node 6: Data usage is 24%(1438323 32K pages of total 5856000)
2014-07-21 08:32:05 [MgmtSrvr] INFO     -- Node 6: Index usage is 15%(1084362 8K pages of total 6784128)
2014-07-21 08:32:05 [MgmtSrvr] WARNING  -- Node 5: Node 4 missed heartbeat 3
2014-07-21 08:32:09 [MgmtSrvr] INFO     -- Node 5: Data usage is 24%(1438332 32K pages of total 5856000)
2014-07-21 08:32:09 [MgmtSrvr] INFO     -- Node 5: Index usage is 15%(1084388 8K pages of total 6784128)
2014-07-21 08:32:10 [MgmtSrvr] WARNING  -- Node 5: Node 4 missed heartbeat 4
2014-07-21 08:32:10 [MgmtSrvr] ALERT    -- Node 5: Node 4 declared dead due to missed heartbeat
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2014-07-21 08:32:10 [MgmtSrvr] ALERT    -- Node 3: Node 4 Disconnected
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 3: Communication to Node 10 closed
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 3: Communication to Node 10 closed
2014-07-21 08:32:10 [MgmtSrvr] ALERT    -- Node 3: Node 10 Disconnected
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: Communication to Node 4 closed
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: Communication to Node 4 closed
2014-07-21 08:32:10 [MgmtSrvr] ALERT    -- Node 5: Arbitration check won - node group majority
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: President restarts arbitration thread [state=6]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: GCP Take over started
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: Node 5 taking over as DICT master
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: LCP Take over started
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: ParticipatingDIH = 0000000000000000
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: ParticipatingLQH = 0000000000000000
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_From_Master_Received = 1
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: LCP Take over completed (state = 4)
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: ParticipatingDIH = 0000000000000000
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: ParticipatingLQH = 0000000000000000
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: m_LCP_COMPLETE_REP_From_Master_Received = 1
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: GCP Monitor: unlimited lags allowed
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: GCP Take over completed
2014-07-21 08:32:10 [MgmtSrvr] INFO     -- Node 5: kk: 6483986/3 2 0
...
2014-07-21 08:32:43 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2014-07-21 08:32:45 [MgmtSrvr] WARNING  -- Node 6: Node 5 missed heartbeat 2
2014-07-21 08:32:45 [MgmtSrvr] INFO     -- Node 6: Communication to Node 8 opened
2014-07-21 08:32:45 [MgmtSrvr] INFO     -- Node 6: Communication to Node 8 opened
2014-07-21 08:32:50 [MgmtSrvr] WARNING  -- Node 6: Node 5 missed heartbeat 3
2014-07-21 08:32:55 [MgmtSrvr] WARNING  -- Node 6: Node 5 missed heartbeat 4
2014-07-21 08:32:55 [MgmtSrvr] ALERT    -- Node 6: Node 5 declared dead due to missed heartbeat
2014-07-21 08:32:55 [MgmtSrvr] INFO     -- Node 6: Communication to Node 3 closed
2014-07-21 08:32:55 [MgmtSrvr] INFO     -- Node 6: Communication to Node 5 closed
2014-07-21 08:32:59 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2301: 'Assertion(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2014-07-21 08:33:01 [MgmtSrvr] ALERT    -- Node 2: Node 6 Disconnected
2014-07-21 08:33:11 [MgmtSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2014-07-21 08:33:28 [MgmtSrvr] ALERT    -- Node 6: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2014-07-21 08:34:29 [MgmtSrvr] WARNING  -- Failed to allocate nodeid for API at 127.0.0.1. Returned error: 'No free node id found for mysqld(API).'

Any ideas?

Thanks in advance!

Christian

How to repeat:
Currently, this bug/crash cannot be repeated.

Upgraded the cluster (data and mgm nodes only) to 7.3.6. Nodes keep crashing.

Time: Thursday 24 July 2014 - 10:31:41
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: /export/home/pb2/build/sb_0-12598553-1404311822.99/mysql-cluster-gpl-7.3.6/storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbmtd
Pid: 7188
Version: mysql-5.6.19 ndb-7.3.6

New errors on other nodes:

(Version 7.3.5)
tatus: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Illegal signal received (GSN 40 not added)
Error object: Illegal signal received (GSN 40 not added)
Program: ndbmtd
Pid: 6029 thr: 0
Version: mysql-5.6.17 ndb-7.3.5
Trace: /mnt/data/cluster/ndb_6_trace.log.8 [t1..t10]

(Version 7.3.6)

Time: Thursday 24 July 2014 - 08:01:49
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8862) 0x00000006
Program: ndbmtd
Pid: 19753 thr: 3
Version: mysql-5.6.19 ndb-7.3.6
Trace: /mnt/data/cluster/ndb_6_trace.log.9 [t1..t10]
***EOM***

ndb config

Attachment: ndb_mgmd.cnf (application/octet-stream, text), 3.87 KiB.

node 6 trace 8

Attachment: ndb_6_trace.log.8.tgz (application/x-compressed, text), 617.83 KiB.

node 6 trace 9

Attachment: ndb_6_trace.log.9.tgz (application/x-compressed, text), 930.47 KiB.

ndb 3 trace 9

Attachment: ndb_3_trace.log.9.tgz (application/x-compressed, text), 575.94 KiB.

Whole cluster crashed again (better to say, the two remaining nodes crashed after the first two nodes crashed this morning).

Time: Thursday 24 July 2014 - 22:13:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 8862) 0x00000006
Program: ndbmtd
Pid: 22137 thr: 3
Version: mysql-5.6.19 ndb-7.3.6
Trace: /mnt/data/cluster/ndb_4_trace.log.9 [t1..t10]
***EOM***

Hi,

the 

(Version 7.3.5)
tatus: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Illegal signal received (GSN 40 not added)
Error object: Illegal signal received (GSN 40 not added)
Program: ndbmtd
Pid: 6029 thr: 0
Version: mysql-5.6.17 ndb-7.3.5
Trace: /mnt/data/cluster/ndb_6_trace.log.8 [t1..t10]

crash is due to a bug fixed in 7.3.6 (18455971). 

The other crashes I don't have enough information to see why they happen.

Do you have a way to reproduce them? This BUG report is rather old, did you see these crashes happen again since? On your latest installation?

all best
Bogdan Kecman

The

Time: Thursday 24 July 2014 - 10:31:41
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: /export/home/pb2/build/sb_0-12598553-1404311822.99/mysql-cluster-gpl-7.3.6/storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbmtd
Pid: 7188
Version: mysql-5.6.19 ndb-7.3.6

crash is due to overload of the system or miss configuration of one. To fix this one I suggest you contact Oracle MySQL Cluster Support team.

all best
Bogdan Kecman

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".