Bug #68151 Fresh 7.2.10 cluster rolling restart all datanodes crash in DBSPJ DbspjMain.cpp
Submitted: 23 Jan 2013 3:28
Modified: 18 Jul 2016 16:30
Reporter: Carl Krumins
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 7.2.10
OS: Linux (generic x64 Linux binary on Ubuntu)
Assigned to: MySQL Verification Team
CPU Architecture: Any
Tags: crash, DBSPJ, DbspjMain.cpp, error 2341

[23 Jan 2013 3:28] Carl Krumins
Description:
Hi

In the past, on older NDB versions, only 3 of the 4 data nodes (in 2 node groups) could usually run at once, so I did a fresh install of the latest version to help isolate the problem:

Fresh install of mysql-5.5.29 ndb-7.2.10 to start from scratch (with fresh log files), using the generic precompiled Linux x64 binaries (on Ubuntu).

Ran mysqldump --hex-blob -A on all the old data, to help ensure the data is as 'clean' as possible and that no corrupt binary data is re-loaded from an ndb_restore. The data is a mix of mostly in-memory tables, some disk data tablespaces, and some BLOB/TEXT columns - not sure if that makes a difference.

Started the fresh 7.2.10 cluster from scratch with all 4 data nodes up (in 2 node groups) - a new, empty cluster.

Loaded the data in: mysql < mysqldump.of.data.sql    (the uncompressed .sql dump is about 80 GB)
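The dump-and-reload step above can be sketched as a dry run. The commands are echoed rather than executed (they need a live cluster), and the file name follows this report; run the printed commands against your own servers.

```shell
# Dry-run sketch of the dump/reload step from this report.
DUMPFILE=mysqldump.of.data.sql

# --hex-blob dumps BLOB/TEXT values as hex literals so binary data survives
# the dump/reload round trip; -A (--all-databases) dumps every database.
DUMP_CMD="mysqldump --hex-blob -A > $DUMPFILE"
LOAD_CMD="mysql < $DUMPFILE"

echo "$DUMP_CMD"
echo "$LOAD_CMD"
```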

Began a rolling restart of the data nodes one at a time (completed in this order - not sure if this makes a difference either):
From ndb_mgm: 2 stop
then restart ndbmtd from command line - data node 2 rejoins fine.
ndb_mgm: 3 stop
restart ndbmtd from command line - data node 3 rejoins fine.
ndb_mgm: 5 stop
restart ndbmtd from command line - data node 5 rejoins fine.

Then went to restart the final node:
ndb_mgm: 4 stop
Data node 4 begins its shutdown process, and then all other data nodes (2, 3, 5) crash (as per the log below).
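The rolling-restart sequence above can be sketched as a dry run. The node IDs (2, 3, 5, then 4) are the ones from this report; the commands are echoed rather than executed, since each graceful stop goes through the management client and each ndbmtd restart happens on that node's own host.

```shell
# Dry-run sketch of the rolling restart order used in this report.
RESTART_ORDER="2 3 5 4"

for NODE in $RESTART_ORDER; do
  echo "ndb_mgm -e '$NODE stop'"   # graceful stop from the management client
  echo "ndbmtd"                    # restart the data node binary on its host
done
```

In practice you would wait for each node to report Started (e.g. via ndb_mgm's SHOW output) before stopping the next one.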

2013-01-22 19:07:10 [MgmtSrvr] INFO     -- Node 4: Data usage is 77%(958533 32K pages of total 1241600)
2013-01-22 19:07:10 [MgmtSrvr] INFO     -- Node 4: Index usage is 89%(330658 8K pages of total 371456)
2013-01-22 19:07:19 [MgmtSrvr] INFO     -- Node 4: Node shutdown initiated
2013-01-22 19:07:21 [MgmtSrvr] INFO     -- Node 3: Data usage is 75%(938156 32K pages of total 1241600)
2013-01-22 19:07:21 [MgmtSrvr] INFO     -- Node 3: Index usage is 88%(329360 8K pages of total 371456)
2013-01-22 19:07:28 [MgmtSrvr] INFO     -- Node 2: Data usage is 75%(938334 32K pages of total 1241600)
2013-01-22 19:07:28 [MgmtSrvr] INFO     -- Node 2: Index usage is 88%(329418 8K pages of total 371456)
2013-01-22 19:07:35 [MgmtSrvr] INFO     -- Node 4: Suma: initiate handover for shutdown with nodes 0000000000000020 GCI: 122558
2013-01-22 19:07:35 [MgmtSrvr] INFO     -- Node 4: Suma: handover to node 5 gci: 122558 buckets: 00000001 (2)
2013-01-22 19:07:36 [MgmtSrvr] INFO     -- Node 5: Data usage is 75%(940295 32K pages of total 1241600)
2013-01-22 19:07:36 [MgmtSrvr] INFO     -- Node 5: Index usage is 88%(329602 8K pages of total 371456)
2013-01-22 19:07:40 [MgmtSrvr] INFO     -- Node 4: Suma: handover complete
2013-01-22 19:07:40 [MgmtSrvr] INFO     -- Node 4: Data usage is 77%(958536 32K pages of total 1241600)
2013-01-22 19:07:40 [MgmtSrvr] INFO     -- Node 4: Index usage is 89%(330658 8K pages of total 371456)
2013-01-22 19:07:51 [MgmtSrvr] INFO     -- Node 3: Data usage is 75%(938166 32K pages of total 1241600)
2013-01-22 19:07:51 [MgmtSrvr] INFO     -- Node 3: Index usage is 88%(329363 8K pages of total 371456)
2013-01-22 19:07:58 [MgmtSrvr] INFO     -- Node 2: Data usage is 75%(938348 32K pages of total 1241600)
2013-01-22 19:07:58 [MgmtSrvr] INFO     -- Node 2: Index usage is 88%(329421 8K pages of total 371456)
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 4: Node shutdown completed.
2013-01-22 19:08:02 [MgmtSrvr] ALERT    -- Node 2: Node 4 Disconnected
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 5: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 5: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 5: Communication to Node 4 closed
2013-01-22 19:08:02 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2013-01-22 19:08:02 [MgmtSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: GCP Take over started
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: Node 2 taking over as DICT master
2013-01-22 19:08:02 [MgmtSrvr] ALERT    -- Node 5: Node 4 Disconnected
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: LCP Take over started
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: ParticipatingDIH = 000000000000002c
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: ParticipatingLQH = 000000000000002c
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=3 000000000000002c]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_From_Master_Received = 0
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: LCP Take over completed (state = 5)
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: ParticipatingDIH = 000000000000002c
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: ParticipatingLQH = 000000000000002c
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=3 000000000000002c]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=3 000000000000002c]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=3 000000000000002c]
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: m_LCP_COMPLETE_REP_From_Master_Received = 0
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: GCP Monitor: unlimited lags allowed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: GCP Take over completed
2013-01-22 19:08:02 [MgmtSrvr] INFO     -- Node 2: kk: 122558/10 0 0
2013-01-22 19:08:02 [MgmtSrvr] ALERT    -- Node 3: Node 4 Disconnected
2013-01-22 19:08:25 [MgmtSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2013-01-22 19:08:25 [MgmtSrvr] ALERT    -- Node 1: Node 5 Disconnected
2013-01-22 19:08:26 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2013-01-22 19:08:26 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2013-01-22 19:08:26 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2013-01-22 19:08:26 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

The cluster completely crashed and shut down (and then began to auto-restart as expected).

Data nodes 2, 3, and 5 all report the same error:

Time: Tuesday 22 January 2013 - 19:08:02
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbspjMain.cpp
Error object: DBSPJ (Line: 2746) 0x00000000
Program: ndbmtd
Pid: 3081452 thr: 12
Version: mysql-5.5.29 ndb-7.2.10
Trace: /data/mysqlcluster//ndb_2_trace.log.1 [t1..t17]
***EOM***

Full ndb_error_reporter logs attached.
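For reference, the attached logs would have been collected with ndb_error_reporter, which gathers data node and management node logs from all hosts given the cluster config file. The config path below is an assumption based on this cluster's data directory; the command is echoed rather than executed.

```shell
# Hypothetical ndb_error_reporter invocation (config path is an assumption).
CONFIG=/data/mysqlcluster/config.ini
echo "ndb_error_reporter $CONFIG"
```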

How to repeat:
As above.

Suggested fix:
Not sure.
[18 Jul 2016 16:30] MySQL Verification Team
Can reproduce on 7.2.10 fairly easily.
Can't reproduce on 7.4.10, so setting this as "Can't repeat".