MySQL Bugs: #79646: Mysql Cluster Data Node Shutdowns Immediately After Being Started

Bug #79646	Mysql Cluster Data Node Shutdowns Immediately After Being Started
Submitted:	15 Dec 2015 11:10	Modified:	30 Dec 2015 13:53
Reporter:	Serhat Demircan
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.4.6	OS:	Debian (Wheezy)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	MySQL Cluster, ndbcluster

Description:
I have a MySQL Cluster with 4 API nodes, 2 management nodes and 8 data nodes. Today I stopped a data node for maintenance. After I started it again, the data node shut down immediately with the following errors. I could not find a way to bring this data node online.

==> /ndbdata/ndb_5_error.log <==
Time: Tuesday 15 December 2015 - 01:58:45
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtcMain.cpp
Error object: DBTC (Line: 19291) 0x00000002
Program: ndbmtd
Pid: 7546 thr: 12
Version: mysql-5.6.24 ndb-7.4.6
Trace: /ndbdata/ndb_5_trace.log.5 [t1..t15]
***EOM***

==> /ndbdata/ndb_5_out.log <==
2015-12-15 01:58:08 [ndbd] INFO     -- Node started
2015-12-15 01:58:45 [ndbd] INFO     -- /export/home/pb2/build/sb_0-14878975-1427910955.8/mysql-cluster-gpl-7.4.6/storage/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp
2015-12-15 01:58:45 [ndbd] INFO     -- DBTC (Line: 19291) 0x00000002
2015-12-15 01:58:45 [ndbd] INFO     -- Error handler shutting down system
2015-12-15 01:58:45 [ndbd] INFO     -- Error handler shutdown completed - exiting
2015-12-15 01:58:48 [ndbd] DEBUG    -- Angel got child 7546
2015-12-15 01:58:48 [ndbd] DEBUG    -- error: 2341, signal: 0, sphase: 255
2015-12-15 01:58:48 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
On a cluster that has 8 data nodes, with my configuration:
1 - Stop a data node
2 - Then start it again

Trace Log

Attachment: ndb_5_trace.log.5 (application/octet-stream, text), 1003.90 KiB.

Cluster config

Attachment: config.ini (application/octet-stream, text), 3.77 KiB.

ping

Hi,
Thanks for reporting this bug.
I cannot reproduce this on a new 8-node cluster.
I assume you can't reproduce the problem either; can you confirm that? I don't consider it a "reproduction" if you just can't start the data node that is now in a crashed state.

If I understand you correctly, you have an 8-node cluster and one node is now unable to start (the other 7 nodes are running properly). To get out of this situation, start the problematic node with --initial: it will clear the file system on the failing node, reload all data from a replica inside the cluster, and start the node properly. Once you have a properly running cluster, if stopping a node brings you to the same situation again (the node won't start again), then that is a "reproduction" of this issue.

all best
Bogdan Kecman
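
The recovery step suggested above can be sketched as a pair of commands. This is only a sketch under assumptions: node ID 5 is taken from the error log in the report, while the management server address `mgm-host:1186` and the two helper function names are hypothetical placeholders, not values from the attached config.ini. The script prints the commands for review instead of running them.

```shell
#!/bin/sh
# Sketch: rebuild one data node from a replica, as suggested in the reply.
# NODE_ID is the failing node from the report; MGM_HOST is an assumed placeholder.
NODE_ID=5
MGM_HOST="mgm-host:1186"

# Build the command line for an initial node restart (hypothetical helper).
# --initial clears this node's NDB file system so its data is reloaded
# from a replica elsewhere in the cluster.
initial_restart_cmd() {
    printf 'ndbmtd --initial --ndb-nodeid=%s --ndb-connectstring=%s\n' "$1" "$2"
}

# Build the management-client command that reports the node's status
# (hypothetical helper).
status_cmd() {
    printf 'ndb_mgm --ndb-connectstring=%s -e "%s STATUS"\n' "$2" "$1"
}

# Print the commands rather than executing them.
initial_restart_cmd "$NODE_ID" "$MGM_HOST"
status_cmd "$NODE_ID" "$MGM_HOST"
```

Run the printed ndbmtd command on the host of the failing data node only. Since --initial discards that node's local data, it should be used only while the other nodes holding the replica are up.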