Bug #74435 Data Node crashed and can not startup
Submitted: 18 Oct 2014 4:00 Modified: 3 Nov 2015 18:20
Reporter: ray liu Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:NDB 7.3.4 OS:Linux (CentOS release 6.5 (Final))
Assigned to: MySQL Verification Team CPU Architecture:Any

[18 Oct 2014 4:00] ray liu
Description:
Data Node crashed, and can not startup.

How to repeat:

----------------------------------------------------------------------
Crash: Data Node Log:

node: 3 tab: 2642 frag: 1 no lcp to restore
node: 3 tab: 2706 frag: 0 no lcp to restore
node: 3 tab: 2706 frag: 1 no lcp to restore
node: 3 tab: 2735 frag: 0 no lcp to restore
node: 3 tab: 2735 frag: 1 no lcp to restore
execSTART_RECREQ chaning srnodes from 000000000000000c to 0000000000000008
alloc_chunk(15894 16) - 
2014-10-18 11:24:49 [ndbd] INFO     -- Start phase 5 completed
2014-10-18 11:24:49 [ndbd] INFO     -- Start phase 6 completed
2014-10-18 11:24:49 [ndbd] INFO     -- Start phase 7 completed
2014-10-18 11:24:54 [ndbd] INFO     -- Start phase 8 completed
2014-10-18 11:24:54 [ndbd] INFO     -- Start phase 9 completed
2014-10-18 11:24:54 [ndbd] INFO     -- Start phase 100 completed
2014-10-18 11:24:55 [ndbd] INFO     -- Suma: handover from node 2 gci: 10199688 buckets: 00000002 (2)
10199688/0 (10199687/4294967295) switchover complete bucket 1 state: 1starting
2014-10-18 11:24:58 [ndbd] INFO     -- Start phase 101 completed
2014-10-18 11:24:58 [ndbd] INFO     -- Node started
2014-10-18 11:27:47 [ndbd] INFO     -- findNeighbours from: 4907 old (left: 2 right: 2) new (65535 65535)
2014-10-18 11:27:47 [ndbd] ALERT    -- Network partitioning - arbitration required
2014-10-18 11:27:47 [ndbd] INFO     -- President restarts arbitration thread [state=7]
2014-10-18 11:27:47 [ndbd] ALERT    -- Arbitration won - positive reply from node 1
start_resend(0, empty bucket (10199852/4 10199852/3) -> active
Finished with handling node-failure
execGCP_NOMORETRANS(10199852/4) c_ongoing_take_over_cnt -> seize
completing gcp 10199852/4 in execTAKE_OVERTCCONF
2014-10-18 11:27:47 [ndbd] INFO     -- Started arbitrator node 1 [ticket=2a380005341d6797]
--------------------------------------------------------------
Crash: mgm Node log:

2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: DICT: enable FK 2740 done (466/2735/fk_acr_counpon_account_AcrStoreId)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: DICT: enable FK 2742 done (2642/2585/fk_acr_order_detail_AcrOrderId)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: DICT: enable FK 2744 done (2642/2594/fk_acr_order_coupon_AcrOrderId)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: Start phase 8 completed (initial node restart)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: Start phase 9 completed (initial node restart)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: Start phase 100 completed (initial node restart)
2014-10-18 11:24:48 [MgmtSrvr] INFO     -- Node 3: Node 1: API mysql-5.6.15 ndb-7.3.4
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 opened
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Communication to Node 5 opened
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Communication to Node 6 opened
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Communication to Node 7 opened
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Communication to Node 8 opened
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Node 5 Connected
2014-10-18 11:24:49 [MgmtSrvr] INFO     -- Node 3: Node 6 Connected
2014-10-18 11:24:50 [MgmtSrvr] INFO     -- Node 3: Node 5: API mysql-5.6.15 ndb-7.3.4
2014-10-18 11:24:50 [MgmtSrvr] INFO     -- Node 3: Node 6: API mysql-5.6.15 ndb-7.3.4
2014-10-18 11:24:50 [MgmtSrvr] INFO     -- Node 3: Suma: initiate handover for startup with nodes 0000000000000004 GCI: 10199688
2014-10-18 11:24:50 [MgmtSrvr] INFO     -- Node 3: Suma: handover from node 2 gci: 10199688 buckets: 00000002 (2)
2014-10-18 11:24:52 [MgmtSrvr] INFO     -- Node 3: Start phase 101 completed (initial node restart)
2014-10-18 11:24:52 [MgmtSrvr] INFO     -- Node 3: Started (mysql-5.6.15 ndb-7.3.4)
2014-10-18 11:24:52 [MgmtSrvr] INFO     -- Node 2: DICT: unlocked by node 3 for NodeRestart
2014-10-18 11:27:41 [MgmtSrvr] ALERT    -- Node 3: Node 2 Disconnected
2014-10-18 11:27:41 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 closed
2014-10-18 11:27:41 [MgmtSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2014-10-18 11:27:41 [MgmtSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2014-10-18 11:27:42 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2014-10-18 11:27:42 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2014-10-18 11:27:42 [MgmtSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: GCP Take over started
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: Node 3 taking over as DICT master
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: GCP Monitor: unlimited lags allowed
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: GCP Take over completed
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: kk: 10199852/4 0 0
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: LCP Take over started
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: LCP Take over completed (state = 4)
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2014-10-18 11:27:42 [MgmtSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=2a380005341d6797]
2014-10-18 11:28:40 [MgmtSrvr] WARNING  -- Node 3: Failure handling of node 2 has not completed in 1 min - state = 6
2014-10-18 11:29:41 [MgmtSrvr] WARNING  -- Node 3: Failure handling of node 2 has not completed in 2 min - state = 6
2014-10-18 11:30:11 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened
----------------------------------------------------------------
start error: data node:
[root@centos01 ~]# ndbd
2014-10-18 11:43:49 [ndbd] INFO     -- Angel connected to '192.168.1.10:1186'
2014-10-18 11:43:49 [ndbd] INFO     -- Angel allocated nodeid: 2
[root@centos01 ~]# tail -f /var/lib/mysql-cluster/ndb_1_cluster.log
2014-10-18 11:43:51 [MgmtSrvr] INFO     -- Node 3: Node 2 Connected
2014-10-18 11:43:51 [MgmtSrvr] INFO     -- Node 2: CM_REGCONF president = 3, own Node = 2, our dynamic id = 0/6
2014-10-18 11:43:51 [MgmtSrvr] INFO     -- Node 3: Node 2: API mysql-5.6.15 ndb-7.3.4
2014-10-18 11:43:51 [MgmtSrvr] INFO     -- Node 2: Node 3: API mysql-5.6.15 ndb-7.3.4
2014-10-18 11:43:51 [MgmtSrvr] INFO     -- Node 2: Start phase 1 completed 
2014-10-18 11:43:52 [MgmtSrvr] INFO     -- Node 2: Start phase 2 completed (node restart)
2014-10-18 11:43:52 [MgmtSrvr] INFO     -- Node 3: DICT: locked by node 2 for NodeRestart
2014-10-18 11:43:52 [MgmtSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2014-10-18 11:43:52 [MgmtSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2014-10-18 11:43:52 [MgmtSrvr] INFO     -- Node 2: Receive arbitrator node 1 [ticket=2a380005341d6797]
2014-10-18 11:45:59 [MgmtSrvr] INFO     -- Node 2: Starting to restore schema
2014-10-18 11:46:05 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2014-10-18 11:46:05 [MgmtSrvr] ALERT    -- Node 3: Node 2 Disconnected
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 closed
2014-10-18 11:46:05 [MgmtSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2014-10-18 11:46:05 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2014-10-18 11:46:05 [MgmtSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: Removed lock for node 2
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: DICT: remove lock by failed node 2 for NodeRestart
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: DICT: unlocked by node 2 for NodeRestart
2014-10-18 11:46:05 [MgmtSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=2a380006342e3c9d]
2014-10-18 11:46:08 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened

--------------------------
[root@centos01 ~]# tail -n 50 /var/lib/mysql-cluster/ndb_2_error.log 
Pid: 6557
Version: mysql-5.6.15 ndb-7.3.4
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.7 [t1..t1]
***EOM*
Time: Wednesday 30 July 2014 - 16:31:56
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: /export/home/pb2/build/sb_0-11331885-1390486620.31/mysql-cluster-gpl-7.3.4/storage/ndb/src/kernel/ndbd.cpp
Program: ndbd
Pid: 6655
Version: mysql-5.6.15 ndb-7.3.4
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.8 [t1..t1]
***EOM*
Time: Monday 29 September 2014 - 17:31:28
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 6068) 0x00000002
Program: ndbd
Pid: 1408
Version: mysql-5.6.15 ndb-7.3.4
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.9 [t1..t1]
***EOM***
                      
Time: Saturday 18 October 2014 - 11:27:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtcMain.cpp
Error object: DBTC (Line: 18290) 0x00000002
Program: ndbd
Pid: 1413
Version: mysql-5.6.15 ndb-7.3.4
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.10 [t1..t1]
***EOM***
                                                                               
Time: Saturday 18 October 2014 - 11:46:04
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: /export/home/pb2/build/sb_0-11331885-1390486620.31/mysql-cluster-gpl-7.3.4/storage/ndb/src/kernel/ndbd.cpp
Program: ndbd
Pid: 10972
Version: mysql-5.6.15 ndb-7.3.4
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.11 [t1..t1]
***
[18 Oct 2014 4:01] ray liu
ndb_error_report_20141018115045.tar

Attachment: ndb_error_report_20141018115045.tar.bz2 (application/octet-stream, text), 654 bytes.

[18 Oct 2014 4:05] ray liu
ndb data node trace log

Attachment: Node_log_and_trace_log.7z (application/octet-stream, text), 145.82 KiB.

[3 Nov 2014 6:18] ray liu
ndb_error_report

Attachment: ndb_error_report_20141103141058.tar.bz2 (application/octet-stream, text), 1.73 MiB.

[20 Oct 2015 13:00] MySQL Verification Team
looks like corrupted datafiles on the node that won't start. To solve the problem you start that datanode with --initial to let it clean the datadir and fetch data from another running datanode

looking trough logs to see if we have a bug here or not but your "path is clear"

take care
bogdan kecman
[3 Nov 2015 18:20] MySQL Verification Team
Hi,

I can not confirm that this is not a bug but I neither can reproduce this nor verify that this is one so setting status to "can't repeat". If you manage to reproduce this issue yourself please update the bug with new set of logs

kind regards
Bogdan Kecman