Bug #74627 Data Node crash and can not startup, can start use ndbd --initial
Submitted: 30 Oct 2014 3:56 Modified: 3 Nov 2015 17:51
Reporter: ray liu Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:NDB 7.3.5 OS:Linux (CentOS release 6.5 (Final))
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: crash

[30 Oct 2014 3:56] ray liu
Description:

Crashed data node log:

2014-10-21 19:49:20 [ndbd] INFO     -- clearing SumaStartMe dict lock for 4
2014-10-21 19:49:26 [ndbd] INFO     -- granting SumaStartMe dict lock to 4
prepare to handover bucket: 1
8914530/0 (8914529/4294967295) switchover complete bucket 1 state: 2handover
2014-10-21 19:49:28 [ndbd] INFO     -- clearing SumaStartMe dict lock for 4
2014-10-22 13:36:40 [ndbd] INFO     -- /export/home/pb2/build/sb_0-11918329-1396368634.9/mysql-cluster-gpl-7.3.5/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp
2014-10-22 13:36:40 [ndbd] INFO     -- DBDICT (Line: 19143) 0x00000006
2014-10-22 13:36:40 [ndbd] INFO     -- Error handler shutting down system
2014-10-22 13:36:40 [ndbd] INFO     -- Error handler shutdown completed - exiting
2014-10-22 13:36:41 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Another data node log:
2014-10-21 19:49:28 [ndbd] INFO     -- Node started
2014-10-22 13:36:41 [ndbd] INFO     -- findNeighbours from: 4882 old (left: 3 right: 3) new (65535 65535)
2014-10-22 13:36:41 [ndbd] ALERT    -- Network partitioning - arbitration required
2014-10-22 13:36:41 [ndbd] INFO     -- President restarts arbitration thread [state=7]
2014-10-22 13:36:41 [ndbd] ALERT    -- Arbitration won - positive reply from node 1
start_resend(0, empty bucket (8977501/5 8977501/4) -> active
Finished with handling node-failure
execGCP_NOMORETRANS(8977502/0) c_ongoing_take_over_cnt -> seize
completing gcp 8977502/0 in execTAKE_OVERTCCONF
2014-10-22 13:36:41 [ndbd] INFO     -- Started arbitrator node 1 [ticket=0bdc00245b30cd91]

How to repeat:

Config.ini
[ndbd default] #
LockPagesInMainMemory=1
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
#RealTimeScheduler=1
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
NoOfReplicas=2
DataMemory=8G
NoOfFragmentLogFiles= 300  #6 * DataMemory (MB)/(4 * FragmentLogFileSize)
#FragmentLogFileSize=256M
IndexMemory=1024M
MaxNoOfTables=1800
MaxNoOfUniqueHashIndexes=3000
MaxNoOfAttributes=50000
MaxNoOfOrderedIndexes=16000
TransactionDeadlockDetectionTimeout=300000
MaxBufferedEpochs=100000
MaxNoOfTriggers=2000
MaxNoOfExecutionThreads=4  #The number of cores on the data node hosts.
MaxNoOfConcurrentScans=450  #Default 256 max 500
MaxNoOfConcurrentTransactions=40960 #The default value is 4096
MaxNoOfConcurrentOperations=327680 #The default value is 32768
#LongMessageBuffer=12M
NoOfFragmentLogParts=8 #The value must be an even multiple of 4 between 4 and 16
ODirect=1
[tcp default] #tcp 
#portnumber=2202
SendBufferMemory=20M
ReceiveBufferMemory=20M

[ndb_mgmd]
hostname=10.163.239.170
datadir=/var/lib/mysql-cluster

[ndb_mgmd]
hostname=10.163.192.124
datadir=/var/lib/mysql-cluster

[ndbd]
hostname=10.163.223.191
datadir=/var/lib/mysql-cluster

[ndbd]
hostname=10.163.231.72
datadir=/var/lib/mysql-cluster
[ndbd]
hostname=10.163.231.72
datadir=/var/lib/mysql-cluster

[mysqld]
hostname=10.163.223.191
[mysqld]
hostname=10.163.231.72
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[mysqld]
[30 Oct 2014 3:57] ray liu
Crashed data node start Log:
[root@AY14062610010149880bZ ~]# ndbd
2014-10-28 16:30:51 [ndbd] INFO     -- Angel connected to '10.163.239.170:1186'
2014-10-28 16:30:51 [ndbd] INFO     -- Angel allocated nodeid: 3
Manage Node report:
ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

[root@AY14062610010149880bZ ~]# tail -f /var/lib/mysql-cluster/ndb_3_out.log 
2014-10-28 16:30:51 [ndbd] INFO     -- Angel pid: 7601 started child: 7602
2014-10-28 16:30:51 [ndbd] INFO     -- Configuration fetched from '10.163.239.170:1186', generation: 1
NDBMT: non-mt
2014-10-28 16:30:51 [ndbd] INFO     -- NDB Cluster -- DB node 3
2014-10-28 16:30:51 [ndbd] INFO     -- mysql-5.6.17 ndb-7.3.5 --
2014-10-28 16:30:51 [ndbd] INFO     -- numa_set_interleave_mask(numa_all_nodes) : OK
2014-10-28 16:30:51 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 9136Mb initial: 9264Mb
Adding 40Mb to ZONE_LO (1,1279)
Instantiating DBSPJ instanceNo=0
Successfully set real-time prio on tid = 7604
Started thread, index = 0, id = 7604, type = SocketClientThread
Successfully set real-time prio on tid = 7603
Started thread, index = 1, id = 7603, type = WatchDogThread
Successfully set real-time prio on tid = 7605
Started thread, index = 2, id = 7605, type = SocketServerThread
2014-10-28 16:30:51 [ndbd] INFO     -- Start initiated (mysql-5.6.17 ndb-7.3.5)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 4, id = 7606, type = NdbfsThread
Adding 8152Mb to ZONE_LO (1280,260863)
Adding 1073Mb to ZONE_LO (262145,34311)
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 5, id = 7607, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 6, id = 7608, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 7, id = 7609, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 8, id = 7610, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 9, id = 7611, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 10, id = 7612, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 11, id = 7613, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 12, id = 7614, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 13, id = 7615, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 14, id = 7616, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 15, id = 7617, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 16, id = 7618, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 17, id = 7619, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 18, id = 7620, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 19, id = 7621, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 20, id = 7622, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 21, id = 7623, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 22, id = 7624, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 23, id = 7625, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 24, id = 7626, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 25, id = 7627, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 26, id = 7628, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 27, id = 7629, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 28, id = 7630, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 29, id = 7631, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 30, id = 7632, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 31, id = 7633, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 32, id = 7634, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 33, id = 7635, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 34, id = 7636, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 35, id = 7637, type = NdbfsThread
NDBFS/AsyncFile: Allocating 310392 for In/Deflate buffer
Started thread, index = 36, id = 7638, type = NdbfsThread
Started thread, index = 37, id = 7639, type = NdbfsThread
Started thread, index = 38, id = 7640, type = NdbfsThread
WOPool::init(61, 9)
RWPool::init(22, 14)
RWPool::init(42, 75)
LCPFragWatchdog : WarnElapsed : 20000(ms) MaxElapsed 60000(ms) : period millis : 10000
blockSz: 800, wpp: 8188 -> 624 (76)
RWPool::init(29, 624)
RWPool::init(42, 16)
RWPool::init(62, 12)
RWPool::init(42, 83)
2014-10-28 16:30:55 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2014-10-28 16:30:55 [ndbd] INFO     -- Watchdog: User time: 32  System time: 315
RWPool::init(c2, 18)
RWPool::init(e2, 16)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 54)
WOPool::init(21, 10)
blockSz: 1024, wpp: 8188 -> 816 (28)
RWPool::init(68, 816)

2014-10-28 16:30:55 [ndbd] INFO     -- Start phase 0 completed
2014-10-28 16:30:55 [ndbd] INFO     -- Using locked memory
2014-10-28 16:30:55 [ndbd] INFO     -- findNeighbours from: 2191 old (left: 65535 right: 65535) new (4 4)
Dbspj::execSTTOR() inst:0 phase=1
2014-10-28 16:30:55 [ndbd] INFO     -- Start phase 1 completed
2014-10-28 16:30:55 [ndbd] INFO     -- Start phase 2 completed
2014-10-28 16:30:56 [ndbd] INFO     -- Start phase 3 completed
Dbspj::execSTTOR() inst:0 phase=4
2014-10-28 16:30:56 [ndbd] INFO     -- Start phase 4 completed
2014-10-28 16:30:56 [ndbd] INFO     -- Receive arbitrator node 1 [ticket=0bdc00245b30cd91]
Found pending trans (0) - aborting
abort create 501
abort create 502
....
abort create 1032
restartCreateObj(1) file: 1
restartCreateObj(20) file: 1
...
restartCreateObj(1065) file: 1
restartDropObj(219)
2014-10-28 16:31:13 [ndbd] INFO     -- Received signal 11. Running error handler.
2014-10-28 16:31:13 [ndbd] INFO     -- Signal 11 received; Segmentation fault
2014-10-28 16:31:13 [ndbd] INFO     -- /export/home/pb2/build/sb_0-11918329-1396368634.9/mysql-cluster-gpl-7.3.5/storage/ndb/src/kernel/ndbd.cpp
2014-10-28 16:31:13 [ndbd] INFO     -- Error handler signal shutting down system
2014-10-28 16:31:13 [ndbd] INFO     -- Error handler shutdown completed - exiting
2014-10-28 16:31:14 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[3 Nov 2014 6:57] ray liu
ndb_error_report

Attachment: ndb_error_report_20141103145231.tar.bz2 (application/octet-stream, text), 886.02 KiB.

[7 Jul 2015 12:48] Mikael Ronström
Crash is in Dbdict.cpp in a restart situation.
[3 Nov 2015 17:51] MySQL Verification Team
Hi Ray,

Crash is in Dbdict.cpp in a restart situation, but I cannot reproduce this unless I corrupt (manually). Are you sure you don't have any issues with storage subsystem?

Have you reproduced this problem after it first happened (the solution to the problem is to start failing datanode with --initial)?

Best regards
Bogdan Kecman