Description:
Hello,
Our cluster have 4 data nodes.
It crashed a few hours ago and we are unable to restart it.
There was an alter table with reorganize hanging for a long time. It made the datanode 2 to die:
2019-11-17 11:21:53 [ndbd] INFO -- part: 0 : time to complete: 1
2019-11-17 11:21:55 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:22:21 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:23:00 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:23:03 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:23:22 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:24:16 [ndbd] INFO -- part: 0 : time to complete: 1
2019-11-17 11:24:31 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:24:33 [ndbd] INFO -- part: 1 : time to complete: 1 - Repeated 2 times
2019-11-17 11:24:52 [ndbd] INFO -- part: 0 : time to complete: 1
2019-11-17 11:24:54 [ndbd] INFO -- part: 1 : time to complete: 1
2019-11-17 11:26:52 [ndbd] INFO -- part: 0 : time to complete: 1
2019-11-17 11:27:23 [ndbd] INFO -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp
2019-11-17 11:27:23 [ndbd] INFO -- DBDICT (Line: 31898) 0x00000004 Check false failed
2019-11-17 11:27:23 [ndbd] INFO -- Error handler restarting system
2019-11-17 11:27:23 [ndbd] INFO -- Error handler shutdown completed - exiting
2019-11-17 11:27:23 [ndbd] ALERT -- Node 12: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2019-11-17 11:27:23 [ndbd] INFO -- Ndb has terminated (pid 60890) restarting
2019-11-17 11:27:23 [ndbd] INFO -- Angel reconnected to '172.20.11.21:1186'
When I saw the node dead, I launched a backup just in case. It made all the cluster to die after node 11 failure:
2019-11-17 11:27:23 [ndbd] INFO -- start_resend(1, DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 12 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2019-11-17 11:27:23 [ndbd] INFO -- NR Status: node=12,OLD=Initial state,NEW=Node failed, fail handling ongoing
empty bucket (20042972/1 20042972/0) -> active
2019-11-17 11:27:23 [ndbd] INFO -- Master takeover started from 12
2019-11-17 11:27:23 [ndbd] INFO -- GCP Monitor: Computed max GCP_COMMIT lag to 54 seconds
2019-11-17 11:27:23 [ndbd] INFO -- GCP Monitor: Computed max GCP_SAVE lag to 144 seconds
2019-11-17 11:27:23 [ndbd] INFO -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 12 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION.
2019-11-17 11:27:23 [ndbd] INFO -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 12 waiting for NF_TAKEOVER.
2019-11-17 11:27:23 [ndbd] INFO -- DBTC 0: Completed take over of failed node 12
2019-11-17 11:27:23 [ndbd] INFO -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 12 complete.
2019-11-17 11:27:24 [ndbd] INFO -- Adjusting disk write speed bounds due to : Node restart ongoing
2019-11-17 11:29:20 [ndbd] INFO -- Adjusting disk write speed bounds due to : Node restart finished
2019-11-17 11:49:41 [ndbd] INFO -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2019-11-17 11:49:41 [ndbd] INFO -- BACKUP (Line: 6841) 0x00000000 Check conf->reorgFlag == 0 failed
2019-11-17 11:49:41 [ndbd] INFO -- Error handler restarting system
2019-11-17 11:49:41 [ndbd] INFO -- Error handler shutdown completed - exiting
2019-11-17 11:49:42 [ndbd] ALERT -- Node 11: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2019-11-17 11:49:42 [ndbd] INFO -- Ndb has terminated (pid 3873) restarting
2019-11-17 11:49:42 [ndbd] INFO -- Angel reconnected to '172.20.11.21:1186'
2019-11-17 11:49:45 [ndbd] INFO -- Angel reallocated nodeid: 11
2019-11-17 11:49:45 [ndbd] INFO -- Angel pid: 54760 started child: 36625
2019-11-17 11:49:45 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
2019-11-17 11:49:45 [ndbd] INFO -- Configuration fetched from '172.20.11.21:1186', generation: 4
2019-11-17 11:49:45 [ndbd] INFO -- Changing directory to '/home/cluster/'
ThreadConfig: input: main,ldm,ldm,ldm,ldm,recv,rep LockExecuteThreadToCPU: => parsed: main,ldm,ldm,ldm,ldm,recv,rep
2019-11-17 11:49:45 [ndbd] INFO -- MaxNoOfTriggers set to 30272
NDBMT: MaxNoOfExecutionThreads=7
NDBMT: workers=4 threads=4 tc=0 send=0 receive=1
2019-11-17 11:49:45 [ndbd] INFO -- NDB Cluster -- DB node 11
2019-11-17 11:49:45 [ndbd] INFO -- mysql-5.7.23 ndb-7.6.7 --
2019-11-17 11:49:45 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2019-11-17 11:49:45 [ndbd] INFO -- Memory Allocation for global memory pools Starting
I tried to reload cluster with node 11,13,14 but it fails at this point:
2019-11-17 11:50:31 [ndbd] INFO -- Restart recreating table with id = 562
2019-11-17 11:50:31 [ndbd] INFO -- Restart recreating table with id = 565
2019-11-17 11:50:31 [ndbd] INFO -- Restart recreating table with id = 569
2019-11-17 11:50:31 [ndbd] INFO -- Restart recreating table with id = 572
2019-11-17 11:50:31 [ndbd] INFO -- Restart recreating table with id = 576
2019-11-17 11:50:31 [ndbd] INFO -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp
2019-11-17 11:50:31 [ndbd] INFO -- DBDIH (Line: 24839) 0x00000000 Check fragPtr.p->noStoredReplicas > 0 failed
2019-11-17 11:50:31 [ndbd] INFO -- Error handler restarting system
2019-11-17 11:50:31 [ndbd] INFO -- Error handler shutdown completed - exiting
2019-11-17 11:50:32 [ndbd] INFO -- Angel detected startup failure, count: 1
2019-11-17 11:50:32 [ndbd] ALERT -- Node 11: Forced node shutdown completed. Occured during startphase 4. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Since all my business is stuck, I don't care if I loose a few data. Is there a way to relaunch cluster by ignoring missing fragment?
How to repeat:
.