MySQL Bugs: #97668: (urgent) Cluster crashed after alter table

Bug #97668	(urgent) Cluster crashed after alter table
Submitted:	17 Nov 2019 6:13	Modified:	19 Nov 2019 13:42
Reporter:	Alex West	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.7.23 ndb-7.6.12	OS:	CentOS
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
Hello,

Our cluster have 4 data nodes.

It crashed a few hours ago and we are unable to restart it.

There was an alter table with reorganize hanging for a long time. It made the datanode 2 to die:
2019-11-17 11:21:53 [ndbd] INFO     -- part: 0 : time to complete: 1
2019-11-17 11:21:55 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:22:21 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:23:00 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:23:03 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:23:22 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:24:16 [ndbd] INFO     -- part: 0 : time to complete: 1
2019-11-17 11:24:31 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:24:33 [ndbd] INFO     -- part: 1 : time to complete: 1 - Repeated 2 times
2019-11-17 11:24:52 [ndbd] INFO     -- part: 0 : time to complete: 1
2019-11-17 11:24:54 [ndbd] INFO     -- part: 1 : time to complete: 1
2019-11-17 11:26:52 [ndbd] INFO     -- part: 0 : time to complete: 1
2019-11-17 11:27:23 [ndbd] INFO     -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/dbdict/Dbdict.cpp
2019-11-17 11:27:23 [ndbd] INFO     -- DBDICT (Line: 31898) 0x00000004 Check false failed
2019-11-17 11:27:23 [ndbd] INFO     -- Error handler restarting system
2019-11-17 11:27:23 [ndbd] INFO     -- Error handler shutdown completed - exiting
2019-11-17 11:27:23 [ndbd] ALERT    -- Node 12: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2019-11-17 11:27:23 [ndbd] INFO     -- Ndb has terminated (pid 60890) restarting
2019-11-17 11:27:23 [ndbd] INFO     -- Angel reconnected to '172.20.11.21:1186'

When I saw the node dead, I launched a backup just in case. It made all the cluster to die after node 11 failure:
2019-11-17 11:27:23 [ndbd] INFO     -- start_resend(1, DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 12 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2019-11-17 11:27:23 [ndbd] INFO     -- NR Status: node=12,OLD=Initial state,NEW=Node failed, fail handling ongoing
empty bucket (20042972/1 20042972/0) -> active
2019-11-17 11:27:23 [ndbd] INFO     -- Master takeover started from 12
2019-11-17 11:27:23 [ndbd] INFO     -- GCP Monitor: Computed max GCP_COMMIT lag to 54 seconds
2019-11-17 11:27:23 [ndbd] INFO     -- GCP Monitor: Computed max GCP_SAVE lag to 144 seconds
2019-11-17 11:27:23 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 12 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION.
2019-11-17 11:27:23 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 12 waiting for NF_TAKEOVER.
2019-11-17 11:27:23 [ndbd] INFO     -- DBTC 0: Completed take over of failed node 12
2019-11-17 11:27:23 [ndbd] INFO     -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 12 complete.
2019-11-17 11:27:24 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart ongoing
2019-11-17 11:29:20 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart finished
2019-11-17 11:49:41 [ndbd] INFO     -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2019-11-17 11:49:41 [ndbd] INFO     -- BACKUP (Line: 6841) 0x00000000 Check conf->reorgFlag == 0 failed
2019-11-17 11:49:41 [ndbd] INFO     -- Error handler restarting system
2019-11-17 11:49:41 [ndbd] INFO     -- Error handler shutdown completed - exiting
2019-11-17 11:49:42 [ndbd] ALERT    -- Node 11: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2019-11-17 11:49:42 [ndbd] INFO     -- Ndb has terminated (pid 3873) restarting
2019-11-17 11:49:42 [ndbd] INFO     -- Angel reconnected to '172.20.11.21:1186'
2019-11-17 11:49:45 [ndbd] INFO     -- Angel reallocated nodeid: 11
2019-11-17 11:49:45 [ndbd] INFO     -- Angel pid: 54760 started child: 36625
2019-11-17 11:49:45 [ndbd] INFO     -- Normal start of data node using checkpoint and log info if existing
2019-11-17 11:49:45 [ndbd] INFO     -- Configuration fetched from '172.20.11.21:1186', generation: 4
2019-11-17 11:49:45 [ndbd] INFO     -- Changing directory to '/home/cluster/'
ThreadConfig: input: main,ldm,ldm,ldm,ldm,recv,rep LockExecuteThreadToCPU:  => parsed: main,ldm,ldm,ldm,ldm,recv,rep
2019-11-17 11:49:45 [ndbd] INFO     -- MaxNoOfTriggers set to 30272
NDBMT: MaxNoOfExecutionThreads=7
NDBMT: workers=4 threads=4 tc=0 send=0 receive=1
2019-11-17 11:49:45 [ndbd] INFO     -- NDB Cluster -- DB node 11
2019-11-17 11:49:45 [ndbd] INFO     -- mysql-5.7.23 ndb-7.6.7 --
2019-11-17 11:49:45 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2019-11-17 11:49:45 [ndbd] INFO     -- Memory Allocation for global memory pools Starting

I tried to reload cluster with node 11,13,14 but it fails at this point:
2019-11-17 11:50:31 [ndbd] INFO     -- Restart recreating table with id = 562
2019-11-17 11:50:31 [ndbd] INFO     -- Restart recreating table with id = 565
2019-11-17 11:50:31 [ndbd] INFO     -- Restart recreating table with id = 569
2019-11-17 11:50:31 [ndbd] INFO     -- Restart recreating table with id = 572
2019-11-17 11:50:31 [ndbd] INFO     -- Restart recreating table with id = 576
2019-11-17 11:50:31 [ndbd] INFO     -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp
2019-11-17 11:50:31 [ndbd] INFO     -- DBDIH (Line: 24839) 0x00000000 Check fragPtr.p->noStoredReplicas > 0 failed
2019-11-17 11:50:31 [ndbd] INFO     -- Error handler restarting system
2019-11-17 11:50:31 [ndbd] INFO     -- Error handler shutdown completed - exiting
2019-11-17 11:50:32 [ndbd] INFO     -- Angel detected startup failure, count: 1
2019-11-17 11:50:32 [ndbd] ALERT    -- Node 11: Forced node shutdown completed. Occured during startphase 4. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Since all my business is stuck, I don't care if I loose a few data. Is there a way to relaunch cluster by ignoring missing fragment?

How to repeat:
.

Hi,

 - for data recovery, please contact our MySQL Support team
 - for post mortem please upload full ndb_error_report and logs from sql nodes too
 - please upgrade to 7.6.12 as you are running ndb-7.6.7 according to this log snippet you sent as if you look at https://dev.mysql.com/doc/relnotes/mysql-cluster/7.6/en/ you can see we fixed a number of bugs since 7.6.7 and you might be hitting one of them

all best

mysql-bug-data-97668.zip uploaded

Hi,

> Is it normal behavior for ndbrequire to shutdown cluster?

yes, that is what ndbrequire does.

I'm having issues reproducing this crash with modern 7.6... looks like you hit some bug from the older version, trying to locate which one. 

Do you have DISK DATA ?

all best

yes, "fs" folders

Hi,

No, I did not mean if you stored your datadirs, the question is do you use tables that are stored on disk:
https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-disk-data.html

but I assume you don't so my first idea is down the drain.

all best

I can't reproduce the alter+backup crash with ndb-7.6.12 (I can with ndb-7.6.7) but I can't find which fix solved this bug so I'd still be weary about running alter during backup, but be sure to update to latest version.