Bug #97336 Node failure
Submitted: 23 Oct 2019 3:05 Modified: 25 Oct 2019 13:12
Reporter: Alex West Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:7.6.7 OS:CentOS (7)
Assigned to: MySQL Verification Team CPU Architecture:x86

[23 Oct 2019 3:05] Alex West
Description:
Hello,

I had a node crash this morning. It had the following output on node:
2019-10-23 08:01:04 [ndbd] INFO     -- dataLen: 87, op.maxRecordSize = 19, header: 65623
2019-10-23 08:01:04 [ndbd] INFO     -- /export/home/pb2/build/sb_0-29707995-1532148374.74/rpm/BUILD/mysql-cluster-gpl-7.6.7/mysql-cluster-gpl-7.6.7/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2019-10-23 08:01:04 [ndbd] INFO     -- BACKUP (Line: 7928) 0x00000004 Check false failed
2019-10-23 08:01:04 [ndbd] INFO     -- Error handler restarting system
2019-10-23 08:01:04 [ndbd] INFO     -- Error handler shutdown completed - exiting
2019-10-23 08:01:05 [ndbd] ALERT    -- Node 12: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2019-10-23 08:01:05 [ndbd] INFO     -- Ndb has terminated (pid 48933) restarting

The 2 Data nodes were directly connected (lacp trunk) until yesterday but I moved them to a switch (HP) with lacp correctly set. So the faulty node was resynced yesterday.
The reason I changed from directly connected to switch is because I wish to add 2 mode data nodes this week. This was the first step of work.
The last log before the problem is the following:
2019-10-22 23:49:29 [ndbd] INFO     -- Start phase 101 completed
2019-10-22 23:49:29 [ndbd] INFO     -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2019-10-22 23:49:29 [ndbd] INFO     -- Node started
2019-10-22 23:49:29 [ndbd] INFO     -- GCP Monitor: Computed max GCP_COMMIT lag to 47 seconds
2019-10-22 23:49:29 [ndbd] INFO     -- GCP Monitor: Computed max GCP_SAVE lag to 137 seconds
2019-10-23 01:24:46 [ndbd] INFO     -- LCP Scan have stalled for 0 seconds, lastlast seen on line 13406, check_lcp_stop_count: 0
2019-10-23 01:32:26 [ndbd] INFO     -- DROP_TAB_REQ: tab: 587, tabLcpStatus: 3
2019-10-23 01:32:26 [ndbd] INFO     -- DROP_TAB_REQ: tab: 586, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 259, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 369, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 329, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 254, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 351, tabLcpStatus: 3
2019-10-23 01:32:50 [ndbd] INFO     -- DROP_TAB_REQ: tab: 248, tabLcpStatus: 3
2019-10-23 02:05:17 [ndbd] INFO     -- LCP Scan have stalled for 0 seconds, lastlast seen on line 13406, check_lcp_stop_count: 0

I hope anyone can help.

Best Regards

How to repeat:
I do not have an "how to repeat" procedure but my config is as follow:
[NDB_MGMD DEFAULT]
Portnumber=1186

[NDB_MGMD]
NodeId=21
HostName=172.XX.11.21
DataDir=/home/cluster/
Portnumber=1186

[TCP DEFAULT]
SendBufferMemory=4M
ReceiveBufferMemory=4M

[NDBD DEFAULT]
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15
DataMemory=80G
#IndexMemory=20G
MaxNoOfTables=4096
MaxNoOfTriggers=3500
NoOfReplicas=2
StringMemory=25
DiskPageBufferMemory=64M
SharedGlobalMemory=500M
LongMessageBuffer=32M
MaxNoOfConcurrentTransactions=16384
MaxNoOfConcurrentOperations=64000
BatchSizePerLocalScan=512
FragmentLogFileSize=256M
NoOfFragmentLogFiles=455
RedoBuffer=32M
MaxNoOfExecutionThreads=12
StopOnError=false
LockPagesInMainMemory=1
TimeBetweenEpochsTimeout=32000
TimeBetweenWatchdogCheckInitial=60000
TransactionInactiveTimeout=60000
HeartbeatIntervalDbDb=1500
HeartbeatIntervalDbApi=1500
MaxNoOfAttributes=5000
StartPartialTimeout=60
StartPartitionedTimeout=60
MaxNoOfOrderedIndexes=5000
MaxNoOfUniqueHashIndexes=128
TransactionDeadlockDetectionTimeout=2000

[NDBD]
NodeId=11
HostName=172.XX.11.11
DataDir=/home/cluster/
ServerPort=50501

[NDBD]
NodeId=12
HostName=172.XX.11.12
DataDir=/home/cluster/
ServerPort=50502

[MYSQLD DEFAULT]

[MYSQLD]
NodeId=31
HostName=172.XX.11.31
[MYSQLD]
NodeId=32
HostName=172.XX.11.32
[MYSQLD]
NodeId=33
HostName=172.XX.11.33
[MYSQLD]
NodeId=34
HostName=172.XX.11.34

# TCP/IP Connections
[tcp]
NodeId1=11
NodeId2=12
HostName1=10.10.10.1
HostName2=10.10.10.2
[25 Oct 2019 11:37] MySQL Verification Team
Hi,

I cannot reproduce this and without full logs (ndb_error_reporter) it's impossible to say what happened but this looks like improperly sized system and that's out of scope of bugs database. You should check our MySQL Support team or reach out to https://forums.mysql.com/

kind regards
[25 Oct 2019 13:12] Alex West
Thanks for your reply, I can provide you the error export.

I don't understand the "undersized" server, do I miss memory or settings? If there is something to increase, I would be happy to know.
[25 Oct 2019 13:20] MySQL Verification Team
Hi,

> Thanks for your reply, I can provide you the error export.

Feel free to.

> I don't understand the "undersized" server, do I miss memory or settings? If there is something to increase, I would be happy to know.

MCCGE is designed in a way to "crash" if it cannot produce expected/reproducible results, that's why it is a "real-time" system. That's why there is a number of those MaxNoOf* options. Properly "sizing" the cluster requires knowledge of your system and your application and is something our Support Team can help you with.

all best