MySQL Bugs: #101352: NDB Data node start fail

Bug #101352	NDB Data node start fail
Submitted:	28 Oct 2020 8:06	Modified:	16 Nov 2020 7:31
Reporter:	Jinho Choi
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.0.21	OS:	Other (Amazon Linux 2)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86 (aws ec2 r5.xlarge)
Tags:	data node, ndb

Description:
We are running MySQL Cluster (8.0.21) with
3 management nodes, 2 data nodes (disk storage, 1 shard, 2 replicas), and 1 MySQL node.

This Monday, the master data node (node ID 11) shut down with the error message "file system error. start initial".

I tried "ndbmtd --initial", but got the same error.
So I replaced the machine with a new one and tried again.
After about 1 hour, however, the following error was reported.

------------------------
mgm
------------------------
ndb_mgm> Node 11: Forced node shutdown completed. Occurred during startphase 5. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

------------------------
ndb_11_error.log
------------------------
Current byte-offset of file-pointer is: 1067

Time: Thursday 22 October 2020 - 18:57:49
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: ../../../../../mysql-cluster-gpl-8.0.21/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 7541 thr: 2
Version: mysql-8.0.21 ndb-8.0.21
Trace file name: ndb_11_trace.log.1_t2
Trace file path: /usr/local/mysql/my_cluster/ndb_data/ndb_11_trace.log.1 [t1..t7]
***EOM***
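
The error log entry above uses a plain "Key: value" layout terminated by ***EOM***, so its fields can be pulled out mechanically when comparing several failed start attempts. The following is only a minimal sketch: the function name parse_ndb_error_entry is hypothetical, and the layout is assumed from this single entry rather than from any documented format.

```python
def parse_ndb_error_entry(text):
    """Parse one 'Key: value' entry of an NDB error log into a dict.

    Assumption: each field sits on its own line as 'Key: value' and the
    entry ends at a line containing only '***EOM***'.
    """
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            break  # end of this entry
        # Split on the first ': ' only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            entry[key] = value
    return entry


# Sample lines copied from the entry quoted in this report.
sample = """\
Time: Thursday 22 October 2020 - 18:57:49
Status: Temporary error, restart node
Error: 6000
Error data: Signal 6 received; Aborted
Program: ndbmtd
***EOM***
"""

entry = parse_ndb_error_entry(sample)
print(entry["Error"], "-", entry["Error data"])
```

Splitting on the first ": " only keeps values such as the timestamp intact.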

The DB service is still available through the other data node (node ID 12).
What should I do?
How can I revive data node 11?

Should I upgrade to 8.0.22?
Or is this some kind of timeout problem?

How to repeat:
Start the data node again with the --initial option.

ndb_11_out.log

Attachment: ndb_11_out.log (application/octet-stream, text), 96.13 KiB.

config.ini

Attachment: config.ini (application/octet-stream, text), 786 bytes.

Hi,

I cannot reproduce this. It looks to me as though there was originally some data corruption on the node's filesystem; then, when you tried to start with --initial, the network was not stable enough and the node was unable to fetch all of the data from the surviving node. I cannot confirm this, since I can't reproduce it, but from the code and your log this is the only explanation that makes sense to me.

all best
Bogdan

The attached logs are from the new EC2 machine (data node 11), and I did start it with the --initial option.
So there is no corrupted data on data node 11.

Was it a network problem? If I try again, could data node 11 start successfully?

What should I do?
What options do I have left?

Can you tell me what the log says?
It was a new machine, so it is not a data problem.

Is there any timeout setting that applies to starting a data node with --initial?
The node start failed about 1 hour after it began.