MySQL Bugs: #92220: Mysql Cluster datanode stops writing LCPs when its companion dies

Bug #92220	Mysql Cluster datanode stops writing LCPs when its companion dies
Submitted:	29 Aug 2018 7:26	Modified:	12 Dec 2018 16:05
Reporter:	Hendrik Woltersdorf
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.6.41 ndb-7.4.21	OS:	CentOS
Assigned to:		CPU Architecture:	x86

Description:
On one datanode an HDD in the HP RAID array died.
This caused the datanode (ndbd) to do a forced shutdown:
Caused by error 7200: LCP fragment scan watchdog detected a problem. ...
At this point the other, surviving datanode stopped writing LCPs.
We had to shut down and restart the surviving datanode, causing a downtime, to get it writing LCPs again.
When we tried to restart the datanode that had died before restarting the surviving datanode, the start did not finish, because none of the datanodes wrote an LCP (and therefore no GCP). The start process stopped at:
"Make On-line Database recoverable by waiting for LCP Starting"

How to repeat:
Cause a situation where LCPs cannot be written to disk.

ndb_error_reporter files

Attachment: ndb_error_report_20180829091058.tar.bz2 (application/octet-stream, text), 1.25 MiB.

We saw this first at 2018-08-16 16:39:12 in version 7.4.20
and again at 2018-08-23 15:00:08 in version 7.4.21.

Hi,

I'm having issues reproducing this. If I understand correctly, the HDD died on one node and the other node (you have only 2 data nodes) stopped writing LCPs? Weird. I just tried this on two 7.4.2 clusters I have running and they continued working normally, but now I notice you are using CompressedLCP, so I'll try that next.

How did you notice that the surviving node was not writing LCPs? Lack of I/O utilization (no HDD LED blinking :D), or were you inspecting the logs, or something else?

kind regards
bogdan

The HDD died in a RAID controlled by an HP controller. The operating system did not notice anything, but the performance of writing LCPs was severely degraded.
I found the issue on the surviving node by inspecting the log (no LCP entries) and afterwards checking the timestamps of the files under the LCP directory and its subdirectories.

regards,
Hendrik Woltersdorf
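
The timestamp check described above (looking at the files under the LCP directory and its subdirectories) could be automated along these lines. This is only a minimal sketch, not something taken from the report: the function names `newest_mtime` and `lcp_is_stale`, the directory path passed in, and the age threshold are all hypothetical, and a sensible threshold depends on how often LCPs complete on the system in question.

```python
import os
import time


def newest_mtime(root):
    """Return the latest modification time (epoch seconds) of any file
    under root, or None if no files are found."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # file disappeared between listing and stat
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def lcp_is_stale(root, max_age_seconds, now=None):
    """Return True if no file under root was modified within the last
    max_age_seconds (hypothetical threshold; tune it per system)."""
    if now is None:
        now = time.time()
    newest = newest_mtime(root)
    return newest is None or (now - newest) > max_age_seconds
```

Calling `lcp_is_stale(path_to_lcp_directory, 3600)` from a monitoring job would flag a node whose LCP files have not changed for an hour. An empty or missing directory is also reported as stale, which is a deliberate choice in this sketch so that a wrong path does not look healthy.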

Hi,

Well, the crash makes sense (and is expected) in such a situation.

I assume that after you fixed the hardware, the software issue fixed itself :)

all best
Bogdan

That the datanode on the machine with the dead HDD shuts itself down is expected, but not that the surviving datanode stops writing LCPs, because this prevents the shut-down datanode from being restarted. I have to stop the surviving datanode to be able to start the shut-down datanode. This causes a downtime of the whole cluster, which is the opposite of High Availability.
This bug should therefore be reopened.

yes, I agree with you, sorry, reopened :)

now lemme see if I can reproduce this

all best
Bogdan

Hi,

I'm unable to reproduce this. I slowed down the IO on the one node to the point it crashed but the other node continued to work normally. 

Repeated the test case numerous times, every time without a problem.

kind regards
Bogdan

Confirmed that we have seen this on another system. Trying to reproduce...

Verified, as we are seeing this on a second system, but I'm still unable to reproduce it myself in a controlled environment!

Bogdan