Bug #92220 MySQL Cluster datanode stops writing LCPs when its companion dies
Submitted: 29 Aug 2018 7:26
Modified: 12 Dec 2018 16:05
Reporter: Hendrik Woltersdorf
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.6.41 ndb-7.4.21
OS: CentOS
Assigned to:
CPU Architecture: x86

[29 Aug 2018 7:26] Hendrik Woltersdorf
On one datanode, an HDD in the HP RAID array died.
This caused the datanode (ndbd) to perform a forced shutdown:
Caused by error 7200: LCP fragment scan watchdog detected a problem. ...
At this point the other, surviving datanode stopped writing LCPs.
We had to shut down and restart the surviving datanode, causing downtime, to get it writing LCPs again.
When we tried to restart the datanode that had died before restarting the surviving datanode, the start did not finish, because none of the datanodes wrote an LCP (and therefore no GCP). The start process stopped at:
"Make On-line Database recoverable by waiting for LCP Starting"

How to repeat:
Cause a situation where LCPs cannot be written to disk.
[29 Aug 2018 7:27] Hendrik Woltersdorf
ndb_error_reporter files

Attachment: ndb_error_report_20180829091058.tar.bz2 (application/octet-stream, text), 1.25 MiB.

[29 Aug 2018 7:28] Hendrik Woltersdorf
We saw this first at 2018-08-16 16:39:12 in version 7.4.20
and again at 2018-08-23 15:00:08 in version 7.4.21.
[17 Sep 2018 15:23] MySQL Verification Team

I'm having trouble reproducing this. If I understand correctly, an HDD died on one node and the other node (you have only 2 data nodes) stopped writing LCPs? Weird. I just tried this on two 7.4.2 clusters I have running and they continued working normally, but I now notice you are using CompressedLCP, so I'll try that next.

How did you notice that the surviving node was not writing LCPs? Lack of I/O utilization (no HDD LED blinking :D), or were you inspecting the logs, or something else?

kind regards
[25 Sep 2018 7:51] Hendrik Woltersdorf
The HDD died in a RAID array controlled by an HP controller. The operating system did not notice anything, but the performance of writing LCPs was severely degraded.
I found the issue on the surviving node by inspecting the log (no LCP entries) and afterwards checking the timestamps of the files under the LCP directory and its subdirectories.
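
The timestamp check described above could be scripted roughly as follows. This is a minimal sketch, not something taken from the report: the function names, the age threshold, and the directory path passed in are all assumptions for illustration, and the caller has to supply the actual LCP directory of the data node.

```python
import os
import time
from typing import Optional, Tuple


def newest_mtime(root: str) -> Optional[Tuple[str, float]]:
    """Return (path, mtime) of the most recently modified file under root.

    Returns None if no regular file is found. `root` is whatever LCP
    directory the caller points at (hypothetical; not taken from the report).
    """
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def lcp_looks_stalled(root: str, max_age_seconds: float,
                      now: Optional[float] = None) -> bool:
    """Heuristic: True if no file under root was modified recently.

    `max_age_seconds` is an assumed threshold; pick one well above the
    normal interval between LCPs on the cluster in question.
    """
    now = time.time() if now is None else now
    newest = newest_mtime(root)
    if newest is None:
        return True
    return (now - newest[1]) > max_age_seconds
```

A stale newest timestamp is only a hint that LCP writing may have stopped; it should be cross-checked against the node log, as was done here.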

Hendrik Woltersdorf
[25 Sep 2018 8:50] MySQL Verification Team

Well, the crash makes sense (and is expected) in such a situation.

I assume that after you fixed the hardware, the software issue fixed itself :)

all best
[25 Sep 2018 10:25] Hendrik Woltersdorf
It is expected that the datanode on the machine with the dead HDD shuts itself down, but not that the surviving datanode stops writing LCPs, because this prevents the shut-down datanode from being restarted. I have to stop the surviving datanode to be able to start the shut-down datanode. This causes a downtime of the whole cluster, which is the opposite of High Availability.
This bug should therefore be reopened.
[25 Sep 2018 18:35] MySQL Verification Team
Yes, I agree with you. Sorry, reopened :)

Now let me see if I can reproduce this.

all best
[26 Sep 2018 4:56] MySQL Verification Team

I'm unable to reproduce this. I slowed down the I/O on one node to the point where it crashed, but the other node continued to work normally.

Repeated the test case numerous times, every time without a problem.

kind regards
[26 Sep 2018 17:36] MySQL Verification Team
Confirmed that we have seen this on another system. Trying to reproduce...
[12 Dec 2018 16:05] MySQL Verification Team
Verified, as we are seeing this on a second system, but I'm still unable to reproduce it myself in a controlled environment!