Bug #92220 | MySQL Cluster datanode stops writing LCPs when its companion dies | ||
---|---|---|---|
Submitted: | 29 Aug 2018 7:26 | Modified: | 12 Dec 2018 16:05 |
Reporter: | Hendrik Woltersdorf | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | mysql-5.6.41 ndb-7.4.21 | OS: | CentOS |
Assigned to: | | CPU Architecture: | x86 |
[29 Aug 2018 7:26]
Hendrik Woltersdorf
[29 Aug 2018 7:27]
Hendrik Woltersdorf
ndb_error_reporter files
Attachment: ndb_error_report_20180829091058.tar.bz2 (application/octet-stream, text), 1.25 MiB.
[29 Aug 2018 7:28]
Hendrik Woltersdorf
We saw this first at 2018-08-16 16:39:12 in version 7.4.20 and again at 2018-08-23 15:00:08 in version 7.4.21.
[17 Sep 2018 15:23]
MySQL Verification Team
Hi, I'm having trouble reproducing this. If I understand correctly, the HDD died on one node and the other node (you have only 2 data nodes) stopped writing LCPs? Weird. I just tried this on 2 7.4.2 clusters I have running and they continued working normally, but now I notice you are using CompressedLCP, so I'll try that next. How did you notice that the surviving node is not writing LCPs? Lack of I/O utilization (no HDD LED blinking :D), or were you inspecting the logs? Kind regards, Bogdan
[25 Sep 2018 7:51]
Hendrik Woltersdorf
The HDD died in a RAID controlled by an HP controller. The operating system did not notice anything, but LCP write performance was severely degraded. I found the issue on the surviving node by inspecting the log (no LCP entries) and afterwards checking the timestamps of the files under the LCP directory and its subdirectories. Regards, Hendrik Woltersdorf
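The timestamp check described above could be automated roughly as follows. This is only a minimal sketch, not an official NDB tool: the function names `newest_mtime` and `lcp_is_stale`, the age threshold, and the directory path passed in are all assumptions chosen for illustration. It walks a directory tree and reports whether the newest file is older than a given age.

```python
import os
import time


def newest_mtime(root):
    """Return the newest file modification time (epoch seconds) under root, or None if no files exist."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.path.getmtime(os.path.join(dirpath, name))
            except OSError:
                continue  # file disappeared between listing and stat
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def lcp_is_stale(root, max_age_seconds, now=None):
    """Return True if no file under root was modified within the last max_age_seconds.

    `root` is assumed to be the LCP directory of a data node; the caller supplies
    the real path. An empty tree is treated as stale.
    """
    now = time.time() if now is None else now
    newest = newest_mtime(root)
    return newest is None or (now - newest) > max_age_seconds
```

For example, `lcp_is_stale(lcp_dir, 3600)` would flag a node whose LCP files have not changed for an hour, where `lcp_dir` is whatever LCP directory the data node actually uses. A suitable threshold depends on how often LCPs normally complete on the cluster in question.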
[25 Sep 2018 8:50]
MySQL Verification Team
Hi, well, the crash makes sense (and is expected) in such a situation. I assume that after you fixed the hardware, the software issue fixed itself :) All best, Bogdan
[25 Sep 2018 10:25]
Hendrik Woltersdorf
It is expected that the datanode on the machine with the dead HDD shuts itself down, but not that the surviving datanode stops writing LCPs, because this prevents the shut-down datanode from being restarted. I have to stop the surviving datanode to be able to start the shut-down datanode. This causes downtime of the whole cluster, which is the opposite of High Availability. This bug should therefore be reopened.
[25 Sep 2018 18:35]
MySQL Verification Team
Yes, I agree with you, sorry, reopened :) Now let me see if I can reproduce this. All best, Bogdan
[26 Sep 2018 4:56]
MySQL Verification Team
Hi, I'm unable to reproduce this. I slowed down the I/O on one node to the point that it crashed, but the other node continued to work normally. I repeated the test case numerous times, every time without a problem. Kind regards, Bogdan
[26 Sep 2018 17:36]
MySQL Verification Team
Confirmed that we have seen this on another system. Trying to reproduce...
[12 Dec 2018 16:05]
MySQL Verification Team
Verified, as we are seeing this in a second system, but I'm still unable to reproduce this myself in a controlled environment! Bogdan