Bug #67609 | NDB on high wait IO CPU load => complete service crash | |
---|---|---|---
Submitted: | 16 Nov 2012 13:32 | Modified: | 25 Feb 2013 9:26 |
Reporter: | Gerald Degn | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | mysql-5.5.27 ndb-7.2.8 | OS: | Linux (RHEL 6.3) |
Assigned to: | Daniel Smythe | CPU Architecture: | Any |
[16 Nov 2012 13:32]
Gerald Degn
[16 Nov 2012 13:35]
Gerald Degn
output of ndb_error_reporter
Attachment: ndb_error_report_20121116142131.tar.bz2 (application/octet-stream, text), 444.15 KiB.
[16 Nov 2012 13:36]
Gerald Degn
Output of top, showing 6 CPUs at 100% I/O wait (wa) load
Attachment: cpu_top.txt (text/plain), 7.94 KiB.
[24 Jan 2013 19:41]
Daniel Smythe
Hi, I don't think there is a problem here. What kind of disk hardware is on the ndbmtd nodes?

I'm seeing around 4-5GB of DataMemory used out of the roughly 10GB configured, but you have TimeBetweenLocalCheckpoints = 31, which corresponds to 8GB of write activity before a local checkpoint is started ( http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-ti... ). That explains the variable amount of time between local checkpoints. Then, when a local checkpoint does start, I think your disk is saturated by the writing. That is why I ask about the disk hardware: if it cannot reliably write out DataMemory, things will get slow regardless of how long you wait between local checkpoints.

You can also correlate the high CPU I/O wait time with the cluster logs by setting ndb_mgm -e 'ALL CLUSTERLOG CHECKPOINT=15'; then we will see in ndb_1_cluster.log when each LCP starts and ends.
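For reference, a minimal sketch of the relevant configuration and the logging command mentioned above. The DataMemory value and the section layout are illustrative, not taken from the attached config:

```ini
# config.ini, [ndbd default] section -- values here are illustrative.
[ndbd default]
DataMemory=10G
# TimeBetweenLocalCheckpoints is the base-2 logarithm of the number of 4-byte
# words of write activity required before a new LCP is started:
#   threshold = 4 bytes * 2^value
# Default 20 -> 4 MB of writes; the maximum 31 -> 8 GB of writes.
TimeBetweenLocalCheckpoints=20
```

```sh
# Raise the CHECKPOINT log level so LCP start/stop events appear in
# ndb_1_cluster.log, as suggested above.
ndb_mgm -e 'ALL CLUSTERLOG CHECKPOINT=15'
```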
[25 Feb 2013 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[25 Feb 2013 9:26]
Gerald Degn
Hi, sorry for the long delay with feedback. First, thanks for the feedback and the hints to check the settings. It looks like you're right that the issue was caused by checkpoints. We made a second installation where we left the checkpoint parameters at their defaults, and could not reproduce the issue anymore. We will investigate the checkpoint parameters further to find optimal settings for our installation. I will close this ticket. Regards, Gerald