Bug #49142 | ndbmtd dies while starting phase 5 after drop_caches | ||
---|---|---|---|
Submitted: | 26 Nov 2009 17:28 | Modified: | 30 Dec 2009 20:38 |
Reporter: | Robert Klikics | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-7.0 | OS: | Linux (Debian Etch) |
Assigned to: | Gustaf Thorslund | CPU Architecture: | Any |
Tags: | drop_caches, ndbmtd, telco-7.0.9b |
[26 Nov 2009 17:28]
Robert Klikics
[30 Nov 2009 17:11]
Hartmut Holzgraefe
Raising TimeBetweenEpochsTimeout might help here. In general it is not a good idea to cause extra disk I/O, especially while a restart is going on, as the extra I/O activity may keep global checkpoint flushes from finishing in time and so cause "GCP stop" node failures.
[30 Nov 2009 17:37]
Robert Klikics
Hi Mr. Holzgraefe, thanks for your advice, but if I understood TimeBetweenEpochsTimeout correctly, it is used for MySQL Cluster replication, which we do not use. Sincerely, Martin
[7 Dec 2009 12:36]
Hartmut Holzgraefe
The TimeBetweenEpochsTimeout parameter was added to improve replication handling, but it is always in effect, even when no transaction log is written by any mysqld node in the cluster. So raising this timeout parameter *may* help to work around the problem. The main problem is still the extra I/O activity caused by the cache purge. The relaxed timeout setting would help, though, *if* this is the timeout you are running into, which the trace logs seem to indicate.
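As a rough sketch of what raising the timeout could look like, the fragment below sets TimeBetweenEpochsTimeout for all data nodes in the cluster's config.ini. The value shown is an assumption chosen purely for illustration (it is not taken from this report or from the trace logs), and the use of the `[ndbd default]` section assumes the setting should apply to every data node.

```ini
[ndbd default]
# Illustrative value only, in milliseconds; not a recommendation from this report.
# A larger value gives epochs more time to complete before a "GCP stop" is declared.
TimeBetweenEpochsTimeout=32000
```

This only relaxes the timeout. It does not remove the extra I/O caused by purging the caches during a restart.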
[7 Dec 2009 14:36]
Robert Klikics
Dear Mr. Holzgraefe, what should I say. We've dropped the CompressedLCP option, because we had the feeling that the cluster runs unstably with this option in conjunction with ndbmtd. The cluster's load average also became 5 times higher after we activated CompressedLCP. We left the TimeBetween* values untouched, because we will not drop the caches again during the startup phase. Is there any documentation on how to read the trace files? Thanks in advance, Martin P.
[30 Dec 2009 20:38]
Andrew Hutchings
Hello Robert, CompressedLCP incurs a high CPU overhead (to perform the compression) and in tests actually causes the LCP to run slower. There is no written documentation on how to read the trace files, and it takes a long time to teach. You also need a good understanding of the source to be able to interpret them.