Bug #57322 | TimeBetweenGlobalCheckpoints can be set lower than 3 * HeartbeatIntervalDbDb | ||
---|---|---|---|
Submitted: | 7 Oct 2010 17:07 | Modified: | 20 Oct 2010 12:57 |
Reporter: | Daniel Smythe | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 6.3.26 + | OS: | Any |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | GCP stop, Network Partitioning, TimeBetweenGlobalCheckpoints |
[7 Oct 2010 17:07]
Daniel Smythe
[20 Oct 2010 9:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/121273

3311 Jonas Oreland 2010-10-20
ndb - bug#57322 - compute correct "failure times" when setting max-lag values for gcp and micro-gcp
[20 Oct 2010 9:56]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:jonas@mysql.com-20101020094501-9c07g1dk6ltsmn2o) (version source revid:jonas@mysql.com-20101020094501-9c07g1dk6ltsmn2o) (merge vers: 5.1.51-ndb-6.3.39) (pib:21)
[20 Oct 2010 9:56]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:jonas@mysql.com-20101020094955-fr3nxe2j2h106p12) (version source revid:jonas@mysql.com-20101020094955-fr3nxe2j2h106p12) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[20 Oct 2010 9:59]
Jonas Oreland
pushed to 6.3.39, 7.0.20 and 7.1.9
[20 Oct 2010 10:06]
Jonas Oreland
Explanation:

1) A GCP stop is detected using two "max-lag" variables (one for GCPs and one for epochs), which define the maximum time that a GCP/epoch can remain unchanged.

2) If, for example, TimeBetweenEpochsTimeout=100 but HeartbeatDBDB=1500, a node failure can be fired only after 4 missed heartbeats (i.e. 6000 ms). That means TimeBetweenEpochsTimeout would be exceeded, and a GCP stop would incorrectly be detected.

3) Therefore, TimeBetweenEpochsTimeout is automatically adjusted based on the values of HeartbeatDBDB and ArbitTimeout.

However: the automatic adjustment didn't correctly take into consideration that during cascading node failures there can be several "iterations" of (4 * HeartbeatDBDB + ArbitTimeout) timeouts until all node failures have been resolved internally. Therefore an "incorrect" GCP detection could still happen with cascading node failures. The patch fixes this so that cascading failures are also considered.

(btw: given this, I think the synopsis is a bit misleading...)
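The arithmetic in the explanation above can be sketched as follows. This is an illustrative model only, not NDB source code: the function names, the 4-missed-heartbeats constant, and the idea of multiplying by a number of failure-handling iterations are assumptions drawn from the comment, and the ArbitrationTimeout value is just the example's placeholder.

```python
# Illustrative sketch (not NDB code) of the max-lag adjustment described
# in the comment above.

MISSED_HEARTBEATS = 4  # a data node is declared dead after 4 missed heartbeats


def node_failure_time_ms(heartbeat_db_db_ms: int, arbit_timeout_ms: int) -> int:
    """Worst-case time to detect and arbitrate a single node failure:
    4 missed heartbeats plus the arbitration timeout."""
    return MISSED_HEARTBEATS * heartbeat_db_db_ms + arbit_timeout_ms


def min_safe_max_lag_ms(heartbeat_db_db_ms: int,
                        arbit_timeout_ms: int,
                        failure_iterations: int = 1) -> int:
    """Minimum max-lag so node-failure handling cannot trigger a false
    GCP stop.  With cascading node failures, several failure-handling
    iterations can run back to back, so the single-failure time is
    multiplied by the number of iterations; the pre-fix adjustment
    effectively assumed a single iteration."""
    return failure_iterations * node_failure_time_ms(heartbeat_db_db_ms,
                                                     arbit_timeout_ms)


# Values from the example in the comment: TimeBetweenEpochsTimeout=100,
# HeartbeatDBDB=1500; the ArbitrationTimeout of 7500 ms is assumed here.
single = node_failure_time_ms(1500, 7500)        # 4 * 1500 + 7500 = 13500 ms
cascading = min_safe_max_lag_ms(1500, 7500, 3)   # 3 iterations -> 40500 ms
print(single, cascading)
```

With the example's values, even a single node failure (13500 ms) far exceeds the configured 100 ms epoch timeout, which is why the value must be adjusted upward, and a cascade of failures raises the required floor further still.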
[20 Oct 2010 12:57]
Jon Stephens
Documented as follows in the NDB 6.3.39, 7.0.20, and 7.1.9 changelogs:

A GCP stop is detected using two parameters which determine the maximum time that a global checkpoint or epoch can go unchanged; one of these controls the timeout for GCPs and the other controls the timeout for epochs. Suppose the cluster is configured such that TimeBetweenEpochsTimeout is 100 ms but HeartbeatDBDB is 1500 ms. A node failure can be signalled only after 4 missed heartbeats (in this case, 6000 ms). However, this would exceed TimeBetweenEpochsTimeout, causing false detection of a GCP stop. To prevent this from happening, the configured value for TimeBetweenEpochsTimeout is automatically adjusted, based on the values of HeartbeatDBDB and ArbitrationTimeout.

The current issue arose because the automatic adjustment routine did not correctly take into consideration the fact that, during cascading node failures, several intervals of length 4 * HeartbeatDBDB + ArbitrationTimeout may elapse before all node failures have been resolved internally. This could cause false GCP detection in the event of a cascading node failure.

Closed.
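For context, the parameters named in this changelog entry are set in the data-node section of the cluster's config.ini. The fragment below is a hypothetical sketch reproducing the problematic configuration from the example above (the ArbitrationTimeout value is assumed); it is shown to illustrate where these settings live, not as recommended values.

```ini
[NDBD DEFAULT]
# Heartbeat interval between data nodes; a node is declared
# dead after 4 missed heartbeats (here, 4 * 1500 = 6000 ms).
HeartbeatIntervalDbDb=1500
# Configured epoch max-lag; far smaller than the node-failure
# handling time, so NDB must adjust it upward internally to
# avoid falsely detecting a GCP stop.
TimeBetweenEpochsTimeout=100
# Time allowed for arbitration during node-failure handling
# (assumed value for this sketch).
ArbitrationTimeout=7500
```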