Bug #41295 Node crash during node-failure-handling
Submitted: 8 Dec 2008 11:05 Modified: 11 Dec 2008 0:55
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: * OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[8 Dec 2008 11:05] Jonas Oreland
Description:
During node-failure handling (of a non-master node) there is a (very) low risk
that the master is still waiting for a GCP_NODEFINISHED from the dead node
after having already received all other GCP_NODEFINISHED signals.

If this happened and the dead node had a transaction that was currently committing
in the epoch, the master node could crash in DBTC when discovering that a transaction belonged to an epoch that was already complete.
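
The ordering described above can be illustrated with a small toy model. This is only a sketch and not NDB kernel code: the ToyMaster class, its methods, the node ids and the epoch number are all hypothetical, and it assumes that dropping the dead node from the wait set is what lets the epoch complete before the dead node's transaction is examined.

```python
# Toy model of the race; all names are hypothetical, not NDB internals.
class ToyMaster:
    def __init__(self, nodes, epoch):
        self.epoch = epoch                # epoch currently being completed
        self.waiting = set(nodes)         # nodes that still owe GCP_NODEFINISHED
        self.completed_epoch = epoch - 1  # last epoch regarded as complete

    def _maybe_complete(self):
        # The epoch is regarded as complete once nobody is left to wait for.
        if not self.waiting:
            self.completed_epoch = self.epoch

    def on_gcp_nodefinished(self, node):
        self.waiting.discard(node)
        self._maybe_complete()

    def on_node_failure(self, node):
        # Assumption: the dead node can never report, so it is dropped
        # from the wait set, which may complete the epoch.
        self.waiting.discard(node)
        self._maybe_complete()

    def take_over_transaction(self, txn_epoch):
        # Stand-in for the consistency check that fails in the real system.
        if txn_epoch <= self.completed_epoch:
            raise RuntimeError(
                "transaction belongs to already completed epoch %d" % txn_epoch
            )
        return txn_epoch


def simulate():
    master = ToyMaster(nodes=[2, 3, 4], epoch=10)
    master.on_gcp_nodefinished(2)   # all other nodes have reported
    master.on_gcp_nodefinished(3)
    master.on_node_failure(4)       # master was waiting only for node 4
    try:
        # Node 4 had a transaction still committing in epoch 10.
        master.take_over_transaction(txn_epoch=10)
    except RuntimeError as err:
        return str(err)
    return "no conflict"
```

In this model the check fails only when the dead node is the last one the master is waiting for, which matches the low likelihood noted in the description.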

How to repeat:
A new test program with error inserts that force the condition.
[8 Dec 2008 12:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/60898

2764 Jonas Oreland	2008-12-08
      ndb - bug#41295 bug#41296 bug#41297
[8 Dec 2008 14:00] Bugs System
Pushed into 5.1.30-ndb-6.2.17  (revid:jonas@mysql.com-20081208123555-23afeiagk2vputc1) (version source revid:jonas@mysql.com-20081208123555-23afeiagk2vputc1) (pib:5)
[8 Dec 2008 14:01] Bugs System
Pushed into 5.1.30-ndb-6.3.20  (revid:jonas@mysql.com-20081208123555-23afeiagk2vputc1) (version source revid:jonas@mysql.com-20081208133911-5ef2zriejdniqgkd) (pib:5)
[8 Dec 2008 14:02] Bugs System
Pushed into 5.1.30-ndb-6.4.0  (revid:jonas@mysql.com-20081208123555-23afeiagk2vputc1) (version source revid:jonas@mysql.com-20081208135815-5pzw01ax9hrbbw3j) (pib:5)
[8 Dec 2008 14:13] Jonas Oreland
note: 6.3.20 *might* be incorrect, check with tomas
[10 Dec 2008 23:08] Jon Stephens
Fix actually went into NDB-6.3.21, updated changelog entry accordingly.
[10 Dec 2008 23:16] Jon Stephens
Documented in the NDB-6.2.17 and 6.3.21 changelogs as follows:

        During node failure handling (of a data node other than the
        master), there was a chance that the master was waiting for a
        GCP_NODEFINISHED signal from the failed node
        after having received it from all other data nodes. If this
        occurred while the failed node had a transaction that was still
        being committed in the current epoch, the master node could
        crash in the DBTC kernel block when
        discovering that a transaction actually belonged to an epoch
        which was already completed.
[12 Dec 2008 23:28] Bugs System
Pushed into 6.0.9-alpha  (revid:jonas@mysql.com-20081208123555-23afeiagk2vputc1) (version source revid:tomas.ulin@sun.com-20081209185954-9svcixh2p5hsfi6w) (pib:5)