Bug #36245 NF_COMPLETEREP can get lost if new master dies just before sending it
Submitted: 22 Apr 2008 7:31
Modified: 27 Apr 2008 11:15
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: *
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[22 Apr 2008 7:31] Jonas Oreland
On cascading master failure, NF_COMPLETEREP can get lost.
This can lead to (at least):
- transactions hanging (6 min) and returning error 4012
- scans hanging for a long time
- ndb_mgmd locking up in alloc_nodeid

How to repeat:
Kill the new master exactly when it is about to send NF_COMPLETEREP.

Suggested fix:
let all nodes send it
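
The difference between the two behaviours can be sketched with a toy model. This is not NDB code: the Node class, the broadcast_completion and run helpers, and the "lowest live node id acts as master" rule are all assumptions made for illustration only. Node 1 is the original master that failed, node 2 takes over and dies just before sending, and nodes 3 and 4 survive.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a data node; not an NDB structure."""
    node_id: int
    alive: bool = True
    completed_failures: set = field(default_factory=set)

    def receive_nf_completerep(self, failed_id):
        # A set makes duplicate reports from several senders harmless.
        self.completed_failures.add(failed_id)


def broadcast_completion(nodes, failed_id, master_only, dies_before_send=frozenset()):
    """Report that handling of failed_id is complete.

    master_only=True : only the master sends (assumed here: lowest live id).
    master_only=False: every live node sends.
    dies_before_send : ids of nodes that crash just before sending.
    """
    live = [n for n in nodes if n.alive]
    senders = [min(live, key=lambda n: n.node_id)] if master_only else live
    for sender in senders:
        if sender.node_id in dies_before_send:
            sender.alive = False  # crashed: nothing is sent by this node
            continue
        for receiver in nodes:
            if receiver.alive:
                receiver.receive_nf_completerep(failed_id)


def run(master_only):
    """Return, per surviving node, whether it saw the report for node 1."""
    nodes = [Node(i) for i in (1, 2, 3, 4)]
    nodes[0].alive = False  # node 1, the original master, has failed
    broadcast_completion(nodes, failed_id=1, master_only=master_only,
                         dies_before_send={2})
    return {n.node_id: 1 in n.completed_failures for n in nodes if n.alive}


if __name__ == "__main__":
    print("master only:", run(master_only=True))
    print("all nodes:  ", run(master_only=False))
```

In this model the master-only variant leaves nodes 3 and 4 without the report, so they would wait indefinitely, while the all-nodes variant delivers it despite the second failure. The cost is duplicate reports, which receivers must tolerate.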
[25 Apr 2008 6:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:


ChangeSet@1.2203, 2008-04-25 08:30:39+02:00, jonas@perch.ndb.mysql.com +4 -0
  ndb - bug#36245
    NF_COMPLETEREP can get lost on cascading master failure
    causing *big* pain and misery
[25 Apr 2008 6:33] Bugs System
ChangeSet@1.2204, 2008-04-25 08:33:06+02:00, jonas@perch.ndb.mysql.com +2 -0
  ndb - bug#36245
[25 Apr 2008 6:37] Bugs System
ChangeSet@1.2602, 2008-04-25 08:36:45+02:00, jonas@perch.ndb.mysql.com +6 -0
  ndb - bug#36245
    NF_COMPLETEREP can get lost on cascading master failure
    causing *big* pain and misery
[25 Apr 2008 7:54] Jonas Oreland
pushed to 51-ndb, telco* and drop6
(50-ndb was locked for an unknown reason)
[25 Apr 2008 9:47] Bugs System
Pushed into 5.1.24-ndb-6.3.13
[27 Apr 2008 11:16] Jon Stephens
Documented in the 5.1.24-ndb-6.3.14 changelog as follows:

        Notification of a cascading master node failure could sometimes not be
        transmitted correctly (that is, transmission of the
        NF_COMPLETEREP signal could fail), leading to
        transactions hanging and timing out (NDB error 4012),
        scans hanging, and failure of the management server process.
[12 Dec 2008 23:29] Bugs System
Pushed into 6.0.6-alpha  (revid:sp1r-jonas@perch.ndb.mysql.com-20080425063645-48096) (version source revid:jonas@mysql.com-20080808094047-4e1yiarqa2t3opg3) (pib:5)