Bug #57946 Race condition when api/mgm-node gets disconnected due to missed-heartbeat
Submitted: 3 Nov 2010 8:17
Modified: 5 Nov 2010 22:14
Reporter: Daniel Smythe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.1.6
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: heartbeat, ndb_mgmd, ndbd

[3 Nov 2010 8:17] Daniel Smythe
Description:
There appears to be a race condition when an API or management node is disconnected due to missed heartbeats, which can cause data nodes to crash.

How to repeat:
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 1, Length: 21, Rec Block No: 0)
Error object: SimulatedBlock.cpp:516
Program: /usr/local/mysql/bin/ndbmtd
Pid: 9020 thr: 0
Version: mysql-5.1.47 ndb-7.1.6

Suggested fix:
QMGR should check, when handling API_REGREQ, whether a socket close is already pending for the node.
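
A minimal sketch of the kind of guard suggested here, written in Python purely for illustration. It is not the NDB source: `ApiNodeState`, its fields, and `handle_api_regreq` are hypothetical names, and the only idea taken from this report is that a registration request should be ignored while a close is pending for the sending node.

```python
from dataclasses import dataclass


@dataclass
class ApiNodeState:
    """Hypothetical per-node bookkeeping (not the real QMGR structures)."""
    connected: bool = True
    close_pending: bool = False  # assumed flag: close requested, socket not yet closed
    registered: bool = False


def handle_api_regreq(node: ApiNodeState) -> bool:
    """Return True if the registration request is accepted and may be answered."""
    # The guard: never answer a node whose connection is being torn down,
    # because the reply can no longer be sent to it.
    if node.close_pending or not node.connected:
        return False
    node.registered = True
    return True
```

With this guard in place, a request that arrives in the window between the close decision and the actual socket close is dropped instead of triggering a reply.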
[3 Nov 2010 8:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122601

3917 Jonas Oreland	2010-11-03
      ndb - bug#57946 - fix weird race-condition with CLOSE_COMREQ/API_REGREQ
[3 Nov 2010 8:39] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:jonas@mysql.com-20101103083416-8bobvmv9cujj4rb7) (version source revid:jonas@mysql.com-20101103083416-8bobvmv9cujj4rb7) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[3 Nov 2010 8:44] Jonas Oreland
pushed to 7.0.20 and 7.1.9

DOCS:
1) api/mgmd connects
2) no heartbeat arrives
3) qmgr decides that api/mgmd has missed enough heartbeats
4) heartbeat arrives
5) crash

note:
- it has to be the first heartbeat (i.e. no other heartbeat has arrived earlier)
- the time between step 3 and the actual connection close is normally microseconds
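
The five steps above can be replayed with a toy model. This is a sketch under stated assumptions, not the actual QMGR logic: `QmgrModel`, `SendSignalError`, `replay`, and the `max_missed` threshold are all hypothetical, and the model simply assumes that replying to a node whose connection is already closing fails, which stands in for the "Send signal error" in the crash report.

```python
class SendSignalError(Exception):
    """Stand-in for the fatal send error seen in the crash report."""


class QmgrModel:
    """Toy model of heartbeat tracking for one api/mgmd node (hypothetical)."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed  # assumed threshold, not a real default
        self.missed = 0
        self.link_open = False
        self.close_pending = False

    def connect(self) -> None:
        self.link_open = True  # step 1

    def tick(self) -> None:
        """One heartbeat interval passes without a heartbeat (step 2)."""
        self.missed += 1
        if self.missed >= self.max_missed and not self.close_pending:
            # Step 3: decide to disconnect. Sending is disabled at once;
            # the socket itself closes slightly later.
            self.close_pending = True
            self.link_open = False

    def on_first_heartbeat(self) -> None:
        """Unguarded handler: always tries to reply to the sender (step 4)."""
        if not self.link_open:
            raise SendSignalError("reply to a node whose connection is closing")
        self.missed = 0


def replay(ticks_before_heartbeat: int, max_missed: int = 3) -> str:
    """Run the sequence and report whether the model 'crashes' (step 5)."""
    qmgr = QmgrModel(max_missed)
    qmgr.connect()
    for _ in range(ticks_before_heartbeat):
        qmgr.tick()
    try:
        qmgr.on_first_heartbeat()
    except SendSignalError:
        return "crash"
    return "ok"
```

In this model, `replay(3)` lands the first heartbeat just after the disconnect decision and hits the failure path, while `replay(1)` delivers it in time and is handled normally.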
[5 Nov 2010 22:14] Jon Stephens
Documented fix in the NDB-7.0.20 and 7.1.9 changelogs as follows:

      The disconnection of an API or management node due to missed 
      heartbeats led to a race condition which could cause data nodes 
      to crash.

Closed.