Bug #56844 Race condition with 2 ndb_mgmd starting simultaneously with "--reload"
Submitted: 17 Sep 2010 14:30 Modified: 17 Sep 2010 16:34
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[17 Sep 2010 14:30] Jonas Oreland
Description:
Two (or more) ndb_mgmd processes starting in parallel with "--reload"
could (rarely) cause all of them to fail to start.

Seen rarely in CluB, but more frequently on sol10-sparc-a.

The problem was that the config-change protocol was very deadlock-prone,
as it "locked" all "replicas" in parallel.

The code is now changed to "lock" one node at a time
(in node id order), making it deadlock-free, so that
at least one of the nodes will succeed.
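
A minimal sketch of why contacting nodes in node id order helps. This is not
the ndb_mgmd code: it assumes a try-lock model in which a node refuses a
second "lock" request and a refused initiator gives up and releases whatever
it holds. The function run_round, the locks dict and the contact orders are
hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional


def run_round(node_ids: List[int], orders: Dict[int, List[int]]) -> List[int]:
    """Simulate initiators "locking" nodes one request per step.

    orders maps an initiator id to the order in which its lock requests
    reach the nodes (each order is assumed to cover every node once).
    Returns the initiators that managed to lock every node.
    """
    locks: Dict[int, Optional[int]] = {n: None for n in node_ids}
    active = set(orders)
    for step in range(len(node_ids)):
        refused = []
        for owner in sorted(active):
            node = orders[owner][step]
            if locks[node] is None:
                locks[node] = owner      # lock granted
            elif locks[node] != owner:
                refused.append(owner)    # node already locked by someone else
        # Refused initiators abort and release their locks after this step.
        for owner in refused:
            active.discard(owner)
            for n in node_ids:
                if locks[n] == owner:
                    locks[n] = None
    return sorted(active)


# Requests arrive in different orders (the old "parallel" behaviour):
# each initiator grabs one node, then is refused by the other.
print(run_round([1, 2], {1: [1, 2], 2: [2, 1]}))  # -> []

# Both initiators contact nodes in node id order: whoever gets the
# lowest node first goes on to lock the rest.
print(run_round([1, 2], {1: [1, 2], 2: [1, 2]}))  # -> [1]
```

With a single agreed order, the contest is decided at the first node, so the
loser backs off before it can hold anything the winner still needs.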

How to repeat:
run testMgmd long enough
or run my new test for testMgmd once...
it never passes

Suggested fix:
"lock" in order
[17 Sep 2010 14:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118489

3767 Jonas Oreland	2010-09-17
      ndb - bug#56844 - make config change protocol contact nodes in order to avoid deadlock
[17 Sep 2010 14:47] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100917144451-l5l9ea7qotpab3t3) (version source revid:jonas@mysql.com-20100917143059-6k3zsmma884um847) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)
[17 Sep 2010 14:51] Jonas Oreland
pushed to 7.0.19 and 7.1.8
[17 Sep 2010 16:34] Jon Stephens
Documented bugfix in the NDB-7.0.19 and 7.1.8 changelogs as follows:

        Under certain rare conditions, attempting to start more than one
        ndb_mgmd process simultaneously using the --reload option caused
        a race condition such that none of the ndb_mgmd processes could
        start.

Closed.