MySQL Bugs: #56844: Race condition with 2 ndb_mgmd starting simultaneously with "--reload"

Bug #56844	Race condition with 2 ndb_mgmd starting simultaneously with "--reload"
Submitted:	17 Sep 2010 14:30	Modified:	17 Sep 2010 16:34
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-7.0	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
Two (or more) ndb_mgmd processes starting in parallel with "--reload"
could (rarely) cause all of them to fail to start.

Seen rarely in CluB, but more frequently on sol10-sparc-a.

The problem was that the config-change protocol was very deadlock-prone,
as it "locked" all "replicas" in parallel.

The code is now changed to "lock" one node at a time
(in node id order), making it deadlock-free, so that
at least one of the nodes will succeed.
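
To illustrate why taking the "locks" in node id order helps, here is a minimal toy model in Python. It is not the ndb_mgmd code: the names winners, count_stuck, contact_order and the initiators "A" and "B" are made up for this sketch, and it assumes that a request for a "lock" held by a peer simply waits. Parallel locking is modelled as requests that may reach the nodes in a different order for each initiator.

```python
from itertools import permutations


def winners(schedule, contact_order):
    """Replay one interleaving of "lock" requests and return who finished.

    schedule:      sequence of initiator names; each entry gives that
                   initiator one turn to request its next "lock".
    contact_order: initiator name -> node ids in the order its requests arrive.
    """
    owner = {}                                  # node id -> initiator holding it
    progress = {name: 0 for name in contact_order}
    for name in schedule:
        order = contact_order[name]
        if progress[name] == len(order):
            continue                            # already holds every "lock"
        node = order[progress[name]]
        if owner.get(node, name) != name:
            continue                            # held by a peer: keep waiting
        owner[node] = name
        progress[name] += 1
    return {n for n, done in progress.items() if done == len(contact_order[n])}


def count_stuck(contact_order):
    """Count interleavings of two initiators in which nobody finishes."""
    schedules = set(permutations("AABB"))       # all distinct interleavings
    return sum(1 for s in schedules if not winners(s, contact_order))


# Hypothetical case: requests reach the two nodes in opposite orders.
unordered = {"A": [1, 2], "B": [2, 1]}
# Both initiators contact nodes strictly in ascending node id order.
ordered = {"A": [1, 2], "B": [1, 2]}

print("stuck interleavings, unordered:", count_stuck(unordered))
print("stuck interleavings, ordered:  ", count_stuck(ordered))
```

In this model, whichever initiator gets the lowest node id first can always take the remaining "locks", because its peer is still waiting on that first node and holds nothing. With opposite contact orders, each initiator can end up holding one "lock" while waiting for the other.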

How to repeat:
Run testMgmd long enough,
or run my new test for testMgmd once;
it never passes.

Suggested fix:
Take the "locks" in order (by node id).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118489

3767 Jonas Oreland	2010-09-17
      ndb - bug#56844 - make config change protocol contact nodes in order to avoid deadlock

Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100917144451-l5l9ea7qotpab3t3) (version source revid:jonas@mysql.com-20100917143059-6k3zsmma884um847) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)

Pushed to 7.0.19 and 7.1.8.

Documented bugfix in the NDB-7.0.19 and 7.1.8 changelogs as follows:

        Under certain rare conditions, attempting to start more than one
        ndb_mgmd process simultaneously using the --reload option caused
        a race condition such that none of the ndb_mgmd processes could
        start.

Closed.