MySQL Bugs: #51644: Race condition in system restart, can lead to nodes never starting

Bug #51644	Race condition in system restart, can lead to nodes never starting
Submitted:	2 Mar 2010 14:51	Modified:	5 Mar 2010 13:25
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-6.3	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
If for some reason one or several nodes read their LCP (and apply undo)
significantly faster than the others, the system restart can hang,
printing "delay: req=X".

This bug has been around since forever, but has surfaced due to other changes.

One easy way of having one node faster than others is to have a mixed ndbd/ndbmtd cluster.

How to repeat:
Run the system restart tests in autotest and
observe that some of them time out.
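
To tell this hang apart from other timeouts, it may help to scan the node output for the repeated "delay: req=X" message. The following is a minimal sketch only: the helper names and the threshold are hypothetical, and it assumes the message appears literally as "delay: req=" followed by a value, since the exact surrounding log format is not given in this report.

```python
import re

# Assumption: the hang message looks like "delay: req=<value>"; the rest of
# the log line format is unknown, so only this fragment is matched.
DELAY_RE = re.compile(r"delay: req=(\S+)")


def count_delay_lines(lines):
    """Return a dict mapping each req value to how often it was printed."""
    counts = {}
    for line in lines:
        match = DELAY_RE.search(line)
        if match:
            req = match.group(1)
            counts[req] = counts.get(req, 0) + 1
    return counts


def looks_hung(lines, threshold=10):
    """Heuristic: some req value was printed at least `threshold` times.

    The threshold is an arbitrary illustrative value, not taken from NDB.
    """
    return any(n >= threshold for n in count_delay_lines(lines).values())
```

An open log file can be passed directly as `lines`, since the helpers only iterate over it line by line.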

Suggested fix:
.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/102064

3110 Jonas Oreland	2010-03-02
      ndb - bug#51644 fix race condition wrt EXEC_SRREQ/EXEC_FRAGREQ

Pushed to 6.3.32 and 7.0.13.

Documented in the NDB-6.3.32 and 7.0.13 changelogs as follows:

        When one or more data nodes read their LCPs and applied undo
        logs significantly faster than others, this could lead to a race
        condition causing system restarts of data nodes to hang. This
        could most often occur when using both ndbd and ndbmtd processes
        for the data nodes.

Closed.