Bug #51644 Race condition in system restart, can lead to nodes never starting
Submitted: 2 Mar 2010 14:51
Modified: 5 Mar 2010 13:25
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.3
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[2 Mar 2010 14:51] Jonas Oreland
Description:
If for some reason one or several nodes read their LCP (and apply undo)
significantly faster than the others, the system restart can
hang, printing "delay: req=X".

This bug has been around since forever, but has surfaced due to other changes.

One easy way of having one node faster than others is to have a mixed ndbd/ndbmtd cluster.

How to repeat:
Run the system restart tests in autotest and
observe that some of them time out.
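
When checking whether a timed-out test run hit this hang, it can help to scan the node output for the "delay: req=X" message quoted in the description. The following is only a rough sketch: the marker text comes from this report, but the helper names, the surrounding log format, and the threshold of 3 repeated lines are all assumptions made for illustration, not part of NDB or autotest.

```python
import re

# The marker text is taken from the bug description; everything about the
# surrounding log line format is an assumption.
DELAY_MARKER = re.compile(r"delay: req=(\S+)")


def find_delay_lines(lines):
    """Return (line_number, req_value) for every line containing the marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = DELAY_MARKER.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


def looks_hung(lines, threshold=3):
    """Heuristic (hypothetical): `threshold` or more marker lines suggest a hang."""
    return len(find_delay_lines(lines)) >= threshold
```

A caller would pass in the lines of a captured node log, for example `looks_hung(open(path).read().splitlines())`, where `path` is whatever file the test run wrote its output to.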

Suggested fix:
.
[2 Mar 2010 15:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/102064

3110 Jonas Oreland	2010-03-02
      ndb - bug#51644 fix race condition wrt EXEC_SRREQ/EXEC_FRAGREQ
[4 Mar 2010 13:53] Jonas Oreland
pushed to 6.3.32 and 7.0.13
[5 Mar 2010 13:25] Jon Stephens
Documented in the NDB-6.3.32 and 7.0.13 changelogs as follows:

        When one or more data nodes read their LCPs and applied undo
        logs significantly faster than others, this could lead to a race
        condition causing system restarts of data nodes to hang. This
        could most often occur when using both ndbd and ndbmtd processes
        for the data nodes.

Closed.