Bug #51645 Race condition in ndbmtd wrt to EXEC_SR can lead to nodes not starting
Submitted: 2 Mar 2010 15:42 Modified: 5 Mar 2010 13:32
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[2 Mar 2010 15:42] Jonas Oreland
Description:
When performing a system restart involving ndbmtd, there is a slight risk
that the restart hangs due to incorrect serialization of signals passed
between LQH instances/proxies.

Some signals were sent via the proxy, others directly, which means that
no ordering between them is guaranteed.

If they arrived in the wrong order (which is very unlikely),
one or several nodes could end up looping on "delay: req=X" forever.
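
To illustrate why two paths give no ordering guarantee, here is a minimal toy model, not NDB code. It assumes each path behaves as an independent FIFO queue and that the receiver may drain the paths in any order. The path names "proxy" and "direct" and the helper functions are hypothetical; EXEC_FRAG and EXEC_SR are used only as labels taken from the patch comment, and the send order shown is an assumption.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def arrival_orders(sends):
    """Return the set of possible arrival orders.

    sends is a list of (path, signal) pairs in send order. Each path is
    modeled as a FIFO queue; nothing orders one path relative to another.
    """
    queues = {}
    for path, signal in sends:
        queues.setdefault(path, []).append(signal)
    merged = [[]]
    for queue in queues.values():
        merged = [m for partial in merged for m in interleavings(partial, queue)]
    return {tuple(m) for m in merged}


# Assumed send order: EXEC_FRAG first, then EXEC_SR, on different paths.
split_paths = arrival_orders([("proxy", "EXEC_FRAG"), ("direct", "EXEC_SR")])
```

In this model `split_paths` contains both possible orders, so a receiver that depends on one of them can occasionally see the other.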

How to repeat:
Run autotest.

Suggested fix:
Fix so that all signals that need a relative order are sent along the same path.
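
The idea can be sketched as follows. This is only an illustration under stated assumptions, not the committed patch: the `OrderedSender` class, the notion of an "ordering group", and the path names are all hypothetical. The sketch pins every signal of a group to whichever path the first signal of that group used, so a single FIFO queue carries them in send order.

```python
from collections import deque


class OrderedSender:
    """Toy sender: every signal in one ordering group uses one path."""

    def __init__(self, path_names):
        # One FIFO queue per path (hypothetical names such as "proxy").
        self.paths = {name: deque() for name in path_names}
        # Ordering group -> path pinned for that group.
        self.pinned = {}

    def send(self, signal, group, preferred_path):
        # The first signal of a group pins its path; later signals in the
        # same group ignore their own preference and follow the pinned path.
        path = self.pinned.setdefault(group, preferred_path)
        self.paths[path].append(signal)
        return path


sender = OrderedSender(["proxy", "direct"])
first = sender.send("EXEC_FRAG", group="restart", preferred_path="proxy")
second = sender.send("EXEC_SR", group="restart", preferred_path="direct")
```

Here both signals end up on the "proxy" queue in the order they were sent, even though the second one asked for the direct path.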
[2 Mar 2010 15:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/102067

3418 Jonas Oreland	2010-03-02
      ndb - bug#51645 - same EXEC_SR and EXEC_FRAG same path
[2 Mar 2010 15:58] Bugs System
Pushed into 5.1.41-ndb-7.0.14 (revid:jonas@mysql.com-20100302154927-6awe2owvg9wvp7w1) (version source revid:jonas@mysql.com-20100302154722-fy20t45o4nkzzmat) (merge vers: 5.1.41-ndb-7.0.14) (pib:16)
[4 Mar 2010 13:52] Jonas Oreland
Pushed into 6.3.32 and 7.0.13.
[4 Mar 2010 13:53] Jonas Oreland
Sorry, this was 7.0.13 only.
[5 Mar 2010 13:33] Jon Stephens
Documented bugfix in the NDB-7.0.13 changelog as follows:

        When performing a system restart of a MySQL Cluster where
        multi-threaded data nodes were in use, there was a slight risk
        that the restart would hang due to incorrect serialization of
        signals passed between LQH instances and proxies; some signals
        were sent using a proxy, and others directly, which meant that
        the order in which they were sent and received could not be
        guaranteed. If signals arrived in the wrong order, this could
        cause one or more data nodes to hang. Now all signals that need
        to be sent and received in the same order are sent using the
        same path.

Closed.