Bug #56579 Incorrect error handling of SUB_START_REF could lead to hanging node-restarts
Submitted: 6 Sep 2010 8:12
Modified: 6 Sep 2010 11:27
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.3
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[6 Sep 2010 8:12] Jonas Oreland
Description:
When a MySQL server starts,
  it will, as part of setting up replication,
  subscribe to data events from all data nodes.
  This is implemented using SUB_START_REQ.

SUB_START_REQ atomicity is implemented as follows:
  if any of the nodes returns an error, a
  SUB_STOP_REQ is sent to the nodes that
  replied with SUB_START_CONF.

However, if *all* nodes returned an error,
  then SUB_STOP_REQ was sent to *zero* nodes
  and the code hung waiting for a reply.

This leads to
1) mysqld hanging during restart
2) subsequent data node restarts hanging
   (as SUB_START_REQ is serialized with respect to node restart)
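
The failure mode described above can be illustrated with a toy model. This is only a sketch of the logic as the description states it, not the NDB source: the class `StartRound`, the functions `finish_start_buggy` and `on_stop_reply`, and all field names are hypothetical.

```python
# Toy model of the pre-fix flow. All names are hypothetical, not NDB source.
from dataclasses import dataclass, field


@dataclass
class StartRound:
    confirmed: set = field(default_factory=set)      # replied SUB_START_CONF
    refused: set = field(default_factory=set)        # replied SUB_START_REF
    awaiting_stop: set = field(default_factory=set)  # SUB_STOP_REQ outstanding
    replied_to_api: bool = False


def finish_start_buggy(rnd):
    """Called once every data node has answered SUB_START_REQ."""
    if not rnd.refused:
        rnd.replied_to_api = True   # every node confirmed: report success
        return
    # Roll back: SUB_STOP_REQ goes to every node that confirmed.
    rnd.awaiting_stop = set(rnd.confirmed)
    # The reply to the API is sent only from on_stop_reply().


def on_stop_reply(rnd, node):
    """Called when a node answers SUB_STOP_REQ."""
    rnd.awaiting_stop.discard(node)
    if not rnd.awaiting_stop:
        rnd.replied_to_api = True   # rollback finished: report the error
```

With `confirmed={1}` and `refused={2}`, one stop reply arrives and the API gets its answer. With `confirmed` empty and every node in `refused`, no SUB_STOP_REQ is sent, so `on_stop_reply` is never called and `replied_to_api` stays `False`. That is the hang.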

How to repeat:
Not sure,
except with my new test program.

Suggested fix:
Check whether all nodes replied with an error code
  and, in that case, return to API/MYSQLD directly.
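
A minimal sketch of the suggested check, in Python rather than the NDB kernel's own language. It is not the committed patch: the function `finish_start`, its parameters, and the `"CONF"`/`"REF"` result strings are hypothetical stand-ins for the real signal handling.

```python
# Sketch of the suggested fix. All names are hypothetical, not NDB source.
def finish_start(confirmed, refused, send_stop, reply_to_api):
    """Decide what to do once every node has answered SUB_START_REQ.

    confirmed / refused: sets of node ids that replied CONF / REF.
    send_stop(node):     sends SUB_STOP_REQ to one node.
    reply_to_api(code):  answers the API/mysqld with "CONF" or "REF".
    Returns the set of nodes whose SUB_STOP_REQ reply is still awaited.
    """
    if not refused:
        reply_to_api("CONF")        # every node confirmed
        return set()
    if not confirmed:
        # Every node refused: there is nothing to roll back,
        # so answer the API directly instead of waiting.
        reply_to_api("REF")
        return set()
    for node in sorted(confirmed):  # partial failure: roll back
        send_stop(node)
    return set(confirmed)
```

In the partial-failure branch the error is still reported only after the awaited stop replies have arrived. The new branch covers just the case where that wait set would be empty.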
[6 Sep 2010 8:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117570

3281 Jonas Oreland	2010-09-06
      ndb - bug#56579 - handle case where all node reply SUB_START_REF
[6 Sep 2010 8:26] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.38 (revid:jonas@mysql.com-20100906081408-qxf5u3v4mgc0od6c) (version source revid:jonas@mysql.com-20100906081408-qxf5u3v4mgc0od6c) (merge vers: 5.1.47-ndb-6.3.38) (pib:21)
[6 Sep 2010 8:26] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100906082033-xncr7d9c2onjye9i) (version source revid:jonas@mysql.com-20100906082033-xncr7d9c2onjye9i) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)
[6 Sep 2010 8:28] Jonas Oreland
pushed to 6.3.38, 7.0.19 and 7.1.8
[6 Sep 2010 9:49] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.38 (revid:jonas@mysql.com-20100906094448-275ygrbhwz0uvq0l) (version source revid:jonas@mysql.com-20100906094448-275ygrbhwz0uvq0l) (merge vers: 5.1.47-ndb-6.3.38) (pib:21)
[6 Sep 2010 9:49] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.19 (revid:jonas@mysql.com-20100906094535-h4bq42ssrh2gcprz) (version source revid:jonas@mysql.com-20100906094535-h4bq42ssrh2gcprz) (merge vers: 5.1.47-ndb-7.0.19) (pib:21)
[6 Sep 2010 9:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117581

3282 Jonas Oreland	2010-09-06
      ndb - bug#56579 - remove accidently left incorrect assert
[6 Sep 2010 11:27] Jon Stephens
Documented bugfix in the 6.3.38, 7.0.19, and 7.1.8 changelogs, as follows:

        When an SQL node starts, as part of setting up replication, it
        subscribes to data events from all data nodes using a
        SUB_START_REQ (subscription start request) signal. Atomicity of
        SUB_START_REQ is implemented such that, if any of the nodes
        returns an error, a SUB_STOP_REQ (subscription stop request) is
        sent to any nodes that replied with a SUB_START_CONF
        (subscription start confirmation). However, if all data nodes
        returned an error, SUB_STOP_REQ was not sent to any of them.
        This caused mysqld to hang when restarting (while waiting for a
        response), and subsequent data node restarts to hang as well.

Closed.