Bug #49560 Hanging restart with mysqld + take-over during system restart
Submitted: 9 Dec 2009 14:18 Modified: 11 Dec 2009 9:39
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[9 Dec 2009 14:18] Jonas Oreland
Description:
Take-over during system restart occurs when
  one or several nodes have a REDO log that is too old
  during a system restart, so they are started
  using the node restart procedure.

If this happens while a mysqld is attached to the cluster,
  the mysqld will take the "global schema lock",
  which is a row lock, and try to set up replication.

But replication setup fails (due to this bug),
  so the "global schema lock" is held for a very long time,
  which causes the node restart to hang as well, since it
  trips over the row lock.

The end result is that mysqld fails to set up replication
  (leading to tables being read-only) and one node
  hangs in restart.
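
The interaction described above can be illustrated with a small, self-contained model. This is only a sketch and not NDB code: the lock object, the `setup_replication` and `simulate` functions, and the `release_on_error` switch are all hypothetical names made up for the illustration. A plain `threading.Lock` stands in for the "global schema lock", and a timed acquire stands in for the node restart waiting on the row lock.

```python
import threading


class SetupFailed(Exception):
    """Raised when the (simulated) replication setup cannot complete."""


def setup_replication():
    # Hypothetical stand-in for the setup step that fails in this bug.
    raise SetupFailed("SUMA not started (simulated)")


def simulate(release_on_error, timeout=0.2):
    """Model one mysqld attach followed by one node restart.

    Returns (setup_ok, restart_ok). The lock is a stand-in for the
    "global schema lock"; nothing here talks to a real cluster.
    """
    lock = threading.Lock()
    result = {}

    def mysqld():
        lock.acquire()
        try:
            setup_replication()
            result["setup_ok"] = True
        except SetupFailed:
            result["setup_ok"] = False
            if not release_on_error:
                return  # models the bug: lock stays held after the failure
        lock.release()

    worker = threading.Thread(target=mysqld)
    worker.start()
    worker.join()

    # The restarting node needs the same row lock; give up after `timeout`
    # seconds instead of hanging forever, so the model always terminates.
    restart_ok = lock.acquire(timeout=timeout)
    if restart_ok:
        lock.release()
    return result["setup_ok"], restart_ok


if __name__ == "__main__":
    print("lock kept on failure:    ", simulate(release_on_error=False))
    print("lock released on failure:", simulate(release_on_error=True))
```

In this model the setup fails in both runs. The node restart only gets the lock when the failure path releases it, which is the general shape of the problem described here; how the real error handling is structured is not shown in this report.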

How to repeat:
Seen in autotest testSystemRestart -n to T1

Suggested fix:
Fix the SUMA error handling, so that it handles this case better
[9 Dec 2009 15:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/93349

3189 Jonas Oreland	2009-12-09
      ndb - bug#49560 - cleanup error handling wrt Suma not started
[9 Dec 2009 15:31] Jonas Oreland
Careful reading of the code showed that this bug
only exists in 7.0.

The patch was applied to 6.3 anyway, as it cleans up the code path
and keeps the two versions relatively in sync.
[10 Dec 2009 6:23] Jonas Oreland
Pushed to 6.3.29 and 7.0.10.
[11 Dec 2009 9:39] Jon Stephens
Documented bugfix in the NDB-6.3.29 and 7.0.10 changelogs as follows:

        Node takeover during a system restart occurs when the REDO log
        for one or more data nodes is out of date, so that a node
        restart is invoked for that node or those nodes. If this happens
        while a mysqld is attached to the cluster, the mysqld takes a
        global schema lock (a row lock), while trying to set up
        cluster-internal replication.

        However, this setup process could fail, causing the global
        schema lock to be held for an excessive length of time, which
        made the node restart hang as well. As a result, the mysqld
        failed to set up cluster-internal replication, which led to
        tables being read-only, and caused one node to hang during the
        restart.

          NOTE: This issue could actually occur in MySQL Cluster NDB 
          7.0 only, but the fix was also applied in MySQL Cluster 
          NDB 6.3, in order to keep the two codebases in alignment.

Closed.