Bug #29364 SQL queries hang while data node in start phase 5
Submitted: 26 Jun 2007 15:06
Modified: 4 Jul 2007 9:28
Reporter: Geert Vanderkelen
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.0.42
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: ndb, SQL

[26 Jun 2007 15:06] Geert Vanderkelen
Description:
When restarting a data node some queries will hang during start phase 5. They continue once phase 6 starts.

This basically means that a data node can't be restarted in production without risking a hung application. With lots of data and many tables/indexes, the hang can last a long time (hangs of over 15 minutes have been reported).

How to repeat:
Repeatable test case provided privately.

The test case contains quite a lot of tables and indexes, with some of the tables being MyISAM. The query used to demonstrate the hang is a join of NDB and MyISAM tables, but the hang has also been reported for queries on NDB tables only.

I haven't been able to repeat it with 1 or 2 data nodes (I gave up trying).
[26 Jun 2007 15:21] Geert Vanderkelen
Part of the cluster log, showing where the hanging queries start to appear:

* Restart node 1 (master)
- No problems

* Restart node 2 (node 3 is master)
- No problems

* Restart of node 3 (master).

2007-06-26 10:51:32 [MgmSrvr] INFO     -- Node 3: DICT: index 210 activated
2007-06-26 10:51:32 [MgmSrvr] INFO     -- Node 4: Local checkpoint 336 started. Keep GCI = 5490 oldest restorable GCI = 5437
-> Hang starts here
2007-06-26 10:52:30 [MgmSrvr] INFO     -- Node 4: Local checkpoint 337 started. Keep GCI = 5579 oldest restorable GCI = 5437
2007-06-26 10:52:30 [MgmSrvr] INFO     -- Node 4: Node 3 is WAIT_LCP including in LCP
2007-06-26 10:53:37 [MgmSrvr] INFO     -- Node 3: Start phase 5 completed (node restart)
-> Hang disappears about here
2007-06-26 10:53:37 [MgmSrvr] INFO     -- Node 3: Start phase 6 completed (node restart)
2007-06-26 10:53:37 [MgmSrvr] INFO     -- Node 3: Start phase 7 completed (node restart)

* Restart of node 4 (master)
- No problems

* Restart of node 1 (master)
2007-06-26 11:05:46 [MgmSrvr] INFO     -- Node 1: DICT: index 210 activated
2007-06-26 11:05:46 [MgmSrvr] INFO     -- Node 2: Local checkpoint 347 started. Keep GCI = 5850 oldest restorable GCI = 5768
-> Hang starts here
2007-06-26 11:06:40 [MgmSrvr] INFO     -- Node 2: Local checkpoint 348 started. Keep GCI = 5877 oldest restorable GCI = 5768
2007-06-26 11:06:40 [MgmSrvr] INFO     -- Node 2: Node 1 is WAIT_LCP including in LCP

* Restart of node 2 (master)
- No problems

* Restart of node 3 (master)
2007-06-26 11:17:55 [MgmSrvr] INFO     -- Node 4: Local checkpoint 356 started. Keep GCI = 6149 oldest restorable GCI = 6094
-> Hang starts here
2007-06-26 11:18:50 [MgmSrvr] INFO     -- Node 4: Local checkpoint 357 started. Keep GCI = 6235 oldest restorable GCI = 6094
2007-06-26 11:18:50 [MgmSrvr] INFO     -- Node 4: Node 3 is WAIT_LCP including in LCP
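
To put a number on the hang seen in the first node 3 restart above, the annotated excerpt can be run through a small script. This is only a rough sketch in Python, not part of the test case: the helper names parse_line and hang_window_seconds are made up for illustration, the "->" lines are the hand-written annotations from this comment rather than real cluster log output, and the result is approximate because the log has one-second resolution and the markers were placed by eye.

```python
import re
from datetime import datetime

# Matches cluster log lines of the shape shown in the excerpt above, e.g.
# "2007-06-26 10:51:32 [MgmSrvr] INFO     -- Node 3: DICT: index 210 activated"
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] \w+\s+-- Node (\d+): (.*)$"
)


def parse_line(line):
    """Return (timestamp, node_id, message), or None for non-log lines."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts, int(m.group(2)), m.group(3)


def hang_window_seconds(lines):
    """Seconds between the log lines that precede the two "->" annotations.

    Returns None if either annotation is missing from the excerpt.
    """
    last_ts = None
    start = end = None
    for line in lines:
        text = line.strip()
        parsed = parse_line(text)
        if parsed is not None:
            last_ts = parsed[0]
        elif text.startswith("-> Hang start"):
            start = last_ts  # timestamp of the last line logged before the hang
        elif text.startswith("-> Hang disappears"):
            end = last_ts  # timestamp of the last line logged before recovery
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# First restart of node 3, copied from the excerpt above.
EXCERPT = """\
2007-06-26 10:51:32 [MgmSrvr] INFO     -- Node 3: DICT: index 210 activated
2007-06-26 10:51:32 [MgmSrvr] INFO     -- Node 4: Local checkpoint 336 started. Keep GCI = 5490 oldest restorable GCI = 5437
-> Hang starts here
2007-06-26 10:52:30 [MgmSrvr] INFO     -- Node 4: Local checkpoint 337 started. Keep GCI = 5579 oldest restorable GCI = 5437
2007-06-26 10:52:30 [MgmSrvr] INFO     -- Node 4: Node 3 is WAIT_LCP including in LCP
2007-06-26 10:53:37 [MgmSrvr] INFO     -- Node 3: Start phase 5 completed (node restart)
-> Hang disappears about here
2007-06-26 10:53:37 [MgmSrvr] INFO     -- Node 3: Start phase 6 completed (node restart)
""".splitlines()

print(hang_window_seconds(EXCERPT))  # 125.0
```

For that restart the window comes out at roughly two minutes. The later excerpts carry no "Hang disappears" annotation, so the function would return None for them.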
[26 Jun 2007 15:22] Geert Vanderkelen
Verified using 5.0.42 (and earlier versions of enterprise).
[2 Jul 2007 9:46] Jonas Oreland
mysqltest script for testing

Attachment: run.test (application/octet-stream, text), 439 bytes.

[2 Jul 2007 11:45] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/30063

ChangeSet@1.2312, 2007-07-02 13:45:24+02:00, jonas@perch.ndb.mysql.com +3 -0
  ndb - bug#29364 - "SQL queries hang while data node in start phase 5"
    In TC init node status for already started nodes during node restart
    (not present in 5.1)
[2 Jul 2007 12:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/30065

ChangeSet@1.2502, 2007-07-02 13:59:17+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#29364 - port merge (5.0 -> 5.1)
[2 Jul 2007 13:01] Jonas Oreland
pushed to 50-ndb
(and test prg to 51-ndb, 51-telco, telco-6.*)
[2 Jul 2007 13:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/30076

ChangeSet@1.2157, 2007-07-02 15:22:46+02:00, jonas@perch.ndb.mysql.com +3 -0
  ndb - bug#29364 - "SQL queries hang while data node in start phase 5"
    In TC init node status for already started nodes during node restart
    (not present in 5.1)
[3 Jul 2007 18:57] Bugs System
Pushed into 5.1.21-beta
[4 Jul 2007 9:28] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.21 changelog.
[4 Jul 2007 10:27] Jon Stephens
Also documented for telco-6.2.4 release.
[10 Jul 2007 13:29] Bugs System
Pushed into 5.0.46
[6 Sep 2007 12:15] Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at

  http://www.mysql.com/downloads/

Also documented fix in 5.0.46 changelog.