Bug #21213 3r: Node restart fails and causes other node to shutdown
Submitted: 21 Jul 2006 12:44 Modified: 6 Sep 2006 11:05
Reporter: Lars Torstensson Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:4.1,5.0,5.1 OS:Linux (Linux 2.6.9-34.ELsmp #1 )
Assigned to: Jonas Oreland CPU Architecture:Any

[21 Jul 2006 12:44] Lars Torstensson
Description:
We have a 6 node 3 replicas cluster.

I stopt one (node 5) of the 6 data nodes from the mgmt_client. 
If I restart the data node, the restart will fail and cause an other (node 6)data node to crach as well.

Mgmt-client output:
ndb_mgm> all status
Node 1: started (Version 5.0.22)
Node 2: started (Version 5.0.22)
Node 3: started (Version 5.0.22)
Node 4: started (Version 5.0.22)
Node 5: started (Version 5.0.22)
Node 6: started (Version 5.0.22)

ndb_mgm> 5 stop
Node 5: Node shutdown initiated
Node 5 has shutdown.

ndb_mgm> Node 5: Node shutdown completed.
Node 6: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: \'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node\'.
Node 5: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2308: \'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node\'.

Cluster log.
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 1: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 2: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 3: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 4: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 6: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 1 completed 
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 2 completed (node restart)
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 3 completed (node restart)
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Receive arbitrator node 7 [ticket=39e8000186cc0ea3]
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 4 completed (node restart)
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 6 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 7 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 8 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 9 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 10 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 11 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 12 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 13 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 14 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 15 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 16 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Local checkpoint 1783 started. Keep GCI = 16582 oldest restorable GCI = 16592
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 7: Node 6 Connected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 6: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: \'Internal program error 
(failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node\'.
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 7: Node 5 Connected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Arbitration check won - node group majority
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6]
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2
308: \'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node\'.
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 opened

Node 5:
/var/db/6-nodes/log/ndb_5_error.log
Time: Friday 21 July 2006 - 13:59:32
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 6 disconected
Error object: QMGR (Line: 2481) 0x0000000a
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 14876
Trace: /var/db/6-nodes/log/ndb_5_trace.log.3
Version: Version 5.0.22
***EOM***

Node 6:
/var/db/6-nodes/log/ndb_6_error.log
Time: Friday 21 July 2006 - 13:59:31
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 3657) 0x0000000a
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 13729
Trace: /var/db/6-nodes/log/ndb_6_trace.log.4
Version: Version 5.0.22
***EOM***

How to repeat:
Restart one of the data nodes.
[21 Jul 2006 12:46] Lars Torstensson
Changed Catagory
[7 Aug 2006 14:11] Jonas Oreland
totSendLenAI was not correctly updated in case of ACC_ABORT
[8 Aug 2006 5:59] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10142

ChangeSet@1.2541, 2006-08-08 07:59:52+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#21213
    Make sure totSendlenAi is set in case of ACC_ABORTCONF and activeCreate == true
    (only needed when >2 replica)
[14 Aug 2006 13:54] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10368

ChangeSet@1.2269, 2006-08-14 15:54:21+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#21213
    5.1-fix adopt to changes in node recovery
[1 Sep 2006 8:00] Jonas Oreland
pushed to 5.1.12
[1 Sep 2006 19:28] Jonas Oreland
pushed to 5.0.25
[6 Sep 2006 7:12] Jonas Oreland
pushed into 4.1.22
[6 Sep 2006 11:05] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented in 4.1.22/5.0.25/5.1.12 changelogs.
[1 Sep 2010 7:08] Laurencin Frédéric
wit version 5.0.51a-24 lenny we still see this issue are you aware about the reason why ?