Bug #21535 3r: The failure handling of node 1 ends up in an infinite loop.
Submitted: 9 Aug 2006 10:58 Modified: 2 Nov 2006 8:38
Reporter: Lars Torstensson Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:MySQL cluster 5.0.22 OS:Linux (Redhat Linux 2.6.9-34.ELsmp #1 )
Assigned to: Jonas Oreland

[9 Aug 2006 10:58] Lars Torstensson
Description:
We have a 6 node 3 replicas cluster.

When I stop one (node 2) of the 6 data nodes from the mgmt_client, then node 1 crashes. Then the failure handling of node 1 ends up in an infinite loop.

"Failure handling of node 1 has not completed in 74 min"

Cluster log:
Aug  7 16:31:22 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node shutdown initiated
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 7: Node 1 Connected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 1 Disconnected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Node 1 Disconnected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Node 1 Disconnected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Node 1 Disconnected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Node 1 Disconnected
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Communication to Node 1 closed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Communication to Node 1 closed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Communication to Node 1 closed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Communication to Node 1 closed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Communication to Node 1 closed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Arbitration check won - node group majority
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: President restarts arbitration thread [state=6]
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: GCP Take over started
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: GCP Take over completed
Aug  7 16:31:29 nl2-db4 NDB[7962]: [MgmSrvr] Node 1: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing 
error message, please report a bug). Temporary error, restart node'.
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 7: Node 2 Connected
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Node 2 Disconnected
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Node 2 Disconnected
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Node 2 Disconnected
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Node 2 Disconnected
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Communication to Node 2 closed
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Communication to Node 2 closed
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Communication to Node 2 closed
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Communication to Node 2 closed
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Arbitration check won - node group majority
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: President restarts arbitration thread [state=6]
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: GCP Take over started
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: GCP Take over completed
Aug  7 16:31:32 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node shutdown completed.
Aug  7 16:31:52 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: LCP Take over started
Aug  7 16:31:52 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: ParticipatingDIH = 0000000000000078

Aug  7 16:31:52 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Node 6: queued: table=13 fragment=1 replica=185
Aug  7 16:31:52 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Communication to Node 2 opened
Aug  7 16:31:52 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Communication to Node 2 opened
Aug  7 16:31:53 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Communication to Node 2 opened
Aug  7 16:31:53 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Communication to Node 2 opened
Aug  7 16:32:06 nl2-db4 NDB[7962]: [MgmSrvr] Mgmt server state: nodeid 2 reserved for ip 192.168.171.5, m_reserved_nodes 0000000000000084.
Aug  7 16:32:15 nl2-db4 NDB[7962]: [MgmSrvr] Node 7: Node 2 Connected
Aug  7 16:32:16 nl2-db4 NDB[7962]: [MgmSrvr] Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000080.
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Node 2 Connected
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 3 Connected
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 4 Connected
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 5 Connected
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 6 Connected
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: CM_REGCONF president = 3, own Node = 2, our dynamic id = 15
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Node 2: API version 5.0.22
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Node 2 Connected

Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 6: API version 5.0.22
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Start phase 1 completed 
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Start phase 2 completed (node restart)
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Receive arbitrator node 7 [ticket=2cd90006e8ee565e]
Aug  7 16:32:17 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Start phase 3 completed (node restart)
Aug  7 16:32:18 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Start phase 4 completed (node restart)
Aug  7 16:32:22 nl2-db4 NDB[7962]: [MgmSrvr] Node 2: Node 54 Connected
Aug  7 16:32:27 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Failure handling of node 1 has not completed in 1 min. - state = 3
Aug  7 16:32:28 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Failure handling of node 1 has not completed in 1 min. - state = 3
Aug  7 16:32:28 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Failure handling of node 1 has not completed in 1 min. - state = 3
Aug  7 16:32:28 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Failure handling of node 1 has not completed in 1 min. - state = 3

Aug  7 17:44:36 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Local checkpoint 15192 started. Keep GCI = 160625 oldest restorable GCI = 160631
Aug  7 17:44:54 nl2-db4 NDB[7962]: [MgmSrvr] Node 4: Failure handling of node 1 has not completed in 73 min. - state = 3
Aug  7 17:44:54 nl2-db4 NDB[7962]: [MgmSrvr] Node 5: Failure handling of node 1 has not completed in 73 min. - state = 3
Aug  7 17:44:54 nl2-db4 NDB[7962]: [MgmSrvr] Node 6: Failure handling of node 1 has not completed in 73 min. - state = 3
Aug  7 17:44:54 nl2-db4 NDB[7962]: [MgmSrvr] Node 3: Failure handling of node 1 has not completed in 73 min. - state = 3

ndb_1_error.log:
Time: Monday 7 August 2006 - 16:31:28
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 2421) 0x0000000a
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 11481
Trace: /var/db/6-nodes/log/ndb_1_trace.log.5
Version: Version 5.0.22

How to repeat:
Some times when i stop a node from the mgmt client (1 out of 5)
[9 Aug 2006 10:58] Lars Torstensson
Category
[9 Aug 2006 11:19] Lars Torstensson
I have uploaded (ftp) core and trace files. "bug-data-21535.tar.gz"
[15 Aug 2006 6:59] Jonas Oreland
see also bug#21481
[15 Aug 2006 7:01] Jonas Oreland
During fragment distkey is only updated on primary replica + starting replica
  during node recovery.

This leads to that backup 2 (and 3) can get very scewed dist-key, which will
  be a big problem(crash) if they become primary (due to node failure/shutdown)
[14 Sep 2006 13:37] Jonas Oreland
Will impl. fix 
But reported case with master node failure during node recovery as separate bug
  (and 3 replica)

http://bugs.mysql.com/22346
[14 Sep 2006 20:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11969

ChangeSet@1.2247, 2006-09-14 22:01:26+02:00, jonas@perch.ndb.mysql.com +8 -0
  ndb - bug#21535
    send new fragdistkeys to all replicas during node recovery
    to make sure that not 3rd or 4th replicas fragDistKey becomes out of sync
[14 Sep 2006 20:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11970

ChangeSet@1.2248, 2006-09-14 22:05:35+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#21535
    remove accidently left debug code
[15 Sep 2006 9:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11998

ChangeSet@1.2249, 2006-09-15 11:18:17+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#21535
    review, Fix also master being "old" version
[10 Oct 2006 18:38] Jonas Oreland
pushed into 5.1.12
[27 Oct 2006 7:47] Lars Torstensson
I have managed to reproduce this bug in a 5.0.27 release built from the bk://mysql.bkbits.net/mysql-5.0-ndb repo 2006-10-26

I have a 2 replica 4 node cluster. 

Reproduce: 
kill -9 on ndbd node 1

I have uploaded the core/trace/cluster log.

BK ChangeLog: 
~/mysql-5.0_s3$ grep 21535 -A 5 -B 1 BK/ChangeLog 
2006/09/15 11:18:17+02:00 jonas@perch.ndb.mysql.com 
  ndb - bug#21535
    review, Fix also master being "old" version

ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
  2006/09/15 11:18:15+02:00 jonas@perch.ndb.mysql.com +22 -17
  Fix also master being "old" version
--
  2006/09/14 22:05:35+02:00 jonas@perch.ndb.mysql.com 
  ndb - bug#21535
    remove accidently left debug code

ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
  2006/09/14 22:05:34+02:00 jonas@perch.ndb.mysql.com +0 -11
  remove accidently left debug code
--
  2006/09/14 22:01:26+02:00 jonas@perch.ndb.mysql.com 
  ndb - bug#21535
    send new fragdistkeys to all replicas during node recovery
    to make sure that not 3rd or 4th replicas fragDistKey becomes out of sync

ndb/include/kernel/GlobalSignalNumbers.h
  2006/09/14 22:01:24+02:00 jonas@perch.ndb.mysql.com +1 -0
[27 Oct 2006 7:50] Lars Torstensson
File name bug-21535-2006-10-26.tar.gz
[1 Nov 2006 14:43] Jonas Oreland
pushed into 5.0.29
[2 Nov 2006 8:38] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix for 5.0.29 & 5.1.12.
[4 Nov 2006 3:22] Jon Stephens
*Fix for 5.0 documented in 5.0.30 Release Notes.*