Bug #3963 Infinite loop @ API failure during TC takeover
Submitted: 2 Jun 2004 9:31
Modified: 8 Jul 2004 8:13
Reporter: Lars Torstensson
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: mysqlcluster-4.1.2-3.4.4-alpha-pc-linux-
OS: Linux (Redhat AS)
Assigned to: Jonas Oreland
CPU Architecture: Any

[2 Jun 2004 9:31] Lars Torstensson
Description:
During a rolling upgrade:
Shut down 1 node
The API crashed
One surviving node goes to 99% CPU usage
and prints "[MgmSrvr] Node 3: Failure handling of node 22 has not completed in 1 min. - state = 2" in the cluster log

---
Continuing the upgrade (restarting all nodes) solved the problem

How to repeat:
See above
[7 Jun 2004 22:47] Jonas Oreland
ChangeSet
  1.1899 04/06/07 22:43:08 joreland@mysql.com +1 -0
  BUG#3963
  API failure during TCTAKEOVER, TCKEYFAILCONF is sent
  and since the API has failed, it will never ack the marker.
  Wait for takeover to complete, so the marker isn't removed
  in the middle of the takeover

  ndb/src/kernel/blocks/dbtc/DbtcMain.cpp
    1.4 04/06/07 22:43:07 joreland@mysql.com +13 -6
    BUG#3963
    API failure during TCTAKEOVER, TCKEYFAILCONF is sent
    and since the API has failed, it will never ack the marker.
    Wait for takeover to complete, so the marker isn't removed
    in the middle of the takeover
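
The decision described above can be modelled in a few lines. This is only a simplified sketch of the logic, not the DBTC code: `Marker`, `Action`, `FS_BUSY`, `decideBeforePatch`, `decideAfterPatch` and `roundsUntilRemoved` are hypothetical names made up for illustration, and only `FS_IDLE` comes from the patch. It assumes that each "round" stands for one retry of the marker-removal loop and that the failed API never clears its reference to the marker.

```cpp
#include <cassert>

// Simplified stand-ins. Only FS_IDLE appears in the patch;
// FS_BUSY is a hypothetical placeholder for any non-idle takeover state.
enum FailStatus { FS_IDLE, FS_BUSY };

enum class Action { RetryLater, RemoveMarker };

struct Marker {
  // Models "apiConnectPtr.p->commitAckMarker == iter.curr.i":
  // the API connect record still points at this marker.
  bool referencedByApiConnect;
};

// Old behaviour: a still-referenced marker is never removed, only retried.
Action decideBeforePatch(const Marker& m) {
  return m.referencedByApiConnect ? Action::RetryLater : Action::RemoveMarker;
}

// New behaviour: only keep retrying while the TC takeover is still running.
Action decideAfterPatch(const Marker& m, FailStatus takeover) {
  if (m.referencedByApiConnect && takeover != FS_IDLE) {
    return Action::RetryLater;
  }
  return Action::RemoveMarker;
}

// Returns the round in which the marker is removed,
// or -1 if it is still there after maxRounds retries.
int roundsUntilRemoved(bool patched, int takeoverDoneAtRound, int maxRounds) {
  assert(maxRounds >= 0);
  Marker m{true};  // the failed API never acks, so the reference never clears
  for (int round = 0; round < maxRounds; ++round) {
    FailStatus s = (round < takeoverDoneAtRound) ? FS_BUSY : FS_IDLE;
    Action a = patched ? decideAfterPatch(m, s) : decideBeforePatch(m);
    if (a == Action::RemoveMarker) {
      return round;
    }
  }
  return -1;
}
```

With a takeover that finishes at round 3, `roundsUntilRemoved(false, 3, 1000)` returns -1 (the retry never ends, which matches the 99% CPU symptom), while `roundsUntilRemoved(true, 3, 1000)` returns 3.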

# This is a BitKeeper patch.  What follows are the unified diffs for the
# set of deltas contained in the patch.  The rest of the patch, the part
# that BitKeeper cares about, is below these diffs.
# User:	joreland
# Host:	eel.hemma.oreland.se
# Root:	/home/jonas/src/mysql-4.1-ndb

--- 1.3/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	Wed May 26 10:55:43 2004
+++ 1.4/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	Mon Jun  7 22:43:07 2004
@@ -1019,12 +1019,19 @@
       ptrCheckGuard(apiConnectPtr, capiConnectFilesize, apiConnectRecord);
       if(apiConnectPtr.p->commitAckMarker == iter.curr.i){
 	jam();
-        /**
-         * The record is still active
-         *
-         * Don't remove it, but continueb instead
-         */
-        break;
+	TcFailRecordPtr node_fail_ptr;
+	node_fail_ptr.i = 0;
+	ptrAss(node_fail_ptr, tcFailRecord);
+	if(node_fail_ptr.p->failStatus != FS_IDLE) {
+	  jam();
+	  /**
+	   * The record is still active
+	   *   and TC take-over haven't completed
+	   *
+	   * Don't remove it, but continueb instead
+	   */
+	  break;
+	}
       }
       
       sendRemoveMarkers(signal, iter.curr.p);