Bug #14408 ndbd (MySQL Cluster) hangs in Phase 5 during recovery under load
Submitted: 27 Oct 2005 20:50 Modified: 15 May 2008 13:54
Reporter: Tim Heath
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: 4.1.15    OS: FreeBSD (FreeBSD 5.4)
Assigned to: Assigned Account    CPU Architecture: Any

[27 Oct 2005 20:50] Tim Heath
Description:
I am seeing an ndbd "hang" during recovery when the cluster is under load. The problem is reproducible and consistently occurs at the same stage.

Configuration:
- 3-machine cluster: 2 ndbd nodes and 1 mgmd. Each server is also running an API node
- All servers: 2 x 1 GHz Pentium III
- 100 Mbit switched LAN
- MySQL Cluster v4.1.15 (FreeBSD v5.4)

Config.ini:
[NDBD DEFAULT]
NoOfReplicas= 2
DataDir= /var/db/mysql-cluster
DataMemory= 768M
IndexMemory= 256M
# Max number of recs updated simultaneously / number of ndbd nodes
MaxNoOfConcurrentOperations= 200000

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]
LogDestination= FILE:filename=/var/db/mysql-cluster/cluster.log,maxsize=1000000,maxfiles=6

[TCP DEFAULT]

[NDB_MGMD]
HostName= 10.10.10.3

[NDBD]
HostName= 10.10.10.1

[NDBD]
HostName= 10.10.10.2

[MYSQLD]
HostName= 10.10.10.1

[MYSQLD]
HostName= 10.10.10.2

[MYSQLD]
HostName= 10.10.10.3

Steps to reproduce problem:
1. Place the cluster under a fairly heavy load of 400-500 insert operations per second (a rough load-generator sketch is given after "How to repeat" below)
2. Simulate a server crash by rebooting one of the ndbd nodes
3. The cluster continues to operate fine with the remaining ndbd node
4. The failed node begins to rejoin the cluster, completes Phase 4, and begins Phase 5:

2005-10-27 15:07:46 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2005-10-27 15:07:46 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 3 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 4 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 5 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 6 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 7 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 8 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 9 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 10 activated

5. At this point network traffic increases (as viewed through netstat) as the recovering node is refreshed from the primary node.
6. After a few seconds the refresh stops (again, as viewed through netstat). There is no error message in any of the logs, although the following messages show up in the "ndb_x_out.log" file on the recovering node:

NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH

I also see hundreds of TIME_WAIT TCP connections to port 2202 on the recovering node, most of them originating from the node itself.
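
A quick way to count these connections is sketched below in Python; port 2202 is taken from the observation above, and the dot-separated "address.port" form is how FreeBSD's netstat prints endpoints (the colon form covers Linux). This is only an illustrative helper, not part of the original report.

# Rough sketch: count TIME_WAIT connections involving the NDB transporter port.
import subprocess

out = subprocess.check_output(["netstat", "-an"]).decode()
time_waits = [line for line in out.splitlines()
              if "TIME_WAIT" in line and (".2202 " in line or ":2202 " in line)]
print("TIME_WAIT connections involving port 2202: %d" % len(time_waits))
for line in time_waits[:10]:      # print a small sample
    print(line)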

ndbd --initial does not help. Once the insert operations are stopped and there is no load, the cluster can be shut down, restarted, and the failed node comes up OK with a manual "ndbd" command.

The same scenario with a lighter load (200-300 updates per second) works fine.

Any ideas would be appreciated!

Thanks and regards,

Tim Heath

How to repeat:
Please reference description
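
For illustration, a minimal Python sketch of the kind of insert load described in step 1 follows. The table name, schema, credentials, and target SQL node are assumptions made here for illustration; they are not part of the original report.

# Hypothetical load generator: roughly 400-500 single-row INSERTs per second
# against one of the SQL/API nodes.
import time
import mysql.connector   # assumes the mysql-connector-python package is installed

TARGET_OPS_PER_SEC = 450         # middle of the 400-500 ops/sec range reported
SQL_NODE = "10.10.10.1"          # one of the API nodes from config.ini

conn = mysql.connector.connect(host=SQL_NODE, user="test", password="test",
                               database="test", autocommit=True)
cur = conn.cursor()

# ENGINE=NDBCLUSTER is what places the table data in the ndbd nodes.
cur.execute("""CREATE TABLE IF NOT EXISTS load_test (
                   id INT AUTO_INCREMENT PRIMARY KEY,
                   payload VARCHAR(255)
               ) ENGINE=NDBCLUSTER""")

total = 0
while True:
    start = time.time()
    # Issue one second's worth of inserts, then sleep off any remaining time,
    # which keeps the rate near TARGET_OPS_PER_SEC on an otherwise idle client.
    for _ in range(TARGET_OPS_PER_SEC):
        cur.execute("INSERT INTO load_test (payload) VALUES (%s)", ("x" * 100,))
        total += 1
    elapsed = time.time() - start
    if elapsed < 1.0:
        time.sleep(1.0 - elapsed)
    print("inserted %d rows so far" % total)

While this runs against one of the API nodes, rebooting one of the ndbd hosts (step 2) should exercise the same node-restart path.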
[14 Dec 2005 6:39] Jonas Oreland
Hi,

This could be Bug #15685.

The attached patch is quite big, as it contains a test as well.
But the bug fix itself is the part below.

This has been pushed into 4.1.17 (I don't know when it will come out).

Can you try the patch, and see if it helps?

/Jonas

--- 1.60/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	2005-10-21 06:41:46 +02:00
+++ 1.61/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	2005-12-12 17:19:02 +01:00
@@ -6129,7 +6129,6 @@
 	<< " - place: " << c_apiConTimer_line[apiConnectptr.i]);
   switch (apiConnectptr.p->apiConnectstate) {
   case CS_STARTED:
-    ndbrequire(c_apiConTimer_line[apiConnectptr.i] != 3615);
     if(apiConnectptr.p->lqhkeyreqrec == apiConnectptr.p->lqhkeyconfrec){
       jam();
       /*
@@ -6389,8 +6388,8 @@
 	warningEvent(buf);
 	ndbout_c(buf);
 	ndbrequire(false);
+	releaseAbortResources(signal);
       }
-      releaseAbortResources(signal);
       return;
     }//if
     TloopCount++;
[17 Mar 2006 10:19] Jonas Oreland
Another question: did you ever get error code 4012?
[8 Apr 2006 11:40] Jonas Oreland
Since we have been able to reproduce this, could you try a newer 4.1 version? There have been a few bug fixes that could match your problem.
[17 Apr 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[4 Jan 2007 19:32] Matthew Boehm
Hello,
 I am reopening this bug because I am running 5.0.27 and am experiencing the same issue. I am running the cluster in the same config (1 mgm, 2 nodes also running api). Node 2 crashed a few days ago (unable to contact other nodes), and when I tried to bring it back, it hung in phase 5. I killed the processes on that machine, removed all ndb_X_* files and directories from /var/lib/mysql/, and started back up with --initial, and it still gets hung in phase 5.

 I too am seeing TIME_WAIT connections in netstat from the node itself, over 1300 of them.

What info do you need from me? Right now Node 1 is up, so the cluster is working in production.
[15 Apr 2008 13:54] Susanne Ebrecht
All reporters,

We have fixed lots of bugs.
Please test with a newer version of MySQL (on FreeBSD as well) and let us know if you still get this problem.
[15 May 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".