Bug #14408 ndbd (MySQL Cluster) hangs in Phase 5 during recovery under load
Submitted: 27 Oct 2005 20:50 Modified: 15 May 2008 13:54
Reporter: Tim Heath
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: 4.1.15    OS: FreeBSD (FreeBSD 5.4)
Assigned to: Assigned Account    CPU Architecture: Any

[27 Oct 2005 20:50] Tim Heath
Description:
I am seeing an ndbd "hang" during recovery when the cluster is under load. The problem is reproducible and consistently occurs at the same stage.

Configuration:
- 3-machine cluster: 2 ndbd nodes and 1 mgmd. Each server is also running an API node
- All servers: 2 x 1 GHz Pentium III
- 100 Mbit switched LAN
- MySQL Cluster v4.1.15 (FreeBSD v5.4)

Config.ini:
[NDBD DEFAULT]
NoOfReplicas= 2
DataDir= /var/db/mysql-cluster
DataMemory= 768M
IndexMemory= 256M
# Max number of recs updated simultaneously / number of ndbd nodes
MaxNoOfConcurrentOperations= 200000

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]
LogDestination= FILE:filename=/var/db/mysql-cluster/cluster.log,maxsize=1000000,maxfiles=6

[TCP DEFAULT]

[NDB_MGMD]
HostName= 10.10.10.3

[NDBD]
HostName= 10.10.10.1

[NDBD]
HostName= 10.10.10.2

[MYSQLD]
HostName= 10.10.10.1

[MYSQLD]
HostName= 10.10.10.2

[MYSQLD]
HostName= 10.10.10.3

Steps to reproduce problem:
1. Place the cluster under a fairly heavy load of 400-500 insert operations per second (a rough load-generator sketch is given after "How to repeat" below)
2. Simulate a server crash by rebooting one of the ndbd nodes
3. The cluster continues to operate fine with the remaining ndbd node
4. The failed node begins to rejoin the cluster, completes Phase 4, and begins Phase 5:

2005-10-27 15:07:46 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2005-10-27 15:07:46 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 3 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 4 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 5 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 6 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 7 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 8 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 9 activated
2005-10-27 15:08:03 [MgmSrvr] INFO     -- Node 2: DICT: index 10 activated

5. At this point network traffic increases (as viewed through netstat) as the recovering node is refreshed from the primary node.
6. After a few seconds the refresh stops (again, as viewed through netstat). There is no error message in any of the logs, although the following messages show up in the "ndb_x_out.log" file on the recovering node:

NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH
NR: setLcpActiveStatusEnd - !m_participatingLQH

I also see hundreds of TIME_WAIT TCP connections to port 2202 on the recovering node, most of them originating from the node itself.
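
A quick way to count these connections is sketched below in Python; port 2202 is taken from the observation above, and the dot-separated "address.port" form is how FreeBSD's netstat prints endpoints (the colon form covers Linux). This is only an illustrative helper, not part of the original report.

# Rough sketch: count TIME_WAIT connections involving the NDB transporter port.
import subprocess

out = subprocess.check_output(["netstat", "-an"]).decode()
time_waits = [line for line in out.splitlines()
              if "TIME_WAIT" in line and (".2202 " in line or ":2202 " in line)]
print("TIME_WAIT connections involving port 2202: %d" % len(time_waits))
for line in time_waits[:10]:      # print a small sample
    print(line)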

ndbd --initial does not help. Once the insert operations are stopped and there is no load, the cluster can be shut down, restarted, and the failed node comes up OK with a manual "ndbd" command.

The same scenario with a lighter load (200-300 updates per second) works fine.

Any ideas would be appreciated!

Thanks and regards,

Tim Heath

How to repeat:
Please reference description
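
For illustration, a minimal Python sketch of the kind of insert load described in step 1 follows. The table name, schema, credentials, and target SQL node are assumptions made here for illustration; they are not part of the original report.

# Hypothetical load generator: roughly 400-500 single-row INSERTs per second
# against one of the SQL/API nodes.
import time
import mysql.connector   # assumes the mysql-connector-python package is installed

TARGET_OPS_PER_SEC = 450         # middle of the 400-500 ops/sec range reported
SQL_NODE = "10.10.10.1"          # one of the API nodes from config.ini

conn = mysql.connector.connect(host=SQL_NODE, user="test", password="test",
                               database="test", autocommit=True)
cur = conn.cursor()

# ENGINE=NDBCLUSTER is what places the table data in the ndbd nodes.
cur.execute("""CREATE TABLE IF NOT EXISTS load_test (
                   id INT AUTO_INCREMENT PRIMARY KEY,
                   payload VARCHAR(255)
               ) ENGINE=NDBCLUSTER""")

total = 0
while True:
    start = time.time()
    # Issue one second's worth of inserts, then sleep off any remaining time,
    # which keeps the rate near TARGET_OPS_PER_SEC on an otherwise idle client.
    for _ in range(TARGET_OPS_PER_SEC):
        cur.execute("INSERT INTO load_test (payload) VALUES (%s)", ("x" * 100,))
        total += 1
    elapsed = time.time() - start
    if elapsed < 1.0:
        time.sleep(1.0 - elapsed)
    print("inserted %d rows so far" % total)

While this runs against one of the API nodes, rebooting one of the ndbd hosts (step 2) should exercise the same node-restart path.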
[14 Dec 2005 6:39] Jonas Oreland
Hi,

This could be Bug #15685.

The attached patch is quite big, as it contains a test as well.
But the bug fix itself is the part below.

This has been pushed into 4.1.17 (I don't know when it will come out).

Can you try the patch, and see if it helps?

/Jonas

--- 1.60/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	2005-10-21 06:41:46 +02:00
+++ 1.61/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp	2005-12-12 17:19:02 +01:00
@@ -6129,7 +6129,6 @@
 	<< " - place: " << c_apiConTimer_line[apiConnectptr.i]);
   switch (apiConnectptr.p->apiConnectstate) {
   case CS_STARTED:
-    ndbrequire(c_apiConTimer_line[apiConnectptr.i] != 3615);
     if(apiConnectptr.p->lqhkeyreqrec == apiConnectptr.p->lqhkeyconfrec){
       jam();
       /*
@@ -6389,8 +6388,8 @@
 	warningEvent(buf);
 	ndbout_c(buf);
 	ndbrequire(false);
+	releaseAbortResources(signal);
       }
-      releaseAbortResources(signal);
       return;
     }//if
     TloopCount++;
[17 Mar 2006 10:19] Jonas Oreland
Another question: did you ever get error code 4012?
[8 Apr 2006 11:40] Jonas Oreland
Since we have been able to reproduce this, could you try a newer 4.1 version? There have been a few bug fixes that could match your problem.
[17 Apr 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[4 Jan 2007 19:32] Matthew Boehm
Hello,
 I am reopening this bug because I am running 5.0.27 and am experiencing the same issue. I am running the cluster in the same config (1 mgm, 2 nodes also running api). Node 2 crashed a few days ago (unable to contact other nodes), and when I tried to bring it back, it hung in phase 5. I killed the processes on that machine, removed all ndb_X_* files and directories from /var/lib/mysql/, and started back up with --initial, and it still gets hung in phase 5.

 I too am seeing TIME_WAIT connections in netstat from the node itself, over 1300 of them.

What info do you need from me? Right now Node 1 is up, so the cluster is working in production.
[15 Apr 2008 13:54] Susanne Ebrecht
All reporters,

We have fixed lots of bugs.
Please test with a newer version of MySQL (on FreeBSD as well) and let us know if you still get this problem.
[15 May 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".