Bug #8051 Cannot restart ndbd with large table (20 000 000 rows)
Submitted: 20 Jan 2005 16:40
Modified: 15 Mar 2005 14:55
Reporter: Chris Kennedy
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 4.1.8/4.1.9
OS: HP/UX (HP-UX 11i11 PARISC 64bit)
Assigned to: Assigned Account
CPU Architecture: Any

[20 Jan 2005 16:40] Chris Kennedy
Description:
Have a cluster with 4 data nodes, 2 replicas running on two systems.

If the database includes a large table (20 000 000 rows), ndbd nodes fail during restart:

Date/Time: Thursday 20 January 2005 - 16:03:04
Type of error: error
Message: System error
Fault ID: 2303
Problem data: Node 1 killed this node because it could not copy a fragment during node restart
Object of reference: NDBCNTR (Line: 179) 0x0000000a
ProgramName: ndbd
ProcessID: 8812
TraceFile: /cktemp/mysql-cluster/DN3of4/ndb_3_trace.log.5
***EOM***

Note: ndb_3_trace.log.5 is 1717820 bytes, which I can't reproduce here. However, it did contain:
--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 3, r.sigId: 43233420 gsn: 395 "SYSTEM_ERROR" prio:1
s.bn: 246 "DBDIH", s.proc: 1, s.sigId: 100504 length: 4 trace: 2 #sec: 0 fragInf: 0
errorRef: H'00f60001
errorCode: 5
data1: H'00000003
data2: H'00000009
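
The errorRef value in the signal above can be cross-checked against the sender fields. This is a minimal sketch, assuming an NDB block reference packs the block number into the upper 16 bits and the node id into the lower 16 bits; that assumption is consistent with the s.bn and s.proc fields in this dump. The helper names are hypothetical and not part of any NDB tool.

```python
def parse_hex(text):
    """Parse the H'xxxxxxxx notation used in ndbd trace files."""
    return int(text.replace("H'", ""), 16)


def split_block_ref(ref):
    """Split a 32-bit block reference into (block number, node id).

    Assumption: upper 16 bits hold the block number, lower 16 bits the node id.
    """
    return ref >> 16, ref & 0xFFFF


block, node = split_block_ref(parse_hex("H'00f60001"))
print(block, node)  # 246 1, matching s.bn: 246 "DBDIH" and s.proc: 1
```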

The cluster log contains:

2005-01-20 15:55:06 [MgmSrvr] ALERT    -- Node 9: Node 3 Disconnected
2005-01-20 15:55:06 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-01-20 15:55:06 [MgmSrvr] INFO     -- Node 1: Communication to Node 3 closed
2005-01-20 15:55:06 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2005-01-20 15:55:06 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2005-01-20 15:55:06 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2005-01-20 15:55:06 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2005-01-20 15:55:06 [MgmSrvr] ALERT    -- Node 1: Arbitration check won - node group majority
2005-01-20 15:55:06 [MgmSrvr] INFO     -- Node 1: President restarts arbitration thread [state=6]
2005-01-20 15:55:08 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000100216.
2005-01-20 15:55:10 [MgmSrvr] INFO     -- Node 1: Communication to Node 3 opened
2005-01-20 15:55:10 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2005-01-20 15:55:10 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
2005-01-20 15:55:53 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 10.15.0.165, m_reserved_nodes 000000000010021e.
2005-01-20 15:55:54 [MgmSrvr] INFO     -- Node 9: Node 3 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 1 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 1, own Node = 3, our dynamic id = 5
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 1: Node 3: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 2: Node 3: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 4: Node 3: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 1: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 2: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Node 4: API version 4.1.9
2005-01-20 15:55:55 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (noderestart)
2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Receive arbitrator node 9 [ticket=2189000290a9772d]
2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (noderestart)
2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (noderestart)
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 3 activated
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 4 activated
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 5 activated
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 7 activated
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 8 activated
2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 9 activated
2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-01-20 16:03:04 [MgmSrvr] INFO     -- Node 1: Communication to Node 3 closed
2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 9: Node 3 Disconnected
2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2005-01-20 16:03:04 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2005-01-20 16:03:04 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 1: Arbitration check won - node group majority
2005-01-20 16:03:04 [MgmSrvr] INFO     -- Node 1: President restarts arbitration thread [state=6]
2005-01-20 16:03:05 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000100216.
2005-01-20 16:03:07 [MgmSrvr] INFO     -- Node 1: Communication to Node 3 opened
2005-01-20 16:03:07 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
2005-01-20 16:03:08 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
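
To see how long node 3 survived in the restart before being killed, the cluster log timestamps can be compared. The following is a rough sketch that parses lines in the format shown above and measures the gap between two events; the function names and the three-line sample are illustrative only, and the regular expression assumes every entry comes from [MgmSrvr].

```python
import re
from datetime import datetime

# Matches lines such as:
# 2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (noderestart)
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] (\w+)\s+-- (.*)$"
)


def parse(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    stamp, level, message = m.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), level, message


def seconds_between(lines, start_text, end_text):
    """Seconds from the first message containing start_text to the next one containing end_text."""
    start = None
    for line in lines:
        event = parse(line)
        if event is None:
            continue
        when, _level, message = event
        if start is None:
            if start_text in message:
                start = when
        elif end_text in message:
            return (when - start).total_seconds()
    return None


sample = [
    "2005-01-20 15:55:56 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (noderestart)",
    "2005-01-20 15:55:59 [MgmSrvr] INFO     -- Node 3: DICT: index 9 activated",
    "2005-01-20 16:03:04 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected",
]
print(seconds_between(sample, "Start phase 4 completed", "Node 3 Disconnected"))  # 428.0
```

In this log, roughly seven minutes pass between the end of start phase 4 and the disconnect.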

How to repeat:
Create a table:
CREATE TABLE numbers
(
  number BIGINT UNSIGNED,
  sep SMALLINT UNSIGNED,
  PRIMARY KEY( number ),
  UNIQUE( number )
) MIN_ROWS=15000000 MAX_ROWS=25000000 ENGINE=ndbcluster DEFAULT CHARSET=latin1;

Populate it with 20 000 000 rows.
In ndb_mgm:
3 stop

When stopped, restart node 3.
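
The report does not say how the table was populated. One possible approach, sketched here, is to generate batched multi-row INSERT statements and pipe them into the mysql client. The `sep` values (`n % 10`) and the batch size are arbitrary placeholders chosen for illustration; they are not taken from the original data set.

```python
def insert_statements(total_rows, batch_size=1000, table="numbers"):
    """Yield multi-row INSERT statements covering number = 0 .. total_rows - 1.

    The sep column is filled with n % 10, a placeholder value (assumption).
    """
    for start in range(0, total_rows, batch_size):
        stop = min(start + batch_size, total_rows)
        values = ", ".join("(%d, %d)" % (n, n % 10) for n in range(start, stop))
        yield "INSERT INTO %s (number, sep) VALUES %s;" % (table, values)


# Small demonstration; for the real test use insert_statements(20000000).
for statement in insert_statements(5, batch_size=2):
    print(statement)
```

Batching keeps each statement small, so a failure partway through loses at most one batch.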
[20 Jan 2005 17:07] Jonas Oreland
Could you please supply more of the trace file.
Maybe by uploading it.

/Jonas
[20 Jan 2005 17:13] Chris Kennedy
I have tried 3 times to attach the file, but it fails. I will try to attach it by another route.
[20 Jan 2005 18:41] Jonas Oreland
Can you make it available by ftp?
[21 Jan 2005 7:39] Jonas Oreland
Hi,

Thanks for the trace file.
Unfortunately... it did not contain the error code that I was looking for.
I had gotten lost in the error reporting...

I made a patch so that the error code will show...
I would appreciate if you could apply the patch.
Try again, and ship the new trace file.

/Jonas

<patch>
--- 1.17/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp	Fri Dec 17 10:32:21 2004
+++ 1.18/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp	Fri Jan 21 07:53:00 2005
@@ -2976,6 +2976,8 @@
   SystemError * const sysErr = (SystemError*)&signal->theData[0];
   sysErr->errorCode = SystemError::CopyFragRefError;
   sysErr->errorRef = reference();
+  sysErr->data1 = errorCode;
+  sysErr->data2 = 0;
   sendSignal(cntrRef, GSN_SYSTEM_ERROR, signal, 
 	     SystemError::SignalLength, JBB);
   return;
</patch>
[21 Jan 2005 8:24] Chris Kennedy
Unfortunately my installation is from the binary, so it may take me a while to get something working from a source build.
[2 Feb 2005 5:55] Jonas Oreland
Any news?
[10 Feb 2005 12:28] Chris Kennedy
Sorry to take so long to respond, but I did not have the hardware available to try this until today.

I have tried using the supplied 4.1.10pre version. Unfortunately my NDB API (4.1.8) clients cannot connect to it. The call to Ndb::waitUntilReady() never terminates.
[15 Feb 2005 13:56] Tomas Ulin
4.1.8 and 4.1.10 are not compatible.

You will have to recompile your application.

BR,

Tomas
[15 Feb 2005 14:55] Chris Kennedy
Is the NDB API from 4.1.9 compatible? If not, I need to get the 4.1.10 source from somewhere.
[16 Mar 2005 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".