Bug #13461 Slave Cluster crashed on restart of two data nodes in separate groups
Submitted: 24 Sep 2005 17:22 Modified: 14 Oct 2005 10:28
Reporter: Jonathan Miller
Status: Closed
Category: Server: Cluster Severity: S2 (Serious)
Version: 5.1 4.1? 5.0? OS: Linux (Linux)
Assigned to: Tomas Ulin Target Version:

[24 Sep 2005 17:22] Jonathan Miller
Description:
The slave cluster had run out of data memory during stress testing. The error message
returned was a good message leading right to the issue:

050924  0:10:37 [ERROR] Slave: Error in Write_rows event: error during transaction
execution on table atae.dcacache, Error_code: 135
050924  0:10:37 [Warning] Slave: Got error 827 'Out of memory in Ndb Kernel, table data
(increase DataMemory)' from NDB Error_code: 1296

With this error message I increased the data memory in the slave config.ini and restarted
the manager.

Once the manager was up, I restarted a data node from group 0. Once that data node was in
phase 4, I restarted a data node from group 1, leaving one data node running in each group.

The cluster went down except for the last data node that I had started, which was still in
a starting mode. The error messages and trace files from the data nodes that crashed are as
follows:

Date/Time: Saturday 24 September 2005 - 17:02:55
Type of error: error
Message: Node failed during system restart
Fault ID: 2308
Problem data: Unhandled node failure during restart
Object of reference: NDBCNTR (Line: 1417) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 9151
TraceFile: /space/run/ndb_5_trace.log.1
Version 5.1.2 (a_drop5p4)

--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 5, r.sigId: 10106253 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 252 "QMGR", s.proc: 5, s.sigId: 10106221 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000006 H'00000001 H'00000100 H'00000000
--------------- Signal ----------------

Date/Time: Saturday 24 September 2005 - 17:02:56
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 4009) 0x0000000e
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 18413
TraceFile: /space/run/ndb_6_trace.log.1
Version 5.1.2 (a_drop5p4)

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 6, r.sigId: 59166609 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 6, s.sigId: 59166606 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000004 H'00000006 H'00000001 H'00000020 H'00000000
--------------- Signal ----------------
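
To help read the two NODE_FAILREP dumps above, here is a minimal Python sketch that
decodes the data words. It assumes the signal layout is: failure number, master node id,
number of failed nodes, followed by node bitmask words. That layout is an assumption
here, not something stated in the trace, and the function names are hypothetical.

```python
def parse_words(line):
    """Turn a dump line like "H'00000003 H'00000006" into a list of ints."""
    return [int(tok[2:], 16) for tok in line.split() if tok.startswith("H'")]


def decode_node_failrep(line):
    """Decode NODE_FAILREP data words under the ASSUMED layout:
    word 0 = failure number, word 1 = master node id,
    word 2 = number of failed nodes, words 3.. = node bitmask (32 bits each).
    """
    words = parse_words(line)
    fail_no, master, count = words[:3]
    failed = [
        index * 32 + bit
        for index, word in enumerate(words[3:])
        for bit in range(32)
        if (word >> bit) & 1
    ]
    return {
        "fail_no": fail_no,
        "master": master,
        "count": count,
        "failed_nodes": failed,
    }


# Data word lines copied from the two signal dumps above.
first = decode_node_failrep("H'00000003 H'00000006 H'00000001 H'00000100 H'00000000")
second = decode_node_failrep("H'00000004 H'00000006 H'00000001 H'00000020 H'00000000")
```

If the assumed layout holds, the first dump reports node 8 as failed and the second
reports node 5, with node 6 as master in both.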

I then tried to restart the failed nodes, since one was still in "starting" condition. I
restarted one data node on the system that had the one in "starting" condition, and
restarted two data nodes on the other system. All the data nodes came down again with the
following:

Date/Time: Saturday 24 September 2005 - 17:04:57
Type of error: error
Message: System error
Fault ID: 2303
Problem data: Unable to find restorable replica for table: 0 fragment: 0 gci: 31113
Object of reference: DBDIH (Line: 8744) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 9202
TraceFile: /space/run/ndb_8_trace.log.1
Version 5.1.2 (a_drop5p4)
***EOM***

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 8, r.sigId: 1647749 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 8, s.sigId: 1647748 length: 3 trace: 0 #sec: 0 fragInf: 0
 Start fragment: Table: 0 Fragment: 0
--------------- Signal ----------------

How to repeat:
See above
[27 Sep 2005 11:33] Jonas Oreland
1) It is not supported to "take down" a node during NR (node restart).
This currently _should_ crash the starting node.

2) Regarding the "unable to find", is this reproducible?
If so, how?
[27 Sep 2005 12:45] Jonathan Miller
1) It is not supported to "take down" a node during NR.
This currently _should_ crash the starting node.

> That is a bug. If it is not supported, then block me from doing so, or queue it up and
run it afterwards, but crashing the starting node and the cluster is not acceptable.

2) Regarding the "unable to find", is this reproducible?
If so, how?

> Not 100% sure if it is reproducible, but all the files you need from it are on
ndb10, ndb11, and ndb12.
[27 Sep 2005 12:51] Jonas Oreland
> That is a bug. If it is not supported, then block me from doing so, or
> queue it up and run it afterwards, but crashing the starting node and
> the cluster is not acceptable.

Do you mean from ndb_mgm? Then it's a bug there.
Please report it separately if you can repeat it.

Otherwise it is hard to block.
I can _never_ block you from kill -9 or physically unplugging a cable.

BTW: The cluster shouldn't fail.
Is this reproducible?

2) Regarding the "unable to find", is this reproducible?
If so, how?

>> Not 100% sure if it is reproducible, but all the files you need from it
>> are on ndb10, ndb11, and ndb12.

Can you please try to reproduce the test case?
[27 Sep 2005 13:03] Jonathan Miller
> Yes, the ndb_mgmd allowed me to issue the restart. Not sure why I need to open a
different bug report, as this is the bug report that I have open for it.

> Yes, this is reproducible: get yourself a large database, restart a data node, and when
it enters phase 4, restart another data node in the other group.
[27 Sep 2005 13:17] Jonas Oreland
1) bug if ndb_mgm allowed you to stop a node while another was restarting
2) bug if cluster fails during this (unless in different node groups)
3) bug if it then fails to perform SR.

But I interpret your replies as: only 1) is relevant (and reproducible?)
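
To illustrate the kind of check point 1) asks for, here is a minimal Python sketch of a
guard that refuses a stop request while another node is still restarting. It is not the
actual ndb_mgm code; the function name, the state strings, and the node ids are all
hypothetical. The second check (keep at least one started node per node group) is an
added assumption based on how node groups work, not something taken from the patch.

```python
def can_stop(node_id, nodes):
    """Decide whether stopping node_id looks safe.

    nodes maps node id -> (node group, state), where state is one of the
    hypothetical strings "started", "starting", or "down".
    Returns (allowed, reason).
    """
    # Point 1: never stop a node while any node is still restarting.
    restarting = sorted(n for n, (_, state) in nodes.items() if state == "starting")
    if restarting:
        return False, "node(s) %s still restarting" % restarting

    # Assumption: each node group needs at least one other started node,
    # otherwise stopping this node would take the whole cluster down.
    group, _ = nodes[node_id]
    survivors = [
        n for n, (g, state) in nodes.items()
        if n != node_id and g == group and state == "started"
    ]
    if not survivors:
        return False, "no other started node in group %d" % group

    return True, "ok"


# Hypothetical four-node cluster with two node groups; node 1 is mid-restart.
cluster = {
    1: (0, "starting"),
    2: (0, "started"),
    3: (1, "started"),
    4: (1, "started"),
}
allowed, reason = can_stop(3, cluster)
```

In this hypothetical layout the request to stop node 3 is refused because node 1 is
still starting, which mirrors the sequence described in the report.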
[13 Oct 2005 14:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/31021
[13 Oct 2005 17:23] Tomas Ulin
pushed into 5.0.15 only
[14 Oct 2005 10:28] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented fix in 5.0.15 changelog.