MySQL Bugs: #13461: Slave Cluster crashed on restart of two data nodes in separate groups

Bug #13461: Slave Cluster crashed on restart of two data nodes in separate groups
Submitted: 24 Sep 2005 15:22
Modified: 14 Oct 2005 8:28
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1 4.1? 5.0?
OS: Linux (Linux)
Assigned to: Tomas Ulin
CPU Architecture: Any

Description:
The slave cluster ran out of data memory during stress testing. The error message returned was clear and pointed directly at the problem:

050924  0:10:37 [ERROR] Slave: Error in Write_rows event: error during transaction execution on table atae.dcacache, Error_code: 135
050924  0:10:37 [Warning] Slave: Got error 827 'Out of memory in Ndb Kernel, table data (increase DataMemory)' from NDB Error_code: 1296

Based on this error message, I increased the data memory in the slave config.ini and restarted the manager.

Once the manager was up, I restarted a data node from group 0. Once that data node was in phase 4, I restarted a data node from group 1, leaving one started data node in each group.
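
For readers less familiar with node groups: an NDB cluster can only keep running while every node group has at least one started data node, so the sequence above leaves the cluster with no redundancy but should not by itself take it down. The following is a minimal sketch of that rule, assuming a hypothetical layout of four data nodes (IDs 1 to 4) in two groups. The node IDs and the function name are illustrative only and are not taken from this cluster's config.ini.

```python
def cluster_is_viable(node_groups, started):
    """Return True if every node group still has at least one started node."""
    return all(
        any(node in started for node in members)
        for members in node_groups.values()
    )


# Hypothetical layout: two node groups with two data nodes each.
node_groups = {0: [1, 2], 1: [3, 4]}

# Node 1 (group 0) and node 3 (group 1) are restarting, so only
# nodes 2 and 4 count as started. Each group still has one node.
print(cluster_is_viable(node_groups, started={2, 4}))

# If node 4 also failed, group 1 would have no started node left.
print(cluster_is_viable(node_groups, started={2}))
```

A node that is still in a start phase is deliberately not counted as started here, which matches the situation described above where one node per group was mid-restart.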

The cluster went down except for the last data node that I had started, which was still in a starting mode. The error messages and trace files from the data nodes that crashed are as follows:

Date/Time: Saturday 24 September 2005 - 17:02:55
Type of error: error
Message: Node failed during system restart
Fault ID: 2308
Problem data: Unhandled node failure during restart
Object of reference: NDBCNTR (Line: 1417) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 9151
TraceFile: /space/run/ndb_5_trace.log.1
Version 5.1.2 (a_drop5p4)

--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 5, r.sigId: 10106253 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 252 "QMGR", s.proc: 5, s.sigId: 10106221 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000006 H'00000001 H'00000100 H'00000000
--------------- Signal ----------------

Date/Time: Saturday 24 September 2005 - 17:02:56
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 4009) 0x0000000e
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 18413
TraceFile: /space/run/ndb_6_trace.log.1
Version 5.1.2 (a_drop5p4)

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 6, r.sigId: 59166609 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 251 "NDBCNTR", s.proc: 6, s.sigId: 59166606 length: 5 trace: 8 #sec: 0 fragInf: 0
 H'00000004 H'00000006 H'00000001 H'00000020 H'00000000
--------------- Signal ----------------
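
When comparing several of these error-log entries, it can help to pull out the header fields programmatically. The following is a minimal sketch, assuming each entry consists of `Key: value` lines followed by a dashed signal separator, as in the excerpts above. The function name `parse_error_header` is hypothetical and this is not part of any NDB tooling.

```python
import re


def parse_error_header(text):
    """Collect 'Key: value' lines up to the first dashed separator line."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("---"):
            break  # the signal dump starts here; stop reading header fields
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (e.g. the Version line) are skipped
            fields[key.strip()] = value.strip()
    return fields


sample = """Date/Time: Saturday 24 September 2005 - 17:02:56
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 4009) 0x0000000e
Version 5.1.2 (a_drop5p4)
--------------- Signal ----------------
"""

entry = parse_error_header(sample)

# Split "DBDIH (Line: 4009) 0x0000000e" into kernel block name and line number.
block, line_no = re.match(
    r"(\w+) \(Line: (\d+)\)", entry["Object of reference"]
).groups()

print(entry["Fault ID"], block, line_no)  # prints: 2341 DBDIH 4009
```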

I then tried to restart the failed nodes, since one was still in "starting" condition. I restarted one data node on the system that had the one in "starting" condition, and restarted two data nodes on the other system. All the data nodes came down again with the following:

Date/Time: Saturday 24 September 2005 - 17:04:57
Type of error: error
Message: System error
Fault ID: 2303
Problem data: Unable to find restorable replica for table: 0 fragment: 0 gci: 31113
Object of reference: DBDIH (Line: 8744) 0x0000000a
ProgramName: /home/ndbdev/jmiller/builds/libexec/ndbd
ProcessID: 9202
TraceFile: /space/run/ndb_8_trace.log.1
Version 5.1.2 (a_drop5p4)
***EOM***

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 8, r.sigId: 1647749 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 8, s.sigId: 1647748 length: 3 trace: 0 #sec: 0 fragInf: 0
 Start fragment: Table: 0 Fragment: 0
--------------- Signal ----------------

How to repeat:
See above

1) It is not supported to "take down" a node during NR.
Currently this _should_ crash the starting node.

2) Regarding the "unable to find" error, is this reproducible?
If so, how?

1) It is not supported to "take down" a node during NR.
Currently this _should_ crash the starting node.

> That is a bug. If it is not supported, then block me from doing so, or queue it up and run it afterwards, but crashing the starting node and the cluster is not acceptable.

2) Regarding the "unable to find" error, is this reproducible?
If so, how?

> Not 100% sure if it is reproducible, but all the files you need from it are on ndb10, ndb11, and ndb12.

> That is a bug. If it is not supported, then block me from doing so, or queue
> it up and run it afterwards, but crashing the starting node and the cluster is not
> acceptable.

Do you mean from ndb_mgm? Then it's a bug there.
Please report it separately if you can repeat it.

Otherwise it is hard to block.
I can _never_ block you from kill -9 or physically unplugging a cable.

BTW: The cluster shouldn't fail.
Is this reproducible?

2) Regarding the "unable to find" error, is this reproducible?
If so, how?

>> Not 100% sure if it is reproducible, but all the files you need from it
>> are on ndb10, ndb11, and ndb12.

Can you please try to reproduce the test case?

> Yes, the ndb_mgmd allowed me to issue the restart. Not sure why I need to open a different bug report as this is the bug report that I have open for it.

> Yes, this is reproducible: get yourself a large database, restart a data node, and when it enters phase 4, restart another data node in the other group.

1) bug if ndb_mgm allowed you to stop a node while another was restarting
2) bug if cluster fails during this (unless in different node groups)
3) bug if it then fails to perform SR.

But I interpret your replies as: only 1) is relevant (and reproducible?)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/31021

pushed into 5.0.15 only

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented fix in 5.0.15 changelog.