MySQL Bugs: #22469: Forced node shutdown (Error 2311, 2308)

Bug #22469	Forced node shutdown (Error 2311, 2308)
Submitted:	19 Sep 2006 8:45	Modified:	22 Nov 2006 10:41
Reporter:	Stefan Pasel
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.24	OS:	Linux (Suse 10)
Assigned to:		CPU Architecture:	Any
Tags:	cluster

Description:
This is a follow-up to http://bugs.mysql.com/bug.php?id=21509
I'm adding this bug because I cannot attach traces to the old one. Furthermore, I think this bug is serious, as restarting the cluster without loss of data after a crash is nearly impossible.

This bug concerns the following messages:

Forced node shutdown completed. Occured during startphase X. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug).

Forced node shutdown completed. Occured during startphase X. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

[Note: Bug#21509 also mentions error 2341: 'Internal program error (failed ndbrequire)'. I've also seen this, but could not reproduce it with the bare-bones test case, so it is not included in the logs.]
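
When scanning many node logs for these messages, it can help to pull out the start phase and error code automatically. The following is a minimal sketch, assuming the log lines follow the wording quoted above (including the original "Occured" spelling); the function name `parse_forced_shutdown` is hypothetical and not part of any MySQL tool.

```python
import re

# Matches the "Forced node shutdown" line as quoted in this report.
# "Occured" is spelled the way it appears in the quoted message.
SHUTDOWN_RE = re.compile(
    r"Forced node shutdown completed\."
    r"(?: Occured during startphase (?P<phase>\w+)\.)?"
    r"(?: Initiated by signal (?P<signal>\d+)\.)?"
    r" Caused by error (?P<code>\d+): '(?P<text>.*)"
)


def parse_forced_shutdown(line):
    """Return phase, signal, error code and message text, or None if no match."""
    m = SHUTDOWN_RE.search(line)
    if m is None:
        return None
    signal = m.group("signal")
    return {
        "phase": m.group("phase"),  # kept as a string; may be None
        "signal": int(signal) if signal is not None else None,
        "error": int(m.group("code")),
        "message": m.group("text"),
    }


if __name__ == "__main__":
    sample = (
        "Forced node shutdown completed. Occured during startphase 1. "
        "Initiated by signal 0. Caused by error 2311: "
        "'Conflict when selecting restart type(Internal error, programming "
        "error or missing error message, please report a bug)."
    )
    print(parse_forced_shutdown(sample))
```

Grouping the results by error code and phase across all nodes would show whether 2311 appears on one node first and 2308 on the others.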

Our test setup: 4 machines with 4 data nodes each (1 data node per CPU), connected via Gigabit Ethernet, 1 replica (i.e. 8 node groups)

Please also note that the error does not seem to be bound to a specific start phase. In the supplied test case I had almost NO DATA inside the cluster (only the structure of 4 tables) and the cluster crashed in phase 1.
I could start the test case without the error by starting the data nodes with "ndbd -n".

After restoring the backup (~2 GB of data) and trying the "ndbd -n" workaround, the error occurred in phase 4. Without "ndbd -n" the error occurred in phase 1.
Starting a partial cluster (8 data nodes on 2 machines) and afterwards re-integrating the other 8 nodes into the partial cluster seems to work fine.

I have been looking into possible network outages, as my guess was that the nodes are "losing each other" when the cluster starts, given that starting the partial cluster and re-integrating the nodes "step by step" seems to work.
I'm going to downgrade the current cluster to work with 4 node groups (2 data nodes per machine) and see if this improves the situation.
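
For reference, the node group counts mentioned in this report follow from the usual MySQL Cluster rule that the number of node groups equals the number of data nodes divided by NoOfReplicas. The sketch below assumes NoOfReplicas=2 (i.e. that "1 replica" means one additional copy of each fragment); this is an assumption, since no config.ini is attached here. The helper name `node_groups` is hypothetical.

```python
def node_groups(data_nodes, no_of_replicas):
    """Number of node groups for a given data node count and NoOfReplicas."""
    if no_of_replicas <= 0 or data_nodes % no_of_replicas != 0:
        raise ValueError("data node count must be a multiple of NoOfReplicas")
    return data_nodes // no_of_replicas


if __name__ == "__main__":
    # Assumption: NoOfReplicas=2 (not confirmed by an attached config.ini).
    print(node_groups(4 * 4, 2))  # current setup: 4 machines x 4 data nodes
    print(node_groups(4 * 2, 2))  # planned setup: 4 machines x 2 data nodes
```

Under that assumption, the current setup gives 8 node groups and the planned one gives 4, matching the figures above.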

How to repeat:
- Start up the cluster
- Insert data, work, whatever
- Shut down the cluster (planned or crash)
- Restart the nodes without --initial (you want the data of the last checkpoint)
-> Error occurs

Can you upload config.ini and the error, trace, and cluster logs?

/Jonas

Please try to repeat this with a newer version, 5.0.26, and, if the problem persists, send the information Jonas Oreland already asked for.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".