Bug #22469 Forced node shutdown (Error 2311, 2308)
Submitted: 19 Sep 2006 8:45 Modified: 22 Nov 2006 10:41
Reporter: Stefan Pasel
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 5.0.24 OS: Linux (Suse 10)
Assigned to: CPU Architecture: Any
Tags: cluster

[19 Sep 2006 8:45] Stefan Pasel
This is a follow-up to http://bugs.mysql.com/bug.php?id=21509
I'm filing this as a new bug because I cannot attach traces to the old one. Furthermore, I think this bug is serious, as restarting the cluster after a crash without losing data is almost impossible.

The bug addresses the following messages:

Forced node shutdown completed. Occured during startphase X. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug).

Forced node shutdown completed. Occured during startphase X. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

[Note: Bug#21509 also mentions error 2341: 'Internal program error (failed ndbrequire)'. I've also seen this, but could not reproduce it with the bare-bones test case, so it is not included in the logs.]
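
When going through the cluster log for these messages, a small parser helps to tabulate start phase and error code per node. The following is only a sketch: the function name `parse_forced_shutdown` and the returned dictionary layout are assumptions made for illustration, and the pattern is derived solely from the two messages quoted above (including the log's own spelling "Occured").

```python
import re

# Pattern derived from the two messages quoted in this report.
# "Occurr?ed" accepts both the log's spelling and the corrected one.
SHUTDOWN_RE = re.compile(
    r"Forced node shutdown completed\.\s+"
    r"Occurr?ed during startphase (?P<phase>\w+)\.\s+"
    r"Initiated by signal (?P<signal>\d+)\.\s+"
    r"Caused by error (?P<error>\d+): '(?P<message>.*)"
)


def parse_forced_shutdown(line):
    """Return phase, signal, error code and message text, or None if no match.

    Hypothetical helper, not part of any MySQL tool.
    """
    match = SHUTDOWN_RE.search(line)
    if match is None:
        return None
    phase = match.group("phase")
    return {
        # The report uses "X" as a placeholder phase; map non-numbers to None.
        "phase": int(phase) if phase.isdigit() else None,
        "signal": int(match.group("signal")),
        "error": int(match.group("error")),
        "message": match.group("message"),
    }


if __name__ == "__main__":
    sample = (
        "Forced node shutdown completed. Occured during startphase 4. "
        "Initiated by signal 0. Caused by error 2308: 'Another node failed "
        "during system restart, please investigate error(s) on other "
        "node(s)(Restart error). Temporary error, restart node'."
    )
    print(parse_forced_shutdown(sample))
```

Counting the parsed results by `error` and `phase` makes it easier to see whether 2311 and 2308 show up on different nodes during the same restart attempt.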

Our test setup: 4 machines with 4 data nodes each (1 data node per CPU), connected via Gigabit Ethernet, 1 replica (i.e. 8 node groups)
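
For reference, the node group count follows from the number of data nodes divided by the configured number of replicas. The sketch below assumes that "1 replica" here means one additional copy of the data, i.e. NoOfReplicas=2 in config.ini; that reading is an assumption, but it is the one that matches 16 data nodes forming 8 node groups. The function name `node_groups` is made up for this example.

```python
def node_groups(data_nodes, no_of_replicas):
    """Number of node groups = data nodes / NoOfReplicas.

    Hypothetical helper; no_of_replicas=2 is an assumed reading of
    "1 replica" (one extra copy of each fragment).
    """
    if no_of_replicas <= 0:
        raise ValueError("no_of_replicas must be positive")
    if data_nodes % no_of_replicas != 0:
        raise ValueError("data node count must be a multiple of NoOfReplicas")
    return data_nodes // no_of_replicas


if __name__ == "__main__":
    # 4 machines x 4 data nodes, assumed NoOfReplicas=2
    print(node_groups(4 * 4, 2))
    # Reduced setup: 4 machines x 2 data nodes
    print(node_groups(4 * 2, 2))
```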

Please also note that the error does not seem to be bound to a specific start phase. In the supplied test case I had almost NO DATA in the cluster (only the structure of 4 tables) and the cluster crashed in phase 1.
I could start the test case without the error by starting the data nodes with "ndbd -n".

After restoring the backup (~2 GB of data) and trying the "ndbd -n" workaround, the error occurred in phase 4. Without "ndbd -n" the error occurred in phase 1.
Starting a partial cluster (8 data nodes on 2 machines) and afterwards re-integrating the other 8 nodes into the partial cluster seems to work fine.

I have been looking into possible network outages, as my guess was that the nodes are "losing each other" when starting the cluster, and starting the partial cluster and re-integrating the nodes "step by step" seems to work.
I'm going to downgrade the current cluster to work with 4 node groups (2 data nodes per machine) and see if this improves the situation.

How to repeat:
- Start up the cluster
- Insert data.. work.. whatever
- Shut down the cluster (planned or crash)
- Restart the nodes without --initial (you want the data of the last checkpoint)
-> Error occurs
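
The restart step can be sketched as the command line used on each data node host. This only builds the argument list and does not run anything. The helper name `ndbd_restart_argv` and the host name `mgmhost` are placeholders invented for the example; the `-c` (connect string) and `-n` (nostart) options are the ones ndbd provides, and `--initial` is deliberately never added because it would wipe the node's file system.

```python
def ndbd_restart_argv(connectstring, nostart=False):
    """Build the ndbd command line for a restart that keeps existing data.

    Hypothetical helper for illustration only.
    nostart=True adds -n, the workaround mentioned in this report: the
    node then waits for a START command from the management client.
    """
    argv = ["ndbd", "-c", connectstring]
    if nostart:
        argv.append("-n")
    # --initial is intentionally never added: the last checkpoint is needed.
    return argv


if __name__ == "__main__":
    # "mgmhost:1186" is a placeholder management server address.
    print(" ".join(ndbd_restart_argv("mgmhost:1186")))
    print(" ".join(ndbd_restart_argv("mgmhost:1186", nostart=True)))
```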
[26 Sep 2006 16:37] Jonas Oreland
Can you upload config.ini and the error/trace/cluster logs?

[22 Oct 2006 10:41] Valeriy Kravchuk
Please try to repeat this with a newer version, 5.0.26, and, if a similar problem occurs, send the information Jonas Oreland already asked for.
[23 Nov 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".