Bug #22316 Four-node cluster loses Node Group 1
Submitted: 13 Sep 2006 15:49    Modified: 26 Oct 2006 17:22
Reporter: Steve Wolf    Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 5.0.24a    OS: Linux (CentOS 4.4 x86_64)
Assigned to:    CPU Architecture: Any
Tags: cluster, node groups

[13 Sep 2006 15:49] Steve Wolf
Description:
Running a four-node cluster.  Node 1 and Node 2 had each failed independently, moving Master to Node 3, the first node of Node Group 1.

For testing purposes, I failed Node 4 from the Management Node with the "4 stop" command.  When I restarted ndbd on Node 4, I saw in ndb_mgm:

> ndb_mgm> show
> Connected to Management Server at: localhost:1186
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, Nodegroup: 0)
> id=2    @192.168.4.196  (Version: 5.0.24, Nodegroup: 0)
> id=3    @192.168.4.197  (Version: 5.0.24, Nodegroup: 1, Master)
> id=4    @192.168.4.198  (Version: 5.0.24, starting, Nodegroup: 0)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5    @192.168.4.209  (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6    @192.168.4.195  (Version: 5.0.24)
> id=7    @192.168.4.196  (Version: 5.0.24)

So Node 4 was trying to come back up in Node Group 0 instead of Node Group 1.  It got stuck in this configuration, so I performed a shutdown.
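
For reference, the sequence up to this point was roughly the following. This is only a sketch of the commands described above (the exact invocations are not in the attached logs; the text after # is annotation):

ndb_mgm> 4 stop        # fail data node 4 for the test
shell> ndbd            # restarted on 192.168.4.198; it rejoined reporting Nodegroup 0
ndb_mgm> show          # output as quoted above
ndb_mgm> shutdown      # full cluster shutdown after node 4 got stuck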

When I restarted ndbd on the four nodes, ndb_mgm showed:

> ndb_mgm> show
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, starting, Nodegroup: 0, Master)
> id=2    @192.168.4.196  (Version: 5.0.24, starting, Nodegroup: 0)
> id=3    @192.168.4.197  (Version: 5.0.24, starting, Nodegroup: 0)
> id=4    @192.168.4.198  (Version: 5.0.24, starting, Nodegroup: 0)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5   (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6 (not connected, accepting connect from 192.168.4.195)
> id=7 (not connected, accepting connect from 192.168.4.196)

All four nodes are trying to come up in Node Group 0.  I didn't like the look of this, so I immediately issued a shutdown.  Then I attempted to bring up just Nodes 1 and 2, hoping that after they established Node Group 0 the other two nodes would properly create Node Group 1.  This time, nobody claimed Master, and the cluster failed to start:

> ndb_mgm> show
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, starting, Nodegroup: 0)
> id=2    @192.168.4.196  (Version: 5.0.24, starting, Nodegroup: 0)
> id=3 (not connected, accepting connect from 192.168.4.197)
> id=4 (not connected, accepting connect from 192.168.4.198)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5   (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6 (not connected, accepting connect from 192.168.4.195)
> id=7 (not connected, accepting connect from 192.168.4.196)
> 
> ndb_mgm> Node 1: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, rest
> Node 2: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

So now the cluster is unusable and I have to rebuild it.
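
Rebuilding here means an initial system restart, which erases the NDB file systems on the data nodes. A rough sketch of that, assuming ndbd is started by hand on each host:

# on each of the four data node hosts, after ndb_mgmd is up on 192.168.4.209:
shell> ndbd --initial      # --initial wipes this node's NDB file system
# then watch the restart from the management client:
ndb_mgm> all status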

I suspect this may be related to the Master Node being in Node Group 1 instead of Node Group 0, but I could be wrong.

How to repeat:
Unknown
[13 Sep 2006 16:06] Steve Wolf
Node Group 0 cluster log and trace files

Attachment: bug22316_logs_group0.tar.gz (application/x-gzip, text), 177.33 KiB.

[13 Sep 2006 16:07] Steve Wolf
Node Group 1 cluster log and trace files

Attachment: bug22316_logs_group1.tar.gz (application/x-gzip, text), 53.26 KiB.

[13 Sep 2006 16:08] Steve Wolf
Management Node log and configuration files

Attachment: bug22316_logs_mgmt.tar.gz (application/x-gzip, text), 16.68 KiB.

[13 Sep 2006 17:03] Steve Wolf
In rebuilding the cluster, I see that all nodes show as Nodegroup 0 before the cluster is brought up. So this bug simplifies to the errors that occur when bringing the cluster up. The errors claim to be temporary, but they happen every time.
[26 Sep 2006 17:22] Jonas Oreland
Hi,

Reading your cluster log, I came to the following conclusions:

* The system restart fails because you don't start all 4 nodes fast enough.
With the default settings, nodes have 30s to get in contact with each other.
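
If the 30s window is the issue, it can be widened in config.ini on the management node. A minimal sketch (StartPartialTimeout is the parameter controlling this window, with a default of 30000 ms; the value below is only an example):

[ndbd default]
NoOfReplicas=2
# give data nodes longer than the default 30000 ms to contact each other
# during a system restart before a partial start is attempted
StartPartialTimeout=60000

Note that the management server and the data nodes need to be restarted before a changed config.ini takes effect.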

Otherwise: As stated in ndb_1_error.log
--
Time: Wednesday 13 September 2006 - 00:38:07
Status: Temporary error, restart node
Message: Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug)
Error: 2311
Error data: Unable to start missing node group!  starting: 0000000000000006 (missing fs for: 0000000000000000)
Error object: QMGR (Line: 1356) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 24521
Trace: /usr/local/mysql/data/ndb_1_trace.log.2
Version: Version 5.0.24
***EO
--

This means that nodes 1 & 2 have connected, but an entire node group is
missing, as indicated by:
"Error data: Unable to start missing node group!  starting: 0000000000000006"

Please respond as to whether this is a correct conclusion.
(Note: if performing an initial start, this timeout is unlimited.)

/Jonas
[26 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".