Bug #22316 Four-node cluster loses Node Group 1
Submitted: 13 Sep 2006 15:49    Modified: 26 Oct 2006 17:22
Reporter: Steve Wolf    Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 5.0.24a    OS: Linux (CentOS 4.4 x86_64)
Assigned to:    CPU Architecture: Any
Tags: cluster, node groups

[13 Sep 2006 15:49] Steve Wolf
Description:
Running a four-node cluster.  Node 1 and Node 2 had each failed independently, moving Master to Node 3, the first node of Node Group 1.

For testing purposes, I failed Node 4 from the Management Node with the "4 stop" command.  When I restarted ndbd on Node 4, I saw in ndb_mgm:

> ndb_mgm> show
> Connected to Management Server at: localhost:1186
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, Nodegroup: 0)
> id=2    @192.168.4.196  (Version: 5.0.24, Nodegroup: 0)
> id=3    @192.168.4.197  (Version: 5.0.24, Nodegroup: 1, Master)
> id=4    @192.168.4.198  (Version: 5.0.24, starting, Nodegroup: 0)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5    @192.168.4.209  (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6    @192.168.4.195  (Version: 5.0.24)
> id=7    @192.168.4.196  (Version: 5.0.24)

So Node 4 was trying to come back up in Node Group 0 instead of Node Group 1.  It got stuck in this configuration, so I performed a shutdown.
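
For reference, the sequence up to this point was roughly the following. This is only a sketch of the commands described above (the exact invocations are not in the attached logs; the text after # is annotation):

ndb_mgm> 4 stop        # fail data node 4 for the test
shell> ndbd            # restarted on 192.168.4.198; it rejoined reporting Nodegroup 0
ndb_mgm> show          # output as quoted above
ndb_mgm> shutdown      # full cluster shutdown after node 4 got stuck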

When I restarted ndbd on the four nodes, ndb_mgm showed:

> ndb_mgm> show
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, starting, Nodegroup: 0, Master)
> id=2    @192.168.4.196  (Version: 5.0.24, starting, Nodegroup: 0)
> id=3    @192.168.4.197  (Version: 5.0.24, starting, Nodegroup: 0)
> id=4    @192.168.4.198  (Version: 5.0.24, starting, Nodegroup: 0)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5   (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6 (not connected, accepting connect from 192.168.4.195)
> id=7 (not connected, accepting connect from 192.168.4.196)

All four nodes are trying to come up in Node Group 0.  I didn't like the look of this, so I immediately issued a shutdown.  Then I attempted to bring up just Nodes 1 and 2, hoping that after they established Node Group 0 the other two nodes would properly create Node Group 1.  This time, nobody claimed Master, and the cluster failed to start:

> ndb_mgm> show
> Cluster Configuration
> ---------------------
> [ndbd(NDB)]     4 node(s)
> id=1    @192.168.4.195  (Version: 5.0.24, starting, Nodegroup: 0)
> id=2    @192.168.4.196  (Version: 5.0.24, starting, Nodegroup: 0)
> id=3 (not connected, accepting connect from 192.168.4.197)
> id=4 (not connected, accepting connect from 192.168.4.198)
> 
> [ndb_mgmd(MGM)] 1 node(s)
> id=5   (Version: 5.0.24)
> 
> [mysqld(API)]   2 node(s)
> id=6 (not connected, accepting connect from 192.168.4.195)
> id=7 (not connected, accepting connect from 192.168.4.196)
> 
> ndb_mgm> Node 1: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, rest
> Node 2: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

So now the cluster is unusable and I have to rebuild it.
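
Rebuilding here means an initial system restart, which erases the NDB file systems on the data nodes. A rough sketch of that, assuming ndbd is started by hand on each host:

# on each of the four data node hosts, after ndb_mgmd is up on 192.168.4.209:
shell> ndbd --initial      # --initial wipes this node's NDB file system
# then watch the restart from the management client:
ndb_mgm> all status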

I suspect this may be related to the Master Node being in Node Group 1 instead of Node Group 0, but I could be wrong.

How to repeat:
Unknown
[13 Sep 2006 16:06] Steve Wolf
Node Group 0 cluster log and trace files

Attachment: bug22316_logs_group0.tar.gz (application/x-gzip, text), 177.33 KiB.

[13 Sep 2006 16:07] Steve Wolf
Node Group 1 cluster log and trace files

Attachment: bug22316_logs_group1.tar.gz (application/x-gzip, text), 53.26 KiB.

[13 Sep 2006 16:08] Steve Wolf
Management Node log and configuration files

Attachment: bug22316_logs_mgmt.tar.gz (application/x-gzip, text), 16.68 KiB.

[13 Sep 2006 17:03] Steve Wolf
In rebuilding the cluster, I see that all nodes show as Nodegroup 0 before the cluster is brought up. So this bug simplifies to the errors that occur when bringing the cluster up. The errors claim to be temporary, but they happen every time.
[26 Sep 2006 17:22] Jonas Oreland
Hi,

Reading your cluster log, I came to the following conclusions:

* The system restart fails because you don't start all 4 nodes fast enough.
With the default settings, nodes have 30s to get in contact with each other.
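
If the 30s window is the issue, it can be widened in config.ini on the management node. A minimal sketch (StartPartialTimeout is the parameter controlling this window, with a default of 30000 ms; the value below is only an example):

[ndbd default]
NoOfReplicas=2
# give data nodes longer than the default 30000 ms to contact each other
# during a system restart before a partial start is attempted
StartPartialTimeout=60000

Note that the management server and the data nodes need to be restarted before a changed config.ini takes effect.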

Otherwise: As stated in ndb_1_error.log
--
Time: Wednesday 13 September 2006 - 00:38:07
Status: Temporary error, restart node
Message: Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug)
Error: 2311
Error data: Unable to start missing node group!  starting: 0000000000000006 (missing fs for: 0000000000000000)
Error object: QMGR (Line: 1356) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 24521
Trace: /usr/local/mysql/data/ndb_1_trace.log.2
Version: Version 5.0.24
***EO
--

This means that nodes 1 & 2 have connected, but an entire node group is
missing, as indicated by:
"Error data: Unable to start missing node group!  starting: 0000000000000006"

Please respond as to whether this is a correct conclusion.
(Note: if performing an initial start, this timeout is unlimited.)

/Jonas
[26 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".