MySQL Bugs: #21509: Cluster node shuts down caused by error 2311

Bug #21509	Cluster node shuts down caused by error 2311
Submitted:	8 Aug 2006 15:32	Modified:	3 May 2012 7:02
Reporter:	Alexander Bangert	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.0.22	OS:	Linux (Gentoo Linux, Kernel 2.6.16)
Assigned to:		CPU Architecture:	Any

Description:
I'm using a cluster with 1 management, 2 ndbd and 2 storage nodes on 3 machines (mgmd on machine 1, 1 ndbd and 1 storage node on each other machine). When starting ndbd 2 (Node 3, machine 3), it breaks down: 

Node 3: Forced node shutdown completed. Occured during startphase 3. Initiated b
y signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal
 error, programming error or missing error message, please report a bug). Tempor
ary error, rest

If ndbd1 (Node 2, machine 2) is still in start-up phase, it will break down, too:
ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 4. In
itiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbre
quire)(Internal error, programming error or missing error message, please report
 a bug). Temporary error
Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated b
y signal 0. Caused by error 2308: 'Another node failed during system restart, pl
ease investigate error(s) on other node(s)(Restart error). Temporary error, rest
art node'.

How to repeat:
Unknown

Updating Category.

Please provide cluster log and trace/error logs

I have experianced This error on 5.0.21 when doing the following;
8dn, 1mgm, managed node restart in same node group as 'Master' node.

mgm> 15 restart -n

< Normal Node shutdown 'closed' messages >
2006-08-10 09:43:23 [MgmSrvr] ALERT    -- Node 16: Arbitration check won - node group majority
2006-08-10 09:43:23 [MgmSrvr] INFO     -- Node 16: President restarts arbitration thread [state=6]
2006-08-10 09:43:23 [MgmSrvr] INFO     -- Node 15: Node shutdown completed, restarting, no start.
2006-08-10 09:43:23 [MgmSrvr] INFO     -- Mgmt server state: nodeid 15 reserved for ip 10.10.250.15, m_reserved_nodes 000000000000800e.
2006-08-10 09:43:24 [MgmSrvr] INFO     -- Node 1: Node 15 Connected
2006-08-10 09:43:24 [MgmSrvr] INFO     -- Mgmt server state: nodeid 15 freed, m_reserved_nodes 000000000000000e.

mgm> 15 start

2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Start initiated (version 5.0.21)
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Start phase 0 completed 
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 11 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 12 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 13 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 14 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 16 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 17 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Communication to Node 18 opened
2006-08-10 09:43:39 [MgmSrvr] INFO     -- Node 15: Waiting 30 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:41 [MgmSrvr] INFO     -- Node 15: Waiting 28 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:44 [MgmSrvr] INFO     -- Node 15: Waiting 25 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:47 [MgmSrvr] INFO     -- Node 15: Waiting 22 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:50 [MgmSrvr] INFO     -- Node 15: Waiting 19 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:53 [MgmSrvr] INFO     -- Node 15: Waiting 16 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:56 [MgmSrvr] INFO     -- Node 15: Waiting 13 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:43:59 [MgmSrvr] INFO     -- Node 15: Waiting 10 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:44:03 [MgmSrvr] INFO     -- Node 15: Waiting 7 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:44:06 [MgmSrvr] INFO     -- Node 15: Waiting 4 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:44:09 [MgmSrvr] INFO     -- Node 15: Waiting 1 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]
2006-08-10 09:44:12 [MgmSrvr] INFO     -- Node 1: Node 15 Connected
2006-08-10 09:44:12 [MgmSrvr] ALERT    -- Node 15: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, res
2006-08-10 09:44:21 [MgmSrvr] WARNING  -- Node 14: Failure handling of node 15 has not completed in 1 min. - state = 3

===ndb_15_error.log===
Time: Thursday 10 August 2006 - 09:44:08
Status: Temporary error, restart node
Message: Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug)
Error: 2311
Error data: Unable to start missing node group!  starting: 0000000000008000 (missing fs for: 0000000000000000)
Error object: QMGR (Line: 1356) 0x0000000a
Program: ndbd
Pid: 28192
Trace: /data/mysql-cluster/ndbd/ndb_15_trace.log.7
Version: Version 5.0.21
***EOM***

===ndb_15_out.log===
2006-08-10 09:43:18 [ndbd] INFO     -- Restarting system
2006-08-10 09:43:20 [ndbd] INFO     -- Node 15: Node shutdown completed, restart
ing, no start.
2006-08-10 09:43:20 [ndbd] INFO     -- Ndb has terminated (pid 853) restarting
2006-08-10 09:43:20 [ndbd] INFO     -- Angel pid: 852 ndb pid: 28192
2006-08-10 09:43:20 [ndbd] INFO     -- NDB Cluster -- DB node 15
2006-08-10 09:43:20 [ndbd] INFO     -- Version 5.0.21 --
2006-08-10 09:43:20 [ndbd] INFO     -- Configuration fetched at 10.10.250.1 por
t 1186
2006-08-10 09:43:20 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2006-08-10 09:44:08 [ndbd] INFO     -- Error handler startup shutting down syste
m
2006-08-10 09:44:08 [ndbd] INFO     -- Error handler shutdown completed - exitin
g
2006-08-10 09:44:08 [ndbd] INFO     -- Angel received ndbd startup failure count
 1.
2006-08-10 09:44:08 [ndbd] ALERT    -- Node 15: Forced node shutdown completed. 
Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Confl
ict when selecting restart type(Internal error, programming error or missing err
or message, please report a bug). Temporary error, res

Simply waiting 2 minutes, and running the 'ndbd' process brought the process back up correctly. But perhaps a bug, attaching trace.7.

Saw this in dmesg, however no timestamps so I am uncertain that this is related to this mornings event;

ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x84 { DriveStatusError BadCRC }
Messages repeat ~ 20 times.

RHEL4 Linux on my system.

Hi,

From reading Adam log i gather
Waiting 30 sec for nodes 0000000000077800 to connect, nodes [ all: 00000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ]

Starting node not been able to connect to any other node in cluster.
connected: 80000 (i.e node 15, and it's node 15 starting)

Do you seen any "Communication to Node 15 opened" in your cluster log ?
i.e. other nodes accepting node 15 to connect...

/Jonas

The lines just after the error 2311;
2006-08-10 09:44:12 [MgmSrvr] ALERT    -- Node 15: Forced node shutdown
completed. Occured during startphase 1. Initiated by signal 0. Caused by error
2311: 'Conflict when selecting restart type(Internal error, programming error or
missing error message, please report a bug). Temporary error, res

2006-08-10 09:44:21 [MgmSrvr] WARNING  -- Node 14: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:21 [MgmSrvr] WARNING  -- Node 17: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:21 [MgmSrvr] WARNING  -- Node 13: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:22 [MgmSrvr] WARNING  -- Node 11: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:22 [MgmSrvr] WARNING  -- Node 16: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:22 [MgmSrvr] WARNING  -- Node 12: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:22 [MgmSrvr] WARNING  -- Node 18: Failure handling of node 15 has not completed in 1 min. - state = 3
2006-08-10 09:44:26 [MgmSrvr] INFO     -- Node 18: Communication to Node 15 opened
2006-08-10 09:44:26 [MgmSrvr] INFO     -- Node 14: Communication to Node 15 opened
2006-08-10 09:44:26 [MgmSrvr] INFO     -- Node 17: Communication to Node 15 opened
2006-08-10 09:44:26 [MgmSrvr] INFO     -- Node 13: Communication to Node 15 opened
2006-08-10 09:44:27 [MgmSrvr] INFO     -- Node 11: Communication to Node 15 opened
2006-08-10 09:44:27 [MgmSrvr] INFO     -- Node 16: Communication to Node 15 opened
2006-08-10 09:44:27 [MgmSrvr] INFO     -- Node 12: Communication to Node 15 opened
2006-08-10 09:44:30 [MgmSrvr] INFO     -- Node 16: Local checkpoint 33911 started. Keep GCI = 5553101 oldest restorable GCI = 5553262
2006-08-10 09:46:13 [MgmSrvr] INFO     -- Mgmt server state: nodeid 15 reserved for ip 10.10.250.15, m_reserved_nodes 000000000000800e.

Hope this clarifies.

I am getting this same error where ndbd terminates suddenly less than a minute after starting up. The error that shows up on ndb_mgm and in the log file is

Node 2: Forced node shutdown completed. Occur during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug) Temporary error, rest

I am using mysql version 5.0.22 on RHEL4. The problem does not occur when I start ndbd with the --initial parameter. It only occurs when I restart the cluster data nodes without the --initial parameter. The cluster has 4 data nodes and 2 node groups. Individual ndbd processes gets killed a short while after starting up, so the show command on ndb_mgm client displays status as starting for node 2 for a while and then starts displaying waiting for connection again.

The same problem does not occur on the test bed (another set of HP intel based machines) which is same version of mysql and OS! However test bed configuration has only 2 data nodes.

I would like to send you the error log and trace files, but I am unable to add files to this bug.

Any further information required here?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I do not have anything "new" here, but as we are experiencing this bug also (very annoying) and i have just replicated it on a "fresh" setup, i'm supplying some traces/logfiles.

We are running 4 machines (AMD64 w SUSE 10 Linux), 4 CPUs each.
We've run several test-setups (varying the cluster-config in machines, and allocated ram), but this bug persisted.

In the files submitted the setup was 16dns (1 per CPU), 4 sql (1 per machine), 1 mgmd (on 1 machine).

If i'm restarting the nodes with "no start" (ndbd -n) i am able to avoid this error.

Gotta correct the last comment. It's only working with "-n" if i have only minimal data inside the cluster. With about 2 GB data it is crashing as always.

Unfortunately i cannot submit my files here either. I will try to open a new bug to upload the files.

We had the same issue.  I noticed by starting all of the nodes quickly (running ndbd on the data nodes within seconds), the system would start.  I suspect this is because our DB is large (10GB) and we are running on 4x8GB nodes, so the DB cannot fit into a single node.

All reporters,

Please, try to repeat with the newer version, 5.0.26, and inform about the results.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Hello,

bug description #40229 shows that the issue also occurs with versions 5.0.46 and 5.0.67.

Regards,

Martin

Only observed in 5.0, never observed on 5.1. Please try a new verison.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".