Bug #21509 | Cluster node shuts down caused by error 2311 | ||
---|---|---|---|
Submitted: | 8 Aug 2006 15:32 | Modified: | 3 May 2012 7:02 |
Reporter: | Alexander Bangert | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 5.0.22 | OS: | Linux (Gentoo Linux, Kernel 2.6.16) |
Assigned to: | CPU Architecture: | Any |
[8 Aug 2006 15:32]
Alexander Bangert
[8 Aug 2006 15:41]
MySQL Verification Team
Updating Category.
[8 Aug 2006 16:19]
Jonas Oreland
Please provide cluster log and trace/error logs
[10 Aug 2006 0:47]
Adam Dixon
I have experianced This error on 5.0.21 when doing the following; 8dn, 1mgm, managed node restart in same node group as 'Master' node. mgm> 15 restart -n < Normal Node shutdown 'closed' messages > 2006-08-10 09:43:23 [MgmSrvr] ALERT -- Node 16: Arbitration check won - node group majority 2006-08-10 09:43:23 [MgmSrvr] INFO -- Node 16: President restarts arbitration thread [state=6] 2006-08-10 09:43:23 [MgmSrvr] INFO -- Node 15: Node shutdown completed, restarting, no start. 2006-08-10 09:43:23 [MgmSrvr] INFO -- Mgmt server state: nodeid 15 reserved for ip 10.10.250.15, m_reserved_nodes 000000000000800e. 2006-08-10 09:43:24 [MgmSrvr] INFO -- Node 1: Node 15 Connected 2006-08-10 09:43:24 [MgmSrvr] INFO -- Mgmt server state: nodeid 15 freed, m_reserved_nodes 000000000000000e. mgm> 15 start 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Start initiated (version 5.0.21) 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Start phase 0 completed 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 11 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 12 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 13 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 14 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 16 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 17 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Communication to Node 18 opened 2006-08-10 09:43:39 [MgmSrvr] INFO -- Node 15: Waiting 30 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:41 [MgmSrvr] INFO -- Node 15: Waiting 28 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:44 [MgmSrvr] INFO -- Node 15: Waiting 25 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:47 [MgmSrvr] INFO -- Node 15: Waiting 22 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:50 [MgmSrvr] INFO -- Node 15: Waiting 19 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:53 [MgmSrvr] INFO -- Node 15: Waiting 16 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:56 [MgmSrvr] INFO -- Node 15: Waiting 13 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:43:59 [MgmSrvr] INFO -- Node 15: Waiting 10 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:44:03 [MgmSrvr] INFO -- Node 15: Waiting 7 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:44:06 [MgmSrvr] INFO -- Node 15: Waiting 4 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:44:09 [MgmSrvr] INFO -- Node 15: Waiting 1 sec for nodes 0000000000077800 to connect, nodes [ all: 000000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] 2006-08-10 09:44:12 [MgmSrvr] INFO -- Node 1: Node 15 Connected 2006-08-10 09:44:12 [MgmSrvr] ALERT -- Node 15: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, res 2006-08-10 09:44:21 [MgmSrvr] WARNING -- Node 14: Failure handling of node 15 has not completed in 1 min. - state = 3 ===ndb_15_error.log=== Time: Thursday 10 August 2006 - 09:44:08 Status: Temporary error, restart node Message: Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug) Error: 2311 Error data: Unable to start missing node group! starting: 0000000000008000 (missing fs for: 0000000000000000) Error object: QMGR (Line: 1356) 0x0000000a Program: ndbd Pid: 28192 Trace: /data/mysql-cluster/ndbd/ndb_15_trace.log.7 Version: Version 5.0.21 ***EOM*** ===ndb_15_out.log=== 2006-08-10 09:43:18 [ndbd] INFO -- Restarting system 2006-08-10 09:43:20 [ndbd] INFO -- Node 15: Node shutdown completed, restart ing, no start. 2006-08-10 09:43:20 [ndbd] INFO -- Ndb has terminated (pid 853) restarting 2006-08-10 09:43:20 [ndbd] INFO -- Angel pid: 852 ndb pid: 28192 2006-08-10 09:43:20 [ndbd] INFO -- NDB Cluster -- DB node 15 2006-08-10 09:43:20 [ndbd] INFO -- Version 5.0.21 -- 2006-08-10 09:43:20 [ndbd] INFO -- Configuration fetched at 10.10.250.1 por t 1186 2006-08-10 09:43:20 [ndbd] INFO -- WatchDog timer is set to 6000 ms 2006-08-10 09:44:08 [ndbd] INFO -- Error handler startup shutting down syste m 2006-08-10 09:44:08 [ndbd] INFO -- Error handler shutdown completed - exitin g 2006-08-10 09:44:08 [ndbd] INFO -- Angel received ndbd startup failure count 1. 2006-08-10 09:44:08 [ndbd] ALERT -- Node 15: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Confl ict when selecting restart type(Internal error, programming error or missing err or message, please report a bug). Temporary error, res Simply waiting 2 minutes, and running the 'ndbd' process brought the process back up correctly. But perhaps a bug, attaching trace.7.
[10 Aug 2006 1:00]
Adam Dixon
Saw this in dmesg, however no timestamps so I am uncertain that this is related to this mornings event; ata1: status=0x51 { DriveReady SeekComplete Error } ata1: error=0x84 { DriveStatusError BadCRC } Messages repeat ~ 20 times. RHEL4 Linux on my system.
[10 Aug 2006 7:29]
Jonas Oreland
Hi, From reading Adam log i gather Waiting 30 sec for nodes 0000000000077800 to connect, nodes [ all: 00000000007f800 connected: 0000000000008000 no-wait: 0000000000000000 ] Starting node not been able to connect to any other node in cluster. connected: 80000 (i.e node 15, and it's node 15 starting) Do you seen any "Communication to Node 15 opened" in your cluster log ? i.e. other nodes accepting node 15 to connect... /Jonas
[10 Aug 2006 7:48]
Adam Dixon
The lines just after the error 2311; 2006-08-10 09:44:12 [MgmSrvr] ALERT -- Node 15: Forced node shutdown completed. Occured during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type(Internal error, programming error or missing error message, please report a bug). Temporary error, res 2006-08-10 09:44:21 [MgmSrvr] WARNING -- Node 14: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:21 [MgmSrvr] WARNING -- Node 17: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:21 [MgmSrvr] WARNING -- Node 13: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:22 [MgmSrvr] WARNING -- Node 11: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:22 [MgmSrvr] WARNING -- Node 16: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:22 [MgmSrvr] WARNING -- Node 12: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:22 [MgmSrvr] WARNING -- Node 18: Failure handling of node 15 has not completed in 1 min. - state = 3 2006-08-10 09:44:26 [MgmSrvr] INFO -- Node 18: Communication to Node 15 opened 2006-08-10 09:44:26 [MgmSrvr] INFO -- Node 14: Communication to Node 15 opened 2006-08-10 09:44:26 [MgmSrvr] INFO -- Node 17: Communication to Node 15 opened 2006-08-10 09:44:26 [MgmSrvr] INFO -- Node 13: Communication to Node 15 opened 2006-08-10 09:44:27 [MgmSrvr] INFO -- Node 11: Communication to Node 15 opened 2006-08-10 09:44:27 [MgmSrvr] INFO -- Node 16: Communication to Node 15 opened 2006-08-10 09:44:27 [MgmSrvr] INFO -- Node 12: Communication to Node 15 opened 2006-08-10 09:44:30 [MgmSrvr] INFO -- Node 16: Local checkpoint 33911 started. Keep GCI = 5553101 oldest restorable GCI = 5553262 2006-08-10 09:46:13 [MgmSrvr] INFO -- Mgmt server state: nodeid 15 reserved for ip 10.10.250.15, m_reserved_nodes 000000000000800e. Hope this clarifies.
[10 Aug 2006 9:08]
Pankaj Saini
I am getting this same error where ndbd terminates suddenly less than a minute after starting up. The error that shows up on ndb_mgm and in the log file is Node 2: Forced node shutdown completed. Occur during startphase 1. Initiated by signal 0. Caused by error 2311: 'Conflict when selecting restart type (Internal error, programming error or missing error message, please report a bug) Temporary error, rest I am using mysql version 5.0.22 on RHEL4. The problem does not occur when I start ndbd with the --initial parameter. It only occurs when I restart the cluster data nodes without the --initial parameter. The cluster has 4 data nodes and 2 node groups. Individual ndbd processes gets killed a short while after starting up, so the show command on ndb_mgm client displays status as starting for node 2 for a while and then starts displaying waiting for connection again. The same problem does not occur on the test bed (another set of HP intel based machines) which is same version of mysql and OS! However test bed configuration has only 2 data nodes. I would like to send you the error log and trace files, but I am unable to add files to this bug.
[6 Sep 2006 2:18]
Adam Dixon
Any further information required here?
[8 Sep 2006 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[18 Sep 2006 8:28]
Stefan Pasel
I do not have anything "new" here, but as we are experiencing this bug also (very annoying) and i have just replicated it on a "fresh" setup, i'm supplying some traces/logfiles. We are running 4 machines (AMD64 w SUSE 10 Linux), 4 CPUs each. We've run several test-setups (varying the cluster-config in machines, and allocated ram), but this bug persisted. In the files submitted the setup was 16dns (1 per CPU), 4 sql (1 per machine), 1 mgmd (on 1 machine).
[19 Sep 2006 6:30]
Stefan Pasel
If i'm restarting the nodes with "no start" (ndbd -n) i am able to avoid this error.
[19 Sep 2006 6:59]
Stefan Pasel
Gotta correct the last comment. It's only working with "-n" if i have only minimal data inside the cluster. With about 2 GB data it is crashing as always. Unfortunately i cannot submit my files here either. I will try to open a new bug to upload the files.
[21 Sep 2006 17:31]
Jeffrey Huckaby
We had the same issue. I noticed by starting all of the nodes quickly (running ndbd on the data nodes within seconds), the system would start. I suspect this is because our DB is large (10GB) and we are running on 4x8GB nodes, so the DB cannot fit into a single node.
[22 Oct 2006 10:39]
Valeriy Kravchuk
All reporters, Please, try to repeat with the newer version, 5.0.26, and inform about the results.
[23 Nov 2006 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[17 Nov 2010 15:18]
Martin Herbort
Hello, bug description #40229 shows that the issue also occurs with versions 5.0.46 and 5.0.67. Regards, Martin
[22 Dec 2010 9:59]
Jørgen Austvik
Only observed in 5.0, never observed on 5.1. Please try a new verison.
[23 Jan 2011 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".