Bug #62632 | NDB FS corruption on a Node + Cluster restart --> Cluster does not start | ||
---|---|---|---|
Submitted: | 5 Oct 2011 14:35 | Modified: | 9 Jan 2015 16:03 |
Reporter: | Santo Leto | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 6.3, 7.1 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[5 Oct 2011 14:35]
Santo Leto
[5 Oct 2011 14:36]
Santo Leto
In our cluster log we can verify the errors: 2011-10-05 15:09:24 [MgmSrvr] INFO -- NDB Cluster Management Server. mysql-5.1.51 ndb-6.3.39-GA 2011-10-05 15:09:24 [MgmSrvr] INFO -- Id: 1, Command port: *:1186 2011-10-05 15:10:12 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 127.0.0.1, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000000a. 2011-10-05 15:10:12 [MgmSrvr] INFO -- Node 1: Node 3 Connected 2011-10-05 15:10:13 [MgmSrvr] INFO -- Node 3: Start phase 0 completed 2011-10-05 15:10:13 [MgmSrvr] INFO -- Node 3: Communication to Node 4 opened 2011-10-05 15:10:13 [MgmSrvr] INFO -- Node 3: Waiting 30 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:13 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000002. 2011-10-05 15:10:16 [MgmSrvr] INFO -- Node 3: Waiting 27 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:19 [MgmSrvr] INFO -- Node 3: Waiting 24 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:22 [MgmSrvr] INFO -- Node 3: Waiting 21 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:25 [MgmSrvr] INFO -- Node 3: Waiting 18 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:28 [MgmSrvr] INFO -- Node 3: Waiting 15 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:31 [MgmSrvr] INFO -- Node 3: Waiting 12 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:34 [MgmSrvr] INFO -- Node 3: Waiting 9 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ] 2011-10-05 15:10:36 [MgmSrvr] INFO -- Mgmt server state: nodeid 4 reserved for ip 127.0.0.1, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012. 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 1: Node 4 Connected 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 4: Start phase 0 completed 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 4: Waiting 30 sec for nodes 0000000000000008 to connect, nodes [ all: 0000000000000018 connected: 0000000000000010 no-wait: 0000000000000000 ] 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 4: Node 3 Connected 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 3: Node 4 Connected 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 3: Start with all nodes 0000000000000018 2011-10-05 15:10:36 [MgmSrvr] INFO -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 0/1 2011-10-05 15:10:37 [MgmSrvr] INFO -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000002. 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 3: Node 4: API mysql-5.1.51 ndb-6.3.39 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 3: Start phase 1 completed 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 0/2 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 4: Node 3: API mysql-5.1.51 ndb-6.3.39 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 4: Start phase 1 completed 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 3: Start phase 2 completed (system restart) 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 4: Start phase 2 completed (system restart) 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 3: Start phase 3 completed (system restart) 2011-10-05 15:10:39 [MgmSrvr] INFO -- Node 4: Start phase 3 completed (system restart) 2011-10-05 15:10:39 [MgmSrvr] ALERT -- Node 1: Node 4 Disconnected 2011-10-05 15:10:39 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'. 2011-10-05 15:10:40 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 2011-10-05 15:10:40 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
[5 Oct 2011 14:37]
Santo Leto
* WORKAROUND: 1. start the cluster only with Node 3 (i.e. all nodes except the one with corrupted file system): Cluster will start successfully: ndb_mgm> show Cluster Configuration --------------------- [ndbd(NDB)] 2 node(s) id=3 @127.0.0.1 (mysql-5.1.51 ndb-6.3.39, starting, Nodegroup: 0) id=4 (not connected, accepting connect from localhost) [ndb_mgmd(MGM)] 1 node(s) id=1 @localhost (mysql-5.1.51 ndb-6.3.39) [mysqld(API)] 1 node(s) id=50 (not connected, accepting connect from any host) ndb_mgm> Node 3: Started (version 6.3.39) 2. Now start NDB Node 4 with --initial option: $HOME/mysql-official-releases/mysql-cluster-com-6.3.39-linux-x86_64-glibc23/bin/ndbd -c localhost:1186 --ndb-nodeid=4 --initial Node 4 will start and the cluster is still up and running: ndb_mgm> show Cluster Configuration --------------------- [ndbd(NDB)] 2 node(s) id=3 @127.0.0.1 (mysql-5.1.51 ndb-6.3.39, Nodegroup: 0, Master) id=4 @127.0.0.1 (mysql-5.1.51 ndb-6.3.39, Nodegroup: 0) [ndb_mgmd(MGM)] 1 node(s) id=1 @127.0.0.1 (mysql-5.1.51 ndb-6.3.39) [mysqld(API)] 1 node(s) id=50 @127.0.0.1 (mysql-5.1.51 ndb-6.3.39) ndb_mgm>
[3 Dec 2011 18:41]
Santo Leto
Same problem if on Node 4 we have error 2352 (instead of 2815): [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Occurred during startphase 4. Caused by error 2352: 'Invalid LCP(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.