Bug #25149 Unable to restart data node
Submitted: 18 Dec 2006 16:09 Modified: 27 Jan 2008 8:10
Reporter: Lars Bo Svenningsen Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1.14-beta-linux-i686-glibc23 OS:Linux (Debian)
Assigned to: Assigned Account CPU Architecture:Any
Tags: Ndbd file system error, REDO Log

[18 Dec 2006 16:09] Lars Bo Svenningsen
Description:
Tried to restart a data node ndb_mgm> 3 restart
The node goes down, but fails to become available again. It remains in the state "starting":
 Connected to Management Server at: 192.168.1.99:1186
 Cluster Configuration
 ---------------------
 [ndbd(NDB)]     2 node(s) 
 id=2    @192.168.99.17  (Version: 5.1.14, Nodegroup: 0, Master)
 id=3    @192.168.99.18  (Version: 5.1.14, starting, Nodegroup: 0)

In ndb_mgm the follwing can be seen:
Node 3: Forced node shutdown completed, restarting. Occured during startphase 5. Initiated by signal 0. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'. - Unknown error code: Unknown result: Unknown error code

Tried to do '--initial' on the node, but it didn't help. Also tried to remove the /var/lib/mysql-cluster directory before doing "--initial"
The other data node started without problems.

Log files will follow ...

How to repeat:
Happens everytime I try and start the node.
[18 Dec 2006 16:12] Lars Bo Svenningsen
ndb_3_out.log

Attachment: ndb_3_out.log (text/x-log), 4.04 KiB.

[18 Dec 2006 16:13] Lars Bo Svenningsen
ndb1_cluster.log

Attachment: ndb1_cluster.log (text/x-log), 10.63 KiB.

[18 Dec 2006 16:15] Lars Bo Svenningsen
Setup: Debian Etch - Linux 2.6.17
[18 Dec 2006 23:10] Jonas Oreland
Hi,

Can you also upload error/tracefile for failing node ?

/Jonas
[19 Dec 2006 7:54] Lars Bo Svenningsen
ndb_3_trace.log.9

Attachment: bug-data-25149.tgz (application/x-gzip, text), 42.02 KiB.

[19 Dec 2006 8:43] Lars Bo Svenningsen
Found a SQL Node still attached to the cluster, stopped it and --initial again.
I now managed to get the data note up'n'running, but still I see some disturbing messages (Permanent error: Application error) in ndb_1_cluster.log:

2006-12-19 09:36:01 [MgmSrvr] INFO     -- Node 3: Node shutdown initiated
2006-12-19 09:36:11 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 2: We are Node 2 with dynamic ID 1, our left neighbour is Node 65535, our right is Node 65535
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2006-12-19 09:36:11 [MgmSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=7]
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 3: Node shutdown completed, restarting, no start.
2006-12-19 09:36:11 [MgmSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 1
2006-12-19 09:36:11 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
2006-12-19 09:36:12 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=04cc002799daea3f]
2006-12-19 09:36:12 [MgmSrvr] WARNING  -- Cluster refused allocation of id 3. Connection from ip 192.168.99.18. Returned error string "Cluste
r refused allocation of id 3. Error: 1703 (Node failure handling not completed: Permanent error: Application error)."
2006-12-19 09:36:15 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2006-12-19 09:36:15 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.99.18, m_reserved_nodes 000000000000000a.
2006-12-19 09:36:15 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 2: We are Node 2 with dynamic ID 1, our left neighbour is Node 3, our right is Node 3
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 2: Node 3: API version 5.1.14
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 2, own Node = 3, our dynamic id = 40
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: We are Node 3 with dynamic ID 40, our left neighbour is Node 2, our right is Node 2
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Node 2: API version 5.1.14
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed 
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (node restart)
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 2: DICT: locked by node 3 for NodeRestart
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 3L
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (node restart)
2006-12-19 09:36:16 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (node restart)
2006-12-19 09:36:17 [MgmSrvr] INFO     -- Node 3: Receive arbitrator node 1 [ticket=04cc002799daea3f]
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 3 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 10 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 11 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 13 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 14 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 16 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 17 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 18 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 19 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 21 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 23 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 25 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 27 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 29 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 31 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 33 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 35 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 37 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 39 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 3: DICT: index 41 activated
2006-12-19 09:36:23 [MgmSrvr] INFO     -- Node 2: Node: 3 StartLog: [GCI Keep: 270304 LastCompleted: 270475 NewestRestorable: 270482]
2006-12-19 09:36:30 [MgmSrvr] INFO     -- Node 2: Local checkpoint 213 started. Keep GCI = 270404 oldest restorable GCI = 270404
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Suma: asking node 2 to recreate subscriptions on me
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Suma: node 2 has completed restoring me
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 5 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 6 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 7 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 8 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 9 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 10 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 11 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Communication to Node 12 opened
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 8 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 9 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Start phase 100 completed (node restart)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Suma: initiate handover with nodes 0000000000000004 GCI: 270494
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Suma: handover from node 2 gci: 270494 buckets: 00000002 (2)
2006-12-19 09:36:42 [MgmSrvr] INFO     -- Node 3: Node 1: API version 5.1.14
2006-12-19 09:36:48 [MgmSrvr] INFO     -- Node 2: DICT: unlocked by node 3 for NodeRestart
2006-12-19 09:36:48 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
2006-12-19 09:36:48 [MgmSrvr] INFO     -- Node 3: Start phase 101 completed (node restart)
2006-12-19 09:36:48 [MgmSrvr] INFO     -- Node 3: Started (version 5.1.14)
[20 Dec 2006 10:38] Jonas Oreland
hi,

can you also upload ndb_3_error.log

do you still get this?
would it be possible to also upload actual filesystem (ndb_3_fs/*)
  when getting this?

/Jonas