MySQL Bugs: #16308: Storage node fails to restart after failure

Bug #16308	Storage node fails to restart after failure
Submitted:	9 Jan 2006 18:32	Modified:	11 Apr 2006 19:42
Reporter:	Jon Anderson	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.16 (max)	OS:	Linux (Linux (Debian, 2.6.12/14 kernel))
Assigned to:		CPU Architecture:	Any

Description:
Summary: In a small cluster (2 storage nodes, 2 sql nodes on 2 machines), one storage node failed and could not be restarted.

Note: This may be related to http://bugs.mysql.com/bug.php?id=14408, though I'm not certain. The bugs are reasonably different, yet share an important similarity: hang during startup phase 5.

I'm using MySQL 5.0.16 (max-x86 binaries) on a cluster with two storage nodes. One storage node (ndbd) failed, and the node won't start back up again - any progress ceases after phase 4.

The NDBD log itself shows absolutely nothing about the problem. The mgmd node's logs seem to show that the ndbd never gets past startup phase 5.

The last line from the first attempted restart was this:
2005-12-20 13:31:24 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (node restart)

ndb_mgm shows that node 4 is "Starting" for days, and there is no information from Node 4 (literally nothing - not one message) in the log until days later, when I killed the process to try again. From the 23rd to the 28th there was almost no activity/load on the servers due to the holidays.

Result of 'killall ndbd':
2005-12-28 11:50:33 [MgmSrvr] INFO     -- Node 4: Node shutdown completed. Initiated by signal 15. 

I tried to restart again with --initial, and it still didn't progress beyond phase 5.

Since there's no error or crash during restart, there's no error or trace logs (and the out log doesn't show anything either), so I'm completely in the dark on this one.

I do however have logs and traces for the error that lead up to the problem (a snippet, with some extra lines on each side for context):

--- ndb_1_cluster.log.1, lines 10187-10228 ---
2005-12-19 10:50:44 [MgmSrvr] INFO     -- Node 3: Local checkpoint 826 started. Keep GCI = 1208164 oldest restorable GCI = 1208016
2005-12-19 11:51:38 [MgmSrvr] INFO     -- Node 3: Local checkpoint 827 started. Keep GCI = 1209576 oldest restorable GCI = 1209647
2005-12-19 12:52:36 [MgmSrvr] INFO     -- Node 3: Local checkpoint 828 started. Keep GCI = 1211233 oldest restorable GCI = 1210076
2005-12-19 13:36:23 [MgmSrvr] INFO     -- Node 3: Local checkpoint 829 started. Keep GCI = 1212895 oldest restorable GCI = 1210076
2005-12-19 14:33:09 [MgmSrvr] INFO     -- Node 3: Local checkpoint 830 started. Keep GCI = 1214078 oldest restorable GCI = 1214074
2005-12-19 14:34:13 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2005-12-19 14:34:15 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2005-12-19 14:34:15 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 3
2005-12-19 14:34:16 [MgmSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 4
2005-12-19 14:34:16 [MgmSrvr] ALERT    -- Node 3: Node 4 declared dead due to missed heartbeat
2005-12-19 14:34:16 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2005-12-19 14:34:16 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2005-12-19 14:34:16 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2005-12-19 14:34:16 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2005-12-19 14:34:17 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2005-12-19 14:34:17 [MgmSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=499c00044483d72e]
2005-12-19 14:34:18 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2005-12-19 14:35:12 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened
2005-12-20 13:31:17 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 reserved for ip 192.168.1.12, m_reserved_nodes 0000000000000076.
2005-12-20 13:31:17 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2005-12-20 13:31:18 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 freed, m_reserved_nodes 0000000000000066.
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 5
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 3: Node 4: API version 5.0.16
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 4: Node 3: API version 5.0.16
2005-12-20 13:31:22 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed
2005-12-20 13:31:23 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=499c00044483d72e]
2005-12-20 13:31:23 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (node restart)
2005-12-20 13:31:23 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (node restart)
2005-12-20 13:31:24 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (node restart)
2005-12-22 10:57:02 [MgmSrvr] INFO     -- Node 3: Data usage increased to 80%(13133 32K pages of total 16384)
2005-12-22 10:57:08 [MgmSrvr] INFO     -- Node 3: Data usage decreased to 79%(12989 32K pages of total 16384)
2005-12-22 11:19:33 [MgmSrvr] INFO     -- Node 3: Data usage increased to 80%(13123 32K pages of total 16384)
2005-12-26 18:02:58 [MgmSrvr] INFO     -- Node 3: Data usage increased to 90%(14857 32K pages of total 16384)
2005-12-28 11:50:32 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2005-12-28 11:50:32 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2005-12-28 11:50:32 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2005-12-28 11:50:32 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2005-12-28 11:50:32 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2005-12-28 11:50:32 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2005-12-28 11:50:33 [MgmSrvr] INFO     -- Node 4: Node shutdown completed. Initiated by signal 15.
--- end ndb_1_cluster.log.1 ---

--- ndb_4_error.log, lines 64-74  ---
Time: Monday 19 December 2005 - 14:34:18
Status: Temporary error, restart node
Message: Arbitrator shutdown, please investigate error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 3820) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 21576
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.8
Version: Version 5.0.16
***EOM***
--- end ndb_4_error.log ---

If I can provide any other information that might be useful, please let me know - I'd be glad to share any information you could use to improve mysql cluster.

How to repeat:
Unknown: I don't know for certain what caused the ndbd failure in the first place (thus I don't know how to repeat it), but any attempt to restart the server (with or without --initial) after the ndbd failure would result in the same problem: ndbd can't get past startup phase 5.

I've read through all the cluster documentation I can find, and the only thing that stands out is that ALTER TABLE is not recommended for NDB tables. ALTER TABLE would have been used a couple times. Could that be related?

We also have a second cluster that was running 5.0.17-max-x86_64, and we shutdown it's second storage node ('3 stop') for maintenance, and it would also fail to restart in the same manner, but there is literally no information I can provide on that one because there is no "error", just a log that stops at Phase 4 completed.

Suggested fix:
Running on 1 ndbd node was the only solution for us for several days.

We decided on a maintenance window, informed users of maintenance related downtime, backed up the cluster data, and shut down the entire cluster. We ran an upgrade to 5.0.18, started both ndbd nodes one at a time, and both nodes started fine. We did not need to restore from backup, all data appeared to be intact.

This is adequate, but it would be preferable if we could recover from a node failure without downtime. :-)

Here is another instance of this issue, this time with 5.0.18. The ndbd process has been in startup phase 5 for just over a day now.

This time, the ndbd process was killed by the kernel (OOM) because of a runaway process run by a developer January 16 2006 at about 14:35 EST, and restarted about 3 minutes later. It is still "starting" now - Jan 17, 14:57 EST. (There is no mysql error associated with the ndbd process termination, because it was not a mysql fault.)

Trying to start ndbd just doesn't work. Running either 'ndbd' or 'ndbd --initial' leads to the same thing: "id=4    @<ip>  (Version: 5.0.18, starting, Nodegroup: 0)"

The last lines of ndb_4_out.log shows no useful information as to why it won't start:

--- ndb_4_out.log snippet ---
2006-01-16 14:35:28 [ndbd] INFO     -- Angel pid: 26759 ndb pid: 21916
2006-01-16 14:35:28 [ndbd] INFO     -- NDB Cluster -- DB node 4
2006-01-16 14:35:28 [ndbd] INFO     -- Version 5.0.18 --
2006-01-16 14:35:28 [ndbd] INFO     -- Configuration fetched at 192.168.1.11 port 1186
2006-01-16 14:35:28 [ndbd] INFO     -- Start initiated (version 5.0.18)
--- end ndb_4_out.log ---

The last lines of the cluster log also do not seem to show useful information.

--- ndb_2_cluster.log snippet ---
2006-01-16 14:34:53 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened
2006-01-16 14:35:30 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: Start phase 0 completed
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 6
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 3: Node 4: API version 5.0.18
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: Node 3: API version 5.0.18
2006-01-16 14:37:51 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed
2006-01-16 14:37:52 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (initial node restart)
2006-01-16 14:37:52 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=00590005d4b66637]
2006-01-16 14:38:47 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (initial node restart)
2006-01-16 14:38:53 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (initial node restart)
--- end ndb_2_cluster.log ---

I don't know what other information I can provide, because there isn't any "error" of any kind. It just doesn't start. Is there any way I can get more information from the cluster to send to you?

I encountered a similiar problem.
I'm running 4.1.18 with 2 storage nodes, 2 SQL nodes and 2 management nodes.
The storage nodes are running with 12GB RAM each on a 64bit RedHat ES.
During my initial tests killing a storage node and bringing it back was no problem.

After filling the cluster with data 9GB worth I killed the nodegroup master again. Restarting it with "ndbd" brings the same symptoms as told by the other posters (my logfile looks identical). I waited for several hours with my storage nodes idleing. I killed it again and tried it with --initial. This time the nodes idled for 25 minutes until the restarted node begun catching up. (During both restart tests I didn't induce any select or update load on the cluster from the SQL nodes!) 

So I was wondering if any data was sent between the nodes during the initial idle times - and found something interesting that might be a lead. The management nodes continuously bombard the restarting node with TCP connections, all saying the same:

ndbd
ndbd passwd
OK
31 1

Please find attached to this bug a pcap file with 2 seconds of captures (cleaned from other stuff that was traced). This capture was done during the above mentioned 25 minutes idle time (that luckily turned into a working catch-up).

I just found out I'm not allowed to attach a file to this bug as I'm neither the original poster nor a developer.

So below is a plain text output of etherreal of the first connection I traced. If anybody is interested in getting the pcap file please let me know and I'll send it to you.

During I write this I notice that EVERY answer packet from the storage node containing the "ok" has a wrong TCP checksum!

Jon, could you please tcpdump such an event and look if you also encounter wrong tcp checksums?

I know that the ndbd process is not responsible for wrong TCP checksums, but I find it worthwhile mentioning that only exactly one type of payload gets corrupted every time.

No.     Time        Source                Destination           Protocol Info
      1 0.000000    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [SYN] Seq=0 Ack=0 Win=5840 Len=0 MSS=1460 TSV=136845986 TSER=0 WS=2

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 3c ac ae 40 00 40 06 8b d5 c0 a8 c0 64 c0 a8   .<..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0c 00 00 00 00 a0 02   ................
0030  16 d0 e0 bc 00 00 02 04 05 b4 04 02 08 0a 08 28   ...............(
0040  1a a2 00 00 00 00 01 03 03 02                     ..........

No.     Time        Source                Destination           Protocol Info
      2 0.000016    192.168.192.130       192.168.192.100       TCP      2202 > 33044 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=419220897 TSER=136845986 WS=2

0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 3c 00 00 40 00 40 06 38 84 c0 a8 c0 82 c0 a8   .<..@.@.8.......
0020  c0 64 08 9a 81 14 62 e2 0c bd eb ba b6 0d a0 12   .d....b.........
0030  16 a0 8a 9e 00 00 02 04 05 b4 04 02 08 0a 18 fc   ................
0040  cd a1 08 28 1a a2 01 03 03 02                     ...(......

No.     Time        Source                Destination           Protocol Info
      3 0.000501    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=136845987 TSER=419220897

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac af 40 00 40 06 8b dc c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0d 62 e2 0c be 80 10   ..........b.....
0030  05 b4 ca 50 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...P.......(....
0040  cd a1                                             ..

No.     Time        Source                Destination           Protocol Info
      4 0.000517    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=5 TSV=136845987 TSER=419220897

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 39 ac b0 40 00 40 06 8b d6 c0 a8 c0 64 c0 a8   .9..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0d 62 e2 0c be 80 18   ..........b.....
0030  05 b4 ef 7a 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...z.......(....
0040  cd a1 6e 64 62 64 0a                              ..ndbd.

No.     Time        Source                Destination           Protocol Info
      5 0.000526    192.168.192.130       192.168.192.100       TCP      2202 > 33044 [ACK] Seq=1 Ack=6 Win=5792 Len=0 TSV=419220898 TSER=136845987

0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 1c 40 00 40 06 77 6f c0 a8 c0 82 c0 a8   .4..@.@.wo......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 12 80 10   .d....b.........
0030  05 a8 ca 56 00 00 01 01 08 0a 18 fc cd a2 08 28   ...V...........(
0040  1a a3                                             ..

No.     Time        Source                Destination           Protocol Info
      6 0.001001    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [PSH, ACK] Seq=6 Ack=1 Win=5840 Len=12 TSV=136845987 TSER=419220898

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 40 ac b1 40 00 40 06 8b ce c0 a8 c0 64 c0 a8   .@..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 12 62 e2 0c be 80 18   ..........b.....
0030  05 b4 a0 08 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...........(....
0040  cd a2 6e 64 62 64 20 70 61 73 73 77 64 0a         ..ndbd passwd.

No.     Time        Source                Destination           Protocol Info
      7 0.001006    192.168.192.130       192.168.192.100       TCP      2202 > 33044 [ACK] Seq=1 Ack=18 Win=5792 Len=0 TSV=419220898 TSER=136845987

0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 1e 40 00 40 06 77 6d c0 a8 c0 82 c0 a8   .4..@.@.wm......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 1e 80 10   .d....b.........
0030  05 a8 ca 4a 00 00 01 01 08 0a 18 fc cd a2 08 28   ...J...........(
0040  1a a3                                             ..

No.     Time        Source                Destination           Protocol Info
      8 0.001062    192.168.192.130       192.168.192.100       TCP      2202 > 33044 [PSH, ACK] Seq=1 Ack=18 Win=5792 [TCP CHECKSUM INCORRECT] Len=3 TSV=419220898 TSER=136845987

0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 37 c1 20 40 00 40 06 77 68 c0 a8 c0 82 c0 a8   .7. @.@.wh......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 1e 80 18   .d....b.........
0030  05 a8 02 62 00 00 01 01 08 0a 18 fc cd a2 08 28   ...b...........(
0040  1a a3 6f 6b 0a                                    ..ok.

No.     Time        Source                Destination           Protocol Info
      9 0.001502    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [ACK] Seq=18 Ack=4 Win=5840 Len=0 TSV=136845987 TSER=419220898

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac b2 40 00 40 06 8b d9 c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 1e 62 e2 0c c1 80 10   ..........b.....
0030  05 b4 ca 3b 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...;.......(....
0040  cd a2                                             ..

No.     Time        Source                Destination           Protocol Info
     10 0.001507    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [PSH, ACK] Seq=18 Ack=4 Win=5840 Len=5 TSV=136845987 TSER=419220898

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 39 ac b3 40 00 40 06 8b d3 c0 a8 c0 64 c0 a8   .9..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 1e 62 e2 0c c1 80 18   ..........b.....
0030  05 b4 6c cc 00 00 01 01 08 0a 08 28 1a a3 18 fc   ..l........(....
0040  cd a2 33 31 20 31 0a                              ..31 1.

No.     Time        Source                Destination           Protocol Info
     11 0.001544    192.168.192.130       192.168.192.100       TCP      2202 > 33044 [FIN, ACK] Seq=4 Ack=23 Win=5792 Len=0 TSV=419220899 TSER=136845987

0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 22 40 00 40 06 77 69 c0 a8 c0 82 c0 a8   .4."@.@.wi......
0020  c0 64 08 9a 81 14 62 e2 0c c1 eb ba b6 23 80 11   .d....b......#..
0030  05 a8 ca 40 00 00 01 01 08 0a 18 fc cd a3 08 28   ...@...........(
0040  1a a3                                             ..

No.     Time        Source                Destination           Protocol Info
     12 0.002001    192.168.192.100       192.168.192.130       TCP      33044 > 2202 [FIN, ACK] Seq=23 Ack=5 Win=5840 Len=0 TSV=136845987 TSER=419220899

0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac b4 40 00 40 06 8b d7 c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 23 62 e2 0c c2 80 11   .........#b.....
0030  05 b4 ca 33 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...3.......(....
0040  cd a3                                             ..

I have encountered the same problem with 5.0.19 (mysql-max-5.0.19-solaris10-sparc-64bit.tar.gz). I have a small test configuration: 2 machines with a storage node and SQL node each, and a third machine running a management node.

I am testing recovery from a data node failure. I induce the data node to fail by sending it either a SIGKILL or SIGSEGV, typically while performing insertion operations at a rate of around 15 per second. As with the other postings, the data node sometimes fails to complete its startup, hanging in Phase 5. During the recovery time, there is no database activity. While it is hard to establish a firm correlation, it seems that this problem is more likely to occur if the data node was killed after it had been running for a while (say, an hour or more).

The only way that I have been able to work around the problem is to restart the whole cluster.

The normal startup time for the data node in my test environment is typically between 1.5 and 3 minutes (the database contains about 10000 records). As a result, I normally wait only about 5 minutes before I consider the node startup to be hung, and restart the whole cluster. However, I have seen cases (when I was patient enough) where the node had remained hung overnight.

I agree with the original posting that this bug may be related to http://bugs.mysql.com/bug.php?id=14408, even though I am have performing only about 15 insertions per second (though each insertion is of 100 records) when the data node is induced to fail. As well, ndbd --initial does not help me.

Since we have been able to reproduce this,
  could you try a newer 5.0 version...there has been
  a few bug fixes that could match your problem

More recent than the 5.0.19 that I am using?  It appears to me that 5.0.19 is the most recent version.  Or do you mean more recent than the 5.0.18 that Jon Anderson was using?

Sorry about that...
I was refering to more recent that 5.0.16 that is mentioned in header of bug...

Hi
can you check if LocalCheckpoint seem to be running.

The simplest way of doing this is
> ndb_mgm -e "all dump 7099" 

Then you should see Local Checkpoint XXX soon afterwards in cluster log.

The last time that I had conclusively seen the problem was Mar 29.  My test environment would have changed only slightly since then; however, I have not been able to reproduce the problem in several attempts over the past day.

I am concerned that this problem continues to exist ... the worst problems are those that cannot be reproduced.

In the meantime, even though I could not reproduce the problem, I did run:
> ndb_mgm -e "all dump 7099"

Looking in the log, I did see "Node x: Local checkpoint xxx started. ..." for one of the nodes, but not the other.  Ahh ... okay ... it seems only the node acting as the master performs the checkpoint.

I will let you know when I see the problem again.

>The last time that I had conclusively seen the problem was Mar 29.  My test
>environment would have changed only slightly since then; however, I have not
>been able to reproduce the problem in several attempts over the past day.

>I am concerned that this problem continues to exist ... the worst problems are
>those that cannot be reproduced.

agree :-(

>In the meantime, even though I could not reproduce the problem, I did run:
>> ndb_mgm -e "all dump 7099"

>Looking in the log, I did see "Node x: Local checkpoint xxx started. ..." for
>one of the nodes, but not the other.  Ahh ... okay ... it seems only the node
>acting as the master performs the checkpoint.

All nodes participate in checkpoint, but only master makes this printout.

>I will let you know when I see the problem again.

ok, thx

/Jonas

I keep our cluster nodes up to date, so we're running 5.0.19 accross the board, and I'm downloading 5.0.20 as I write this. (It's not on the mirror page, but there are binaries for it which I will install on our test cluster now, and on the production cluster in a few days if all goes well.)

We have not seen this bug since version 5.0.19 was installed (and maybe even earlier - I don't remember). I haven't commented again so far because I haven't seen the error recently. I don't consider this bug to be a problem anymore simply because it hasn't occurred in months, where it used to occur often. We had a critical failure last night involving data loss (which I'm about to file a bug about), one node had to be restarted but it started back up normally.

I spoke too soon:

Currently on our development cluster:

ndb_mgm> show
Connected to Management Server at: 192.168.1.14:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.1.14  (Version: 5.0.19, Nodegroup: 0, Master)
id=3    @192.168.1.15  (Version: 5.0.19, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.1.14  (Version: 5.0.19)

[mysqld(API)]   2 node(s)
id=4    @192.168.1.14  (Version: 5.0.19)
id=5    @192.168.1.15  (Version: 5.0.19)

This time, this node is stuck in phase 1. I'm going to install 5.0.20 now. I'll keep you posted as to what I find. If there's any other piece of information you'd like from me, please let me know.

This could possibly be #18612
Which I very recently fixed (5.0.21)

Anyway, please upload clusterlog when you get hanging restarts.
Also, if you done any changes in your config.ini wrt Start*Timeout
  please upload that too.

/Jonas

I didn't attach the cluster log, because there isn't much to see in it. The node connects, goes through phase 0 and 1, then no further messages from that node. This doesn't appear to happen very often. I will be upgrading to 5.0.21 as soon as it's released. Thanks again. :-)

-- ndb_1_cluster.log, partial --
...
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Start phase 9 completed (system restart)
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Started (version 5.0.19)
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Node 4: API version 5.0.19
2006-04-07 11:08:20 [MgmSrvr] INFO     -- Node 2: Node 5: API version 5.0.19
2006-04-07 11:08:21 [MgmSrvr] INFO     -- Node 2: Node 1: API version 5.0.19
2006-04-07 11:08:22 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=042c000174e4b37d]
2006-04-07 11:30:32 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.1.15, m_reserved_nodes 000000000000003a.
2006-04-07 11:30:32 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-04-07 11:30:33 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000032.
2006-04-07 11:31:59 [MgmSrvr] INFO     -- Node 3: Start phase 0 completed
2006-04-07 11:31:59 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened
2006-04-07 11:32:29 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2006-04-07 12:05:47 [MgmSrvr] INFO     -- Node 2: Local checkpoint 1828 started. Keep GCI = 2782960 oldest restorable GCI = 2782960
2006-04-07 13:23:27 [MgmSrvr] INFO     -- Node 2: Local checkpoint 1829 started. Keep GCI = 2784489 oldest restorable GCI = 2784685
2006-04-07 14:41:04 [MgmSrvr] INFO     -- Node 2: Local checkpoint 1830 started. Keep GCI = 2786503 oldest restorable GCI = 2784685
...

I actually found the cluster log _very_ informative.
The lack of connection establishment before start phase 1 is complete
makes me almost 100% sure that this is bug#18612.

I made significat changes in 5.0.21 to prevent this from happing,
so I suggest changing to this version.

I suggest adding the following line to your config.ini
which will decrease likelyhood of this happening.

StartPartialTimeout: 60000 
This will force node to wait 60s before starting on it's own.
(look in documentation)

(in 5.0.21 this value has changed to 90s, and we're concidering
changing to ~s as
 we also have introduced some new options for forcing partial starts)

I'm also closing this as duplicate,
please reopen it again if you encounter this again.

/Jonas

When I had the problem, the storage node would hang in start phase 5 (exclusively, I believe).  Does the fix apply for this case?