Bug #22475 ndbd hangs in phase 4
Submitted: 19 Sep 2006 13:21    Modified: 27 Oct 2006 13:55
Reporter: Jim Dowling
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.11
OS: Linux
Assigned to: Assigned Account
CPU Architecture: Any
Tags: ndbd restart, phase 4

[19 Sep 2006 13:21] Jim Dowling
Description:
Restarting a node from ndb_mgm can cause it to hang in phase 4. A hung node can't be stopped or restarted from ndb_mgm; the process has to be killed.
ndb_mgm> 2 restart
.....
ndb_mgm> all status
Node 1: started (Version 5.1.11)
Node 2: starting (Phase 4) (Version 5.1.11)

[Node2 just hangs here]

No trace is produced (as the node doesn't crash). 

ndb2_stdout_stderr:
[SNIP]

2006-09-19 11:56:45 [ndbd] INFO     -- Restarting system
2006-09-19 11:56:45 [ndbd] INFO     -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:45 [ndbd] INFO     -- Ndb has terminated (pid 18326) restarting
2006-09-19 11:56:46 [ndbd] INFO     -- Angel pid: 17763 ndb pid: 24687
2006-09-19 11:56:46 [ndbd] INFO     -- NDB Cluster -- DB node 2
2006-09-19 11:56:46 [ndbd] INFO     -- Version 5.1.11 (beta) --
2006-09-19 11:56:46 [ndbd] INFO     -- Configuration fetched at localhost port 23131
2006-09-19 11:56:50 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 20Mb initial: 20Mb
WOPool::init(61, 9)
RWPool::init(82, 13)
RWPool::init(a2, 18)
RWPool::init(c2, 13)
RWPool::init(122, 17)
RWPool::init(142, 15)
WOPool::init(41, 8)
RWPool::init(e2, 12)
RWPool::init(102, 51)
WOPool::init(21, 6)

[NODE 2 hangs at this point]

CLUSTER LOGS

ndb_62_cluster.log (mgmd 1):

2006-09-19 11:38:46 [MgmSrvr] INFO     -- NDB Cluster Management Server. Version 5.1.11 (beta)
[SNIP]
2006-09-19 11:39:09 [MgmSrvr] INFO     -- Node 1: Started (version 5.1.11)
2006-09-19 11:39:09 [MgmSrvr] INFO     -- Node 2: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Local checkpoint 2 started. Keep GCI = 1 oldest restorable GCI = 1
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 2: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 2: Prepare arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:11 [MgmSrvr] INFO     -- Node 1: Started arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 61: mysqld --server-id=0
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 59: mysqld --server-id=11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 59: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 59: API version 5.1.11
2006-09-19 11:56:36 [MgmSrvr] INFO     -- Node 2: Node shutdown initiated
2006-09-19 11:56:44 [MgmSrvr] INFO     -- Node 1: Local checkpoint 141 started. Keep GCI = 556 oldest restorable GCI = 89
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 1: Communication to Node 2 closed
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Network partitioning - arbitration required
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 1: President restarts arbitration thread [state=7]
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 62: Node 2 Connected
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Arbitration won - positive reply from node 63
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:46 [MgmSrvr] INFO     -- Node 1: Started arbitrator node 62 [ticket=47970002c5b88202]
2006-09-19 11:56:47 [MgmSrvr] INFO     -- Node 62: Node 2 Connected
2006-09-19 11:56:48 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000000.
2006-09-19 11:56:48 [MgmSrvr] INFO     -- Node 2: Node 63 Connected
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Start initiated (version 5.1.11)
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Start phase 0 completed 
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Communication to Node 1 opened
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Waiting 30 sec for nodes 0000000000000002 to connect, nodes [ all: 0000000000000006 connected: 0000000000000004 no-wait: 0000000000000000 ]
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Communication to Node 2 opened
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Node 2: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: CM_REGCONF president = 1, own Node = 2, our dynamic id = 3
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Node 1: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 1 completed 
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Receive arbitrator node 62 [ticket=47970002c5b88202]

How to repeat:
The cluster data nodes use NFS to store LCPs and binary logs. This bug was not repeatable on my laptop.

Config.ini

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=80M  # Reduced to total 100M per replica
IndexMemory=20M
NoOfFragmentLogFiles=25
TimeBetweenLocalCheckpoints=6
MaxNoOfConcurrentOperations=12500
TransactionInactiveTimeout=30000        # 30 seconds of inactivity = rollback

[NDB_MGMD]
Hostname=localhost
nodeid=62
portnumber=23131
DataDir=/var/lib/mysql-cluster/dbmgmd1

[NDB_MGMD]
Hostname=localhost
nodeid=63
portnumber=23132
DataDir=/var/lib/mysql-cluster/dbmgmd2

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata1
nodeid=1

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata2
nodeid=2
[MYSQLD]
nodeid=61
[MYSQLD]
nodeid=60

ndb_mgmd --nodaemon --no-nodeid-checks --config-file=/home/jdowling/etc/database/config.ini --ndb-connectstring="nodeid=63;localhost:23131,localhost:23132"

ndbd --ndb-connectstring="nodeid=2;localhost:23131,localhost:23132" --nodaemon --nostart
[26 Sep 2006 16:39] Jonas Oreland
Can you check whether an LCP is still running when it "hangs" (if reproducible)?
(Check the cluster log... and you can force an LCP to start with "all dump 7099".)
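For reference, dump commands like the one above can be sent from the interactive ndb_mgm prompt or non-interactively via ndb_mgm's -e/--execute option. The sketch below assumes a running cluster and uses the management-server addresses from this report's config:

```shell
# Force a local checkpoint on all data nodes (DUMP code 7099),
# then watch the cluster log for a "Local checkpoint ... started" event.
ndb_mgm -c "localhost:23131,localhost:23132" -e "all dump 7099"

# Check whether the restarting node is still stuck in a start phase.
ndb_mgm -c "localhost:23131,localhost:23132" -e "all status"
```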

FYI: I just fixed a bug in LCP that will cause this behaviour...
  (and I had been hunting that bug since 2000)

/Jonas
[27 Sep 2006 13:05] Jim Dowling
The node hangs in phase 4. No events are written to the cluster log indicating that an LCP has started on the node, and when I send the "all dump 7099" command, the node doesn't receive it (ndb_mgm gives no indication that it was received).

I ran strace -f -p <PID> on the node, and it didn't appear to be making any system calls.
Reminder: this node stores its data on an NFS server.
[27 Sep 2006 13:47] Jonas Oreland
ok... then it sounds like a duplicate of bug#20895.

This has been around for several years, so timing is a critical factor,
and running over NFS changes timing conditions dramatically...

I'll close this as a duplicate.
You can try the patch attached to the other bug (or wait for it to be included in a release)
  and reopen this bug if it does not solve the problem.

/Jonas
[27 Sep 2006 13:55] Jonas Oreland
BTW: if you upload your config.ini and do an "all dump 7012" when the hang occurs
  (and also upload the cluster log with the output from the dump), I can make 100% sure that
  this is a duplicate...
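A sketch of how that evidence could be collected, assuming a running cluster, the management-server addresses from this report's config, and mgmd 1's DataDir (the cluster-log filename follows the ndb_<nodeid>_cluster.log pattern seen above):

```shell
# Dump LCP fragment state (DUMP code 7012) into the cluster log...
ndb_mgm -c "localhost:23131,localhost:23132" -e "all dump 7012"

# ...then grab the cluster log to attach to the bug report.
cp /var/lib/mysql-cluster/dbmgmd1/ndb_62_cluster.log .
```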

(changing to need feedback)
[27 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".