Bug #22475 ndbd hangs in phase 4
Submitted: 19 Sep 2006 13:21    Modified: 27 Oct 2006 13:55
Reporter: Jim Dowling
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.11
OS: Linux
Assigned to: Assigned Account
CPU Architecture: Any
Tags: ndbd restart, phase 4

[19 Sep 2006 13:21] Jim Dowling
Description:
Restarting a node from ndb_mgm can cause it to hang in phase 4. A hung node can't be stopped or restarted from ndb_mgm; the process has to be killed.
ndb_mgm> 2 restart
.....
ndb_mgm> all status
Node 1: started (Version 5.1.11)
Node 2: starting (Phase 4) (Version 5.1.11)

[Node2 just hangs here]

No trace is produced (as the node doesn't crash). 

ndb2_stdout_stderr:
[SNIP]

2006-09-19 11:56:45 [ndbd] INFO     -- Restarting system
2006-09-19 11:56:45 [ndbd] INFO     -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:45 [ndbd] INFO     -- Ndb has terminated (pid 18326) restarting
2006-09-19 11:56:46 [ndbd] INFO     -- Angel pid: 17763 ndb pid: 24687
2006-09-19 11:56:46 [ndbd] INFO     -- NDB Cluster -- DB node 2
2006-09-19 11:56:46 [ndbd] INFO     -- Version 5.1.11 (beta) --
2006-09-19 11:56:46 [ndbd] INFO     -- Configuration fetched at localhost port 23131
2006-09-19 11:56:50 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 20Mb initial: 20Mb
WOPool::init(61, 9)
RWPool::init(82, 13)
RWPool::init(a2, 18)
RWPool::init(c2, 13)
RWPool::init(122, 17)
RWPool::init(142, 15)
WOPool::init(41, 8)
RWPool::init(e2, 12)
RWPool::init(102, 51)
WOPool::init(21, 6)

[NODE 2 hangs at this point]

CLUSTER LOGS

ndb_62_cluster.log (mgmd 1):

2006-09-19 11:38:46 [MgmSrvr] INFO     -- NDB Cluster Management Server. Version 5.1.11 (beta)
[SNIP]
2006-09-19 11:39:09 [MgmSrvr] INFO     -- Node 1: Started (version 5.1.11)
2006-09-19 11:39:09 [MgmSrvr] INFO     -- Node 2: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Local checkpoint 2 started. Keep GCI = 1 oldest restorable GCI = 1
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 1: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 2: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO     -- Node 2: Prepare arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:11 [MgmSrvr] INFO     -- Node 1: Started arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 61: mysqld --server-id=0
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 59: mysqld --server-id=11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 1: Node 59: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO     -- Node 2: Node 59: API version 5.1.11
2006-09-19 11:56:36 [MgmSrvr] INFO     -- Node 2: Node shutdown initiated
2006-09-19 11:56:44 [MgmSrvr] INFO     -- Node 1: Local checkpoint 141 started. Keep GCI = 556 oldest restorable GCI = 89
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 1: Communication to Node 2 closed
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Network partitioning - arbitration required
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 1: President restarts arbitration thread [state=7]
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 62: Node 2 Connected
2006-09-19 11:56:45 [MgmSrvr] ALERT    -- Node 1: Arbitration won - positive reply from node 63
2006-09-19 11:56:45 [MgmSrvr] INFO     -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:46 [MgmSrvr] INFO     -- Node 1: Started arbitrator node 62 [ticket=47970002c5b88202]
2006-09-19 11:56:47 [MgmSrvr] INFO     -- Node 62: Node 2 Connected
2006-09-19 11:56:48 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000000.
2006-09-19 11:56:48 [MgmSrvr] INFO     -- Node 2: Node 63 Connected
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Start initiated (version 5.1.11)
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Start phase 0 completed 
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Communication to Node 1 opened
2006-09-19 11:56:50 [MgmSrvr] INFO     -- Node 2: Waiting 30 sec for nodes 0000000000000002 to connect, nodes [ all: 0000000000000006 connected: 0000000000000004 no-wait: 0000000000000000 ]
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Communication to Node 2 opened
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 1: Node 2: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: CM_REGCONF president = 1, own Node = 2, our dynamic id = 3
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Node 1: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 1 completed 
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Start phase 4 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO     -- Node 2: Receive arbitrator node 62 [ticket=47970002c5b88202]

How to repeat:
The cluster data nodes use NFS to store LCPs and binary logs. This bug was not repeatable on my laptop.

Config.ini

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=80M  # Reduced to total 100M per replica
IndexMemory=20M
NoOfFragmentLogFiles=25
TimeBetweenLocalCheckpoints=6
MaxNoOfConcurrentOperations=12500
TransactionInactiveTimeout=30000        # 30 seconds of inactivity = rollback

[NDB_MGMD]
Hostname=localhost
nodeid=62
portnumber=23131
DataDir=/var/lib/mysql-cluster/dbmgmd1

[NDB_MGMD]
Hostname=localhost
nodeid=63
portnumber=23132
DataDir=/var/lib/mysql-cluster/dbmgmd2

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata1
nodeid=1

[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata2
nodeid=2
[MYSQLD]
nodeid=61
[MYSQLD]
nodeid=60

ndb_mgmd --nodaemon --no-nodeid-checks --config-file=/home/jdowling/etc/database/config.ini --ndb-connectstring="nodeid=63;localhost:23131,localhost:23132"

ndbd --ndb-connectstring="nodeid=2;localhost:23131,localhost:23132" --nodaemon --nostart
[26 Sep 2006 16:39] Jonas Oreland
Can you check whether an LCP is still running when it "hangs" (if reproducible)?
(Check the cluster log... and you can force an LCP to start with "all dump 7099".)
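For reference, dump commands like the one above can be sent from the interactive ndb_mgm prompt or non-interactively via ndb_mgm's -e/--execute option. The sketch below assumes a running cluster and uses the management-server addresses from this report's config:

```shell
# Force a local checkpoint on all data nodes (DUMP code 7099),
# then watch the cluster log for a "Local checkpoint ... started" event.
ndb_mgm -c "localhost:23131,localhost:23132" -e "all dump 7099"

# Check whether the restarting node is still stuck in a start phase.
ndb_mgm -c "localhost:23131,localhost:23132" -e "all status"
```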

FYI: I just fixed a bug in LCP that will cause this behaviour...
  (and I had been hunting that bug since 2000)

/Jonas
[27 Sep 2006 13:05] Jim Dowling
The node hangs in phase 4. No events are written to the cluster log indicating that an LCP has started on the node, and when I send the "all dump 7099" command, the node doesn't receive it (ndb_mgm gives no indication that it was received).

I ran strace -f -p <PID> on the node, and it didn't appear to be making any system calls.
Reminder: this node stores its data on an NFS server.
[27 Sep 2006 13:47] Jonas Oreland
ok... then it sounds like a duplicate of bug#20895.

This has been around for several years, so timing is a critical factor,
and running over NFS changes timing conditions dramatically...

I'll close this as a duplicate.
You can try the patch attached to the other bug (or wait for it to be included in a release)
  and reopen this bug if it does not solve the problem.

/Jonas
[27 Sep 2006 13:55] Jonas Oreland
BTW: if you upload your config.ini and do an "all dump 7012" when the hang occurs
  (and also upload the cluster log with the output from the dump), I can make 100% sure that
  this is a duplicate...
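A sketch of how that evidence could be collected, assuming a running cluster, the management-server addresses from this report's config, and mgmd 1's DataDir (the cluster-log filename follows the ndb_<nodeid>_cluster.log pattern seen above):

```shell
# Dump LCP fragment state (DUMP code 7012) into the cluster log...
ndb_mgm -c "localhost:23131,localhost:23132" -e "all dump 7012"

# ...then grab the cluster log to attach to the bug report.
cp /var/lib/mysql-cluster/dbmgmd1/ndb_62_cluster.log .
```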

(changing to need feedback)
[27 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".