Description:
Restarting a node from ndb_mgm can cause a node to hang in phase 4. When a node hangs, it can't be stopped or restarted from ndb_mgm. The process has to be killed.
ndb_mgm> 2 restart
.....
ndb_mgm> all status
Node 1: started (Version 5.1.11)
Node 2: starting (Phase 4) (Version 5.1.11)
[Node2 just hangs here]
No trace is produced (as the node doesn't crash).
ndb2_stdout_stderr:
[SNIP]
2006-09-19 11:56:45 [ndbd] INFO -- Restarting system
2006-09-19 11:56:45 [ndbd] INFO -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:45 [ndbd] INFO -- Ndb has terminated (pid 18326) restarting
2006-09-19 11:56:46 [ndbd] INFO -- Angel pid: 17763 ndb pid: 24687
2006-09-19 11:56:46 [ndbd] INFO -- NDB Cluster -- DB node 2
2006-09-19 11:56:46 [ndbd] INFO -- Version 5.1.11 (beta) --
2006-09-19 11:56:46 [ndbd] INFO -- Configuration fetched at localhost port 23131
2006-09-19 11:56:50 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 20Mb initial: 20Mb
WOPool::init(61, 9)
RWPool::init(82, 13)
RWPool::init(a2, 18)
RWPool::init(c2, 13)
RWPool::init(122, 17)
RWPool::init(142, 15)
WOPool::init(41, 8)
RWPool::init(e2, 12)
RWPool::init(102, 51)
WOPool::init(21, 6)
[NODE 2 hangs at this point]
CLUSTER LOGS
ndb_62_cluster.log (mgmd 1):
2006-09-19 11:38:46 [MgmSrvr] INFO -- NDB Cluster Management Server. Version 5.1.11 (beta)
[SNIP]
2006-09-19 11:39:09 [MgmSrvr] INFO -- Node 1: Started (version 5.1.11)
2006-09-19 11:39:09 [MgmSrvr] INFO -- Node 2: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO -- Node 1: Local checkpoint 2 started. Keep GCI = 1 oldest restorable GCI = 1
2006-09-19 11:39:10 [MgmSrvr] INFO -- Node 1: Node 63: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO -- Node 1: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO -- Node 2: Node 62: API version 5.1.11
2006-09-19 11:39:10 [MgmSrvr] INFO -- Node 2: Prepare arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:11 [MgmSrvr] INFO -- Node 1: Started arbitrator node 63 [ticket=47970001c5a86287]
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 61: mysqld --server-id=0
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 59: mysqld --server-id=11
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 2: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 2: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 1: Node 59 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 1: Node 61 Connected
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 1: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 2: Node 61: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 1: Node 59: API version 5.1.11
2006-09-19 11:39:17 [MgmSrvr] INFO -- Node 2: Node 59: API version 5.1.11
2006-09-19 11:56:36 [MgmSrvr] INFO -- Node 2: Node shutdown initiated
2006-09-19 11:56:44 [MgmSrvr] INFO -- Node 1: Local checkpoint 141 started. Keep GCI = 556 oldest restorable GCI = 89
2006-09-19 11:56:45 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2006-09-19 11:56:45 [MgmSrvr] INFO -- Node 1: Communication to Node 2 closed
2006-09-19 11:56:45 [MgmSrvr] ALERT -- Node 1: Network partitioning - arbitration required
2006-09-19 11:56:45 [MgmSrvr] INFO -- Node 1: President restarts arbitration thread [state=7]
2006-09-19 11:56:45 [MgmSrvr] INFO -- Node 62: Node 2 Connected
2006-09-19 11:56:45 [MgmSrvr] ALERT -- Node 1: Arbitration won - positive reply from node 63
2006-09-19 11:56:45 [MgmSrvr] INFO -- Node 2: Node shutdown completed, restarting, no start.
2006-09-19 11:56:46 [MgmSrvr] INFO -- Node 1: Started arbitrator node 62 [ticket=47970002c5b88202]
2006-09-19 11:56:47 [MgmSrvr] INFO -- Node 62: Node 2 Connected
2006-09-19 11:56:48 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000000.
2006-09-19 11:56:48 [MgmSrvr] INFO -- Node 2: Node 63 Connected
2006-09-19 11:56:50 [MgmSrvr] INFO -- Node 2: Start initiated (version 5.1.11)
2006-09-19 11:56:50 [MgmSrvr] INFO -- Node 2: Start phase 0 completed
2006-09-19 11:56:50 [MgmSrvr] INFO -- Node 2: Communication to Node 1 opened
2006-09-19 11:56:50 [MgmSrvr] INFO -- Node 2: Waiting 30 sec for nodes 0000000000000002 to connect, nodes [ all: 0000000000000006 connected: 0000000000000004 no-wait: 0000000000000000 ]
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 1: Communication to Node 2 opened
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 1: Node 2: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Node 1 Connected
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: CM_REGCONF president = 1, own Node = 2, our dynamic id = 3
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Node 1: API version 5.1.11
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Start phase 1 completed
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Start phase 2 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Start phase 4 completed (node restart)
2006-09-19 11:56:51 [MgmSrvr] INFO -- Node 2: Receive arbitrator node 62 [ticket=47970002c5b88202]
How to repeat:
The cluster data nodes use NFS to store LCPs and binary logs. This bug was not repeatable on my laptop.
Config.ini
[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=80M # Reduced to total 100M per replica
IndexMemory=20M
NoOfFragmentLogFiles=25
TimeBetweenLocalCheckpoints=6
MaxNoOfConcurrentOperations=12500
TransactionInactiveTimeout=30000 # 30seconds of inactivity=rollback
[NDB_MGMD]
Hostname=localhost
nodeid=62
portnumber=23131
DataDir=/var/lib/mysql-cluster/dbmgmd1
[NDB_MGMD]
Hostname=localhost
nodeid=63
portnumber=23132
DataDir=/var/lib/mysql-cluster/dbmgmd2
[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata1
nodeid=1
[NDBD]
HostName=localhost
datadir=/var/lib/mysql-cluster/dbdata2
nodeid=2
#
[MYSQLD]
nodeid=61
[MYSQLD]
nodeid=60
ndb_mgmd --nodaemon --no-nodeid-checks --config-file=/home/jdowling/etc/database/config.ini --ndb-connectstring=nodeid=63;localhost:23131,localhost:23132
ndbd --ndb-connectstring=nodeid=2;localhost:23131,localhost:23132 --nodaemon --nostart