Bug #58220 Unable to restart a node
Submitted: 16 Nov 2010 7:48 Modified: 28 Jul 2016 17:50
Reporter: Hariharan Madhavan Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Cluster: NDB API Severity:S2 (Serious)
Version:mysql-5.1-telco-6.2 OS:Linux (2.6.18-194.11.4.el5)
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: mysql-5.1.23 ndb-6.2.15

[16 Nov 2010 7:48] Hariharan Madhavan
Description:

id=3    @192.168.xx.xxx  (mysql-5.1.23 ndb-6.2.15, starting, Nodegroup: 0)

There are two mysql data nodes, 2 and 3. Two is always up. We stop 3 for taking backups, after which it is started again.

Offlate 3 does not start again, but stays in this state .

2010-11-16 11:47:54 [MgmSrvr] INFO     -- Node 2: DICT: remove lock by failed node 3 for NodeRestart
2010-11-16 11:47:54 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2010-11-16 11:47:54 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=6f63000652245b08]
2010-11-16 11:47:58 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2010-11-16 11:48:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.20.101, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000001a.
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 2: Node 3: API mysql-5.1.23 ndb-6.2.15
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (node restart)
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 2: DICT: locked by node 3 for NodeRestart
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 3L
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (node restart)
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (node restart)
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Node 3: Receive arbitrator node 1 [ticket=6f63000652245b08]
2010-11-16 11:48:28 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012.
2010-11-16 11:48:36 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 reserved for ip 192.168.20.101, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000032.
2010-11-16 11:48:36 [MgmSrvr] INFO     -- Node 5: mysqld --server-id=0
2010-11-16 11:48:36 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2010-11-16 11:48:36 [MgmSrvr] INFO     -- Node 2: Node 5: API mysql-5.1.23 ndb-6.2.15
2010-11-16 12:32:52 [MgmSrvr] ALERT    -- Node 2: Node 5 Disconnected
2010-11-16 12:32:52 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 closed
2010-11-16 12:32:52 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2010-11-16 12:32:53 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012.
2010-11-16 12:32:56 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 opened
2010-11-16 12:33:11 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 9.
2010-11-16 12:33:11 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2010-11-16 12:33:11 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2010-11-16 12:33:11 [MgmSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
2010-11-16 12:33:11 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=7]
2010-11-16 12:33:11 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2010-11-16 12:33:11 [MgmSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 1
2010-11-16 12:33:11 [MgmSrvr] INFO     -- Node 2: DICT: remove lock by failed node 3 for NodeRestart
2010-11-16 12:33:11 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2010-11-16 12:33:11 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=6f630007524dd0b6]
2010-11-16 12:33:15 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2010-11-16 12:33:38 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.20.101, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000001a.
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 2: Node 3: API mysql-5.1.23 ndb-6.2.15
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (node restart)
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 2: DICT: locked by node 3 for NodeRestart
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 3L
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (node restart)
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Node 3: Start phase 4 completed (node restart)
2010-11-16 12:33:39 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000012.
2010-11-16 12:33:40 [MgmSrvr] INFO     -- Node 3: Receive arbitrator node 1 [ticket=6f630007524dd0b6]
2010-11-16 12:33:47 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 reserved for ip 192.168.20.101, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000032.
2010-11-16 12:33:47 [MgmSrvr] INFO     -- Node 5: mysqld --server-id=0
2010-11-16 12:33:48 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2010-11-16 12:33:48 [MgmSrvr] INFO     -- Node 2: Node 5: API mysql-5.1.23 ndb-6.2.15

# tail -n 50 ndb_3_out.log
2010-11-16 11:24:33 [ndbd] INFO     -- Configuration fetched at 192.168.10.150 port 1186
2010-11-16 11:24:33 [ndbd] INFO     -- Start initiated (mysql-5.1.23 ndb-6.2.15)
2010-11-16 11:24:33 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 80Mb initial: 100Mb
Adding 100Mb to ZONE_LO (1,3199)
WOPool::init(61, 9)
RWPool::init(22, 13)
RWPool::init(42, 18)
RWPool::init(62, 13)
RWPool::init(c2, 18)
RWPool::init(e2, 15)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)
2010-11-16 11:44:48 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 9.
2010-11-16 11:45:22 [ndbd] INFO     -- Angel pid: 13131 ndb pid: 13132
2010-11-16 11:45:22 [ndbd] INFO     -- NDB Cluster -- DB node 3
2010-11-16 11:45:22 [ndbd] INFO     -- mysql-5.1.23 ndb-6.2.15 --
2010-11-16 11:45:22 [ndbd] INFO     -- Configuration fetched at 192.168.10.150 port 1186
2010-11-16 11:45:22 [ndbd] INFO     -- Start initiated (mysql-5.1.23 ndb-6.2.15)
2010-11-16 11:45:22 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 80Mb initial: 100Mb
Adding 100Mb to ZONE_LO (1,3199)
WOPool::init(61, 9)
RWPool::init(22, 13)
RWPool::init(42, 18)
RWPool::init(62, 13)
RWPool::init(c2, 18)
RWPool::init(e2, 15)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)
2010-11-16 12:30:05 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 9.
2010-11-16 12:30:33 [ndbd] INFO     -- Angel pid: 13819 ndb pid: 13820
2010-11-16 12:30:33 [ndbd] INFO     -- NDB Cluster -- DB node 3
2010-11-16 12:30:33 [ndbd] INFO     -- mysql-5.1.23 ndb-6.2.15 --
2010-11-16 12:30:33 [ndbd] INFO     -- Configuration fetched at 192.168.10.150 port 1186
2010-11-16 12:30:33 [ndbd] INFO     -- Start initiated (mysql-5.1.23 ndb-6.2.15)
2010-11-16 12:30:33 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 80Mb initial: 100Mb
Adding 100Mb to ZONE_LO (1,3199)
WOPool::init(61, 9)
RWPool::init(22, 13)
RWPool::init(42, 18)
RWPool::init(62, 13)
RWPool::init(c2, 18)
RWPool::init(e2, 15)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)

The ndb_error log is not getting updated..... It dates back to May...

ndb_trace log also is not getting updated...... Its dates back to May..

How to repeat:
Just restart the server, I have to shutdown the whole cluster and do a rolling restart to get things back to normal...

This means downtime :(
[20 Dec 2010 14:36] Jonas Oreland
"killed by signal 9" suggests that it's linux OOM killer in action.
Please check your syslog...

/Jonas
[28 Jul 2016 17:50] MySQL Verification Team
definitely not a bug