Bug #46472 management server does not relase nodeids
Submitted: 30 Jul 2009 9:36 Modified: 19 Sep 2009 14:07
Reporter: Johan Andersson Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.0.6 OS:Any
Assigned to: CPU Architecture:Any
Tags: ndb_mgmd

[30 Jul 2009 9:36] Johan Andersson
Description:
I stopped a data node (id=3):

2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 5: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 5: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 5: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 5: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] INFO     -- Node 6: Communication to Node 3 closed
2009-07-29 13:20:44 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 6: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 6: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 4: Arbitration check won - node group majority
2009-07-29 13:20:44 [MgmSrvr] INFO     -- Node 4: President restarts arbitration thread [state=6]
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 4: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 4: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 4: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 5: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 6: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 6: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 4: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT    -- Node 6: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] INFO     -- Node 3: Node shutdown completed.

Clearly, Node 1 (ndb_mgmd) has seen the disconnect, but node the second management server (maybe it shouldnt, i don't know).

Then I start the data node 3 again. It dies with:
2009-07-30 13:24:10 [ndbd] INFO     -- Unable to alloc node id
2009-07-30 13:24:10 [ndbd] INFO     -- Error : Could not alloc node id at mystandby01 port 1186: Id 3 already allocated by another node.
error=2350
2009-07-30 13:24:10 [ndbd] INFO     -- Error handler restarting system
2009-07-30 13:24:10 [ndbd] INFO     -- Error handler shutdown completed - exiting
sphase=0
exit=-1

The following is printed in the clusterlog:

2009-07-29 13:22:26 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:29 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:32 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:32 [MgmSrvr] WARNING  -- Allocate nodeid (3) failed. Connection from ip 10.116.17.41. Returned error string "Id 3 already allocated by another node."
2009-07-29 13:22:32 [MgmSrvr] INFO     -- Mgmt server state: node id's 2 4 5 6 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 connected but not reserved
2009-07-29 13:22:32 [MgmSrvr] INFO     -- Mgmt server state: node id's 3 not connected but reserved
2009-07-29 13:22:39 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000002.
2009-07-29 13:23:00 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 10.116.17.41, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000000a.

It takes some time before the ndb_mgmd realize that node 3 is disconnected.
and only by waiting some time it is possible to reconnect:

2009-07-29 13:23:01 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2009-07-29 13:23:02 [MgmSrvr] INFO     -- Node 3: Start phase 0 completed 

As you can see it is several minutes before possible to reconnect!!

How to repeat:
Don't know really, but happens a lot on this system (4 data nodes, 2 ndb_mgmd, 8 mysqld (each having ndb_connpool=8) ).

Suggested fix:
-.
[19 Sep 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".