Description:
I stopped a data node (id=3):
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 5: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 5: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 5: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 5: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] INFO -- Node 6: Communication to Node 3 closed
2009-07-29 13:20:44 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 6: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 6: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 4: Arbitration check won - node group majority
2009-07-29 13:20:44 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=6]
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 4: Node 3 has failed. The Node state at failure was 0
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 4: Node failure of 3 DBDICT completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 4: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 5: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 6: Node failure of 3 DBLQH completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 6: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 4: Node failure of 3 DBTC completed
2009-07-29 13:20:44 [MgmSrvr] ALERT -- Node 6: Node 3 Disconnected
2009-07-29 13:20:44 [MgmSrvr] INFO -- Node 3: Node shutdown completed.
Clearly, Node 1 (ndb_mgmd) has seen the disconnect, but node the second management server (maybe it shouldnt, i don't know).
Then I start the data node 3 again. It dies with:
2009-07-30 13:24:10 [ndbd] INFO -- Unable to alloc node id
2009-07-30 13:24:10 [ndbd] INFO -- Error : Could not alloc node id at mystandby01 port 1186: Id 3 already allocated by another node.
error=2350
2009-07-30 13:24:10 [ndbd] INFO -- Error handler restarting system
2009-07-30 13:24:10 [ndbd] INFO -- Error handler shutdown completed - exiting
sphase=0
exit=-1
The following is printed in the clusterlog:
2009-07-29 13:22:26 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:29 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:32 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 timed out.
2009-07-29 13:22:32 [MgmSrvr] WARNING -- Allocate nodeid (3) failed. Connection from ip 10.116.17.41. Returned error string "Id 3 already allocated by another node."
2009-07-29 13:22:32 [MgmSrvr] INFO -- Mgmt server state: node id's 2 4 5 6 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 connected but not reserved
2009-07-29 13:22:32 [MgmSrvr] INFO -- Mgmt server state: node id's 3 not connected but reserved
2009-07-29 13:22:39 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000002.
2009-07-29 13:23:00 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 10.116.17.41, m_reserved_nodes 000000000000000000000000000000000000000000000000000000000000000a.
It takes some time before the ndb_mgmd realize that node 3 is disconnected.
and only by waiting some time it is possible to reconnect:
2009-07-29 13:23:01 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2009-07-29 13:23:02 [MgmSrvr] INFO -- Node 3: Start phase 0 completed
As you can see it is several minutes before possible to reconnect!!
How to repeat:
Don't know really, but happens a lot on this system (4 data nodes, 2 ndb_mgmd, 8 mysqld (each having ndb_connpool=8) ).
Suggested fix:
-.