MySQL Bugs: #11659: cluster nodes can't rejoin after system reboot

Bug #11659	cluster nodes can't rejoin after system reboot
Submitted:	30 Jun 2005 10:40	Modified:	14 Sep 2005 5:22
Reporter:	Jan Kneschke
Status:	Won't fix
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	4.1.12	OS:	Linux (Linux/x86)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
After a system reboot, neither API nodes nor storage nodes can rejoin the cluster.

How to repeat:
Set up a cluster with 1 mgmd, 4 nodes, and 1 mysqld, and put 2 ndbds + the mysqld on the same host.

Reboot the system that runs the mysqld and the 2 ndbds.

The mysqld has nodeid = 7:

2005-06-30 12:06:16 [MgmSrvr] INFO     -- Mgmt server state: nodeid 7 reserved for ip 192.168.1.5, m_reserved_nodes 00000000000000fe.
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 5: Node 7 Connected
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 4: Node 7 Connected
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 2: Node 7 Connected
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 3: Node 7 Connected
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 2: Node 7: API version 4.1.12
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 3: Node 7: API version 4.1.12
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 4: Node 7: API version 4.1.12
2005-06-30 12:06:16 [MgmSrvr] INFO     -- Node 5: Node 7: API version 4.1.12
2005-06-30 12:08:58 [MgmSrvr] WARNING  -- Node 2: Node 7 missed heartbeat 2
2005-06-30 12:08:58 [MgmSrvr] WARNING  -- Node 4: Node 7 missed heartbeat 2
2005-06-30 12:08:58 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2005-06-30 12:09:00 [MgmSrvr] WARNING  -- Node 2: Node 7 missed heartbeat 3
2005-06-30 12:09:00 [MgmSrvr] WARNING  -- Node 4: Node 7 missed heartbeat 3
2005-06-30 12:09:00 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 3
2005-06-30 12:09:00 [MgmSrvr] WARNING  -- Node 4: Node 3 missed heartbeat 2
2005-06-30 12:09:01 [MgmSrvr] WARNING  -- Node 2: Node 7 missed heartbeat 4
2005-06-30 12:09:01 [MgmSrvr] ALERT    -- Node 2: Node 7 declared dead due to missed heartbeat
2005-06-30 12:09:01 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 closed
2005-06-30 12:09:01 [MgmSrvr] WARNING  -- Node 4: Node 7 missed heartbeat 4
2005-06-30 12:09:01 [MgmSrvr] ALERT    -- Node 4: Node 7 declared dead due to missed heartbeat
2005-06-30 12:09:01 [MgmSrvr] INFO     -- Node 4: Communication to Node 7 closed
2005-06-30 12:09:01 [MgmSrvr] ALERT    -- Node 2: Node 7 Disconnected
2005-06-30 12:09:01 [MgmSrvr] ALERT    -- Node 4: Node 7 Disconnected

Later, when the mysqld tries to rejoin:

2005-06-30 12:09:04 [MgmSrvr] INFO     -- Node 4: Communication to Node 7 opened
2005-06-30 12:09:05 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 opened
2005-06-30 12:09:06 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
2005-06-30 12:09:06 [MgmSrvr] INFO     -- Node 4: Communication to Node 5 opened
2005-06-30 12:09:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2005-06-30 12:09:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 opened
2005-06-30 12:11:57 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. Connection from ip 192.168.1.5. Returned error string "No free node id found for ndbd(NDB)."
2005-06-30 12:11:57 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 3 5 7 not connected but reserved
2005-06-30 12:12:00 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. Connection from ip 192.168.1.5. Returned error string "No free node id found for ndbd(NDB)."
2005-06-30 12:12:00 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 3 5 7 not connected but reserved
2005-06-30 12:12:03 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. Connection from ip 192.168.1.5. Returned error string "No free node id found for ndbd(NDB)."
2005-06-30 12:12:03 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 3 5 7 not connected but reserved
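
The "not connected but reserved" lines above are what identify the stuck state. As a rough sketch (the function name and regular expression are illustrative assumptions, not part of any MySQL tool), such lines could be pulled out of the management server log like this:

```python
import re

# Matches the state line format seen in the log excerpt above.
# The pattern is an assumption based only on those sample lines.
RESERVED_RE = re.compile(r"node id's\s+([\d ]+?)\s+not connected but reserved")


def stale_reserved_ids(log_lines):
    """Return the set of node ids reported as reserved but not connected."""
    stale = set()
    for line in log_lines:
        match = RESERVED_RE.search(line)
        if match:
            stale.update(int(tok) for tok in match.group(1).split())
    return stale


sample = [
    '2005-06-30 12:11:57 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. '
    'Connection from ip 192.168.1.5. Returned error string '
    '"No free node id found for ndbd(NDB)."',
    "2005-06-30 12:11:57 [MgmSrvr] INFO     -- Mgmt server state: "
    "node id's  1 3 5 7 not connected but reserved",
]
print(sorted(stale_reserved_ids(sample)))
```

A non-empty result while nodes fail to get a node id would suggest the stale reservations described in this report.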

After a PURGE STALE SESSIONS it works again:

2005-06-30 12:13:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 7 freed, m_reserved_nodes 000000000000017e.
2005-06-30 12:13:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 000000000000015e.
2005-06-30 12:13:27 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000156.
2005-06-30 12:13:45 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.168.1.5, m_reserved_nodes 000000000000015e.
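
The m_reserved_nodes values in these lines read like a hexadecimal bitmask. Assuming bit N stands for node id N (an inference from the log, where freeing nodeid 5 clears 0x20 and freeing nodeid 3 clears 0x08, not something confirmed by the server source), a small helper can decode them:

```python
def reserved_node_ids(mask_hex):
    """Decode an m_reserved_nodes hex string into a list of node ids.

    Assumes bit N of the mask corresponds to node id N, which matches
    the freed/reserved transitions in the log but is otherwise a guess.
    """
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Values copied from the log lines in this report.
for value in ("00000000000000fe", "000000000000017e",
              "000000000000015e", "0000000000000156"):
    print(value, reserved_node_ids(value))
```

Under that assumption, the mask after the purge still has ids 1, 2, 4, 6 and 8 reserved, and the last log line sets bit 3 again when 192.168.1.5 reconnects.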

Suggested fix:
Free the reserved connections once the nodes are declared dead.
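
To illustrate the idea, here is a toy model of a node id reservation table. It is not the ndb_mgmd implementation; the class and method names are hypothetical and exist only to show the suggested behaviour of releasing a reservation when a node is declared dead.

```python
class NodeIdTable:
    """Toy reservation table (hypothetical, for illustration only)."""

    def __init__(self, node_ids):
        self.free = set(node_ids)
        self.reserved = {}  # node id -> ip that holds the reservation

    def allocate(self, ip):
        """Reserve the lowest free node id for the given ip."""
        if not self.free:
            raise RuntimeError("No free node id found")
        node_id = min(self.free)
        self.free.remove(node_id)
        self.reserved[node_id] = ip
        return node_id

    def declare_dead(self, node_id):
        """Suggested fix: drop the reservation as soon as the node is dead."""
        if self.reserved.pop(node_id, None) is not None:
            self.free.add(node_id)


table = NodeIdTable([7])
first = table.allocate("192.168.1.5")   # node 7 is now reserved
table.declare_dead(first)               # heartbeat failure frees it
second = table.allocate("192.168.1.5")  # the rebooted node can rejoin
print(first, second)
```

Without the declare_dead step, the second allocate call would fail in the same way as the "No free node id found" lines in the log.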

I see the same with 4.1.12; 5.0 works fine, though.

A PURGE STALE SESSIONS helps, but the restarted nodes should be able to rejoin the cluster without manual interaction.

Hi...

There is quite a big effort to remove the "purge stale" problem.

But there is a workaround: use specified node ids.

Start ndb_mgmd with --no-nodeid-checks
and start ndbd with --nodeid=X

/Jonas

Maybe this will be fixed in 5.1; not sure.