Bug #6328 Cluster API node crashes, and cannot reenter cluster without restarting MGM node
Submitted: 29 Oct 2004 16:32 Modified: 10 Nov 2004 13:55
Reporter: Russell Glaue
Status: Won't fix
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: 4.1.7    OS: Linux (Red Hat Enterprise 3)
Assigned to: Assigned Account    CPU Architecture: Any

[29 Oct 2004 16:32] Russell Glaue
I am testing MySQL 4.1.7 with ndbcluster on 4 test machines we had lying around. Pretty stable, but not 100%.

4 test computers (as indicated below). Computer 4 crashed unexpectedly. I restarted the computer and started the mysql API, expecting it to reenter the cluster. But it did not. It gave an error about not being able to acquire a node id (message below) and exited. After stopping and starting the MGM node, then starting the mysql API on computer 4 again, the computer 4 mysql API was able to connect to the cluster.
Details are as follows:

So I was running the cluster as follows, all computers running Red Hat Linux Enterprise Server 3 with all updates:

c1 mgm api
c2 ndb api
c3 ndb api
c4 api

Okay, so computer c4 unexpectedly froze... nice.
Well, I pushed the power button to reboot the system and had a scan run on the system disks. The system came up fine, and I then went to start the MySQL API node.
Here is the error message I received:

041028 09:40:07  mysqld started
041028  9:40:07  InnoDB: Started; log sequence number 0 43728
Configuration error: Could not alloc node id at server.domain.com port 2200: No free node id found for mysqld(API).
041028  9:40:23 [ERROR] Can't init databases
041028  9:40:23 [ERROR] Aborting

I started up the ndb_mgm console on c1 and saw that the node slot was available; it said "not connected, accepting connect from any host" for the api node.
I stopped and started the c4 mysql api several times with no luck (got the same error each time).
I finally went to c1, did a kill {mgm.pid}, and then started the mgm node up again. Then I started up the mysql api node on c4.
Only after that exact moment was the c4 api node added to the cluster.
Everything is just fine now.

Today, 1 day later, c4 crashed again. When I recovered c4 and started the c4 mysql API, I got the same error as yesterday: "Could not alloc node id at server.domain.com port 2200: No free node id found for mysqld(API)." On the c1 computer I performed `echo "1 stop" | ndb_mgm` and then started the MGM node again. Going back to c4 to start the mysql API node, it started up successfully without any errors and connected to the cluster just fine.

Since I was able to reproduce the problem two times over two different days, I would say this is a bug. Perhaps MGM is not releasing the node as available since the mysql API node disconnected uncleanly.


How to repeat:
Set up a mysql cluster as follows:

c1:  mgm  api
c2:  ndb  api
c3:  ndb  api
c4:  api
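For reference, a config.ini for this layout might look roughly like the following. This is a sketch of an NDB-era cluster config, not the reporter's actual file; hostnames, paths, and the replica count are placeholder assumptions.

```ini
# Sketch of a config.ini for the c1-c4 layout above.
# Hostnames and values are placeholders, not the reporter's actual config.
[NDBD DEFAULT]
NoOfReplicas=2

[NDB_MGMD]
HostName=c1.domain.com        # c1: management node

[NDBD]
HostName=c2.domain.com        # c2: data node

[NDBD]
HostName=c3.domain.com        # c3: data node

# One API (mysqld) slot per computer; leaving HostName unset
# is what produces "accepting connect from any host" in ndb_mgm
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
```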

When all nodes in the cluster are connected together happily, pull the power plug on c4.

Restart and restore the c4 computer.

Start the c4 mysql API node.
Notice the error, and that you cannot connect the c4 API node to the cluster.

Stop the MGM node on c1.
Start the MGM node on c1.

Start the c4 mysql API node.
Notice the API node starts correctly with no errors and connects to the mysql cluster.
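The workaround steps above can be sketched as shell commands; the node id, config path, and use of mysqld_safe are assumptions, with the `ndb_mgm` stop command taken from the report:

```shell
# On c1: stop the management server (node id 1 in this setup),
# then start it again (config path is a placeholder)
echo "1 stop" | ndb_mgm
ndb_mgmd -f /var/lib/mysql-cluster/config.ini

# On c4: start the mysql API node again -- it should now be
# granted a free node id and join the cluster
mysqld_safe &
```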

Suggested fix:
The ndb_mgm console will show the API node slot as available, but the management server will not release the node id as free so the mysql node on the crashed computer can reconnect.
It tells the node that no node id is available.

So this needs to be fixed, presumably in the ndb_mgmd service, to clean up a node connection after a computer crashes and drops its node connection uncleanly.
[29 Oct 2004 16:33] Russell Glaue
Cluster API node crashes, and cannot reenter cluster without restarting MGM node
[10 Nov 2004 13:55] Tomas Ulin
We believe the problem is that when a node dies in this manner, some sockets are not closed. This leads the management server to believe that the node is still using that node id.

We don't have a great solution to this at this point (for 4.1).

So the options that are there right now are as stated before:

1. Restart the management server to reset this (erroneous) state.
2. Once you've figured out your nodeids and decided on a config, run the management server with --no-nodeid-checks and specify nodeids in the connectstrings, and avoid the issue as a whole.
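For option 2, a minimal sketch of what this could look like. The --no-nodeid-checks flag and port 2200 come from this report; hostnames, node ids, and file paths are illustrative placeholders:

```shell
# On c1: start the management server without node id checks
# (config path and hostname are placeholders)
ndb_mgmd --no-nodeid-checks -f /var/lib/mysql-cluster/config.ini

# On c4: pin the API node to a fixed node id via the connectstring,
# e.g. in my.cnf:
#   [mysqld]
#   ndb-connectstring = "nodeid=4,c1.domain.com:2200"
mysqld_safe --ndb-connectstring="nodeid=4,c1.domain.com:2200" &
```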


3. In 4.1.8 we will also offer a new command in the management server, PURGE STALE SESSIONS, which fixes the issue without restarting the management server.

ndb_mgm> purge stale sessions
Purged sessions with node id's: 1
ndb_mgm> purge stale sessions
No sessions purged

For 5.0 we will redesign the protocol for reserving nodeids to avoid this situation (we want to avoid protocol changes in 4.1).
[13 Mar 2014 13:33] Omer Barnir
This bug is not scheduled to be fixed at this time.