MySQL Bugs: #51108: mysqld hangs after a node has killed during node restart after a GCP stop error

Bug #51108: mysqld hangs after a node has killed during node restart after a GCP stop error
Submitted: 11 Feb 2010 15:14
Modified: 9 Jan 2015 16:40
Reporter: Robert Klikics
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: mysql-5.1-telco-7.0
OS: Linux (Debian 5.0)
Assigned to: Assigned Account
CPU Architecture: Any
Tags: node killed gcp stop error mysqld hang, telco-7.0.9b

Description:
Hi,

After a GCP stop error, one of our NDB data nodes was killed by the master:

2010-02-11 14:40:37 [MgmtSrvr] WARNING  -- Node 3: Detected GCP stop(3)...sending kill to [SignalCounter: m_count=1 0000000000000020]
2010-02-11 14:40:39 [MgmtSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

But during this restart, another failure occurred:

2010-02-11 14:40:38 [ndbd] INFO     -- Node 5 killed this node because GCP stop was detected
2010-02-11 14:40:38 [ndbd] INFO     -- NDBCNTR (Line: 270) 0x00000008
2010-02-11 14:40:38 [ndbd] INFO     -- Error handler restarting system
2010-02-11 14:40:38 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-02-11 14:40:39 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-02-11 14:40:39 [ndbd] INFO     -- Ndb has terminated (pid 2705) restarting
2010-02-11 14:43:41 [ndbd] INFO     -- Unable to alloc node id
2010-02-11 14:43:41 [ndbd] INFO     -- Error : Could not alloc node id at 192.168.10.100 port 1186: Id 5 already allocated by another node.

It seems that another node has the same node id?!

But the real cause was that one of the mysqld API clients disconnected after the GCP stop error and could not reconnect to the cluster (same failure, with the id already allocated). The next strange thing was that while the NDB node was restarting, none of the mysqld API clients could connect to the cluster data.

You can find an ndb_error_reporter report here:
http://85.25.144.101/files/ndb_error_report_20100211155216.tar.bz2

Sincerely,
R. Klikics

How to repeat:
No idea.

Hi,

One known problem is that when a node fails
while a local checkpoint is ongoing,
it may not be able to reconnect to the cluster until some specific
part of the checkpoint has completed.

This manifests itself when the data node restarts:
it gets "Could not alloc node id". There is, by the way, a bug/feature request
open to improve that message: http://bugs.mysql.com/bug.php?id=52253
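
To check whether a log shows this symptom, something like the following could be used. It is only a rough sketch: the line layout is assumed from the excerpts quoted in this report, and the names LOG_LINE and find_alloc_failures are made up for illustration, not part of any MySQL Cluster tool.

```python
import re

# Assumed layout, taken from the log lines quoted in this report:
# "YYYY-MM-DD HH:MM:SS [source] SEVERITY -- message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>[^\]]+)\]\s+(?P<severity>\w+)\s+-- (?P<message>.*)$"
)

# Matches the "already allocated" message and captures the node id.
ALLOC_FAILURE = re.compile(
    r"Could not alloc node id .*: Id (\d+) already allocated"
)


def find_alloc_failures(lines):
    """Return (timestamp, node_id) for each 'already allocated' message."""
    hits = []
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if not parsed:
            continue  # skip lines that do not follow the assumed layout
        failure = ALLOC_FAILURE.search(parsed.group("message"))
        if failure:
            hits.append((parsed.group("ts"), int(failure.group(1))))
    return hits
```

Feeding it the lines of the ndbd log (for example from open(path)) gives one entry per failed node id allocation, which can then be compared against the time of the node failure and the checkpoint progress.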

As for the mysqld problems,
I can find nothing in the cluster logs etc.;
mysqld.err is missing, so any such problems cannot be analyzed.

Setting this to "Need Feedback".

/Jonas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I got the same error too: the ndbd process is not running on one data node. Data is still being updated on the failed node, and it is updated on the second node. But I don't know what happens if I restart the failed node. Please help me with how to resolve this.

Thanks,
Umapathi
umapathi.b@gmail.com