Bug #49230 ndbmtd forced to restart while doing a GCP
Submitted: 30 Nov 2009 19:30 Modified: 1 Jan 2010 9:27
Reporter: Robert Klikics Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:mysql-5.1-telco-7.0 OS:Linux (Debian 5.0)
Assigned to: Andrew Hutchings CPU Architecture:Any
Tags: GCP, ndbmtd, telco-7.0.9b

[30 Nov 2009 19:30] Robert Klikics
Description:
One of our ndb nodes (which was the master before the forced restart) has restarted, seemingly after a GCP stop. We've found the following error messages in the log files:

2009-11-30 19:24:41 [MgmtSrvr] WARNING  -- Node 2: Detected GCP stop(3)...sending kill to [SignalCounter: m_count=1 0000000000000008]
2009-11-30 19:24:45 [MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 2
2009-11-30 19:24:46 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

The error occurred after a configuration update (LockAllPagesInMemory from 0 to 1, see bug report http://bugs.mysql.com/bug.php?id=49201), but I don't know whether the error is related to the configuration change.

Another question concerning the error logs on the ndb nodes: are they touched when a failure occurs? The strange thing is that the last entry is from June of this year, but the mtime and ctime timestamps are from the time the error occurred. This is a little confusing.

An ndb_error_reporter report, taken after the node restart, is attached here:

http://85.25.144.101/files/ndb_error_report_20091130195615.tar.bz2

Sincerely,
Martin P.

How to repeat:
No idea at this time. The error log asked for a bug report, so here it is.
[31 Dec 2009 18:38] Andrew Hutchings
Hello Robert,

This looks like a normal GCP stop error. Please check out the information on preventing GCP stops, entitled "Disk Data and GCP Stop errors", at the bottom of:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html

Also note that your cluster is complaining about send buffer problems, and in fact at least one crash about 10 days before the GCP stop was due to this. Please consider increasing SendBufferMemory.
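As a sketch of the kind of change being suggested (the values below are illustrative examples, not tuned recommendations for this cluster), SendBufferMemory is set per connection in the [tcp default] section of config.ini, and the GCP-related timeouts live in [ndbd default]:

```ini
# Hypothetical excerpt from a cluster config.ini -- values are
# illustrative only, not recommendations for this installation.

[ndbd default]
# Milliseconds between global checkpoints, and how long a GCP may
# take before the node is killed with a "GCP stop" error.
TimeBetweenGlobalCheckpoints = 2000
TimeBetweenEpochsTimeout = 4000

[tcp default]
# Larger per-connection send buffer, to address the send buffer
# warnings seen in the logs (the default is 2M).
SendBufferMemory = 8M
```

A rolling restart of the data nodes is needed for such changes to take effect.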
[1 Jan 2010 8:49] Robert Klikics
Hi Andrew,

thanks for your reply and the hint. I've read the documentation, but it mostly covers GCP stop errors associated with Disk Data tables, which we're NOT using.

We've also spoken to the Percona guys, who said that the send buffer values are all right now.

Since we switched back to ndbd instead of ndbmtd, the cluster has been running "stable" for about 1.5 months. I know that multithreaded applications are hard to code, but IMHO ndbmtd is not ready for production use.

Sincerely,
Martin P.
[1 Jan 2010 9:16] Andrew Hutchings
Most of that is still valid without disk tables, but unfortunately tuning your cluster is beyond the scope of a bug report.

We do have customers running very large clusters using ndbmtd, and it is very stable for them; we are sorry this has not been your experience.
[1 Jan 2010 9:27] Robert Klikics
Hi Andrew,

this is not meant as a personal attack against you or the programmers. But how did those customers get it running stably?

We've had a two-day intensive training with the Percona guys; a MySQL engineer from Sun was here once, looked over the configuration, and said it was OK; and we've tried so many configurations. But the longest ndbmtd ran without a failure was about two weeks.

So please tell me, is there a hidden switch or something else :-)?

BTW Happy New Year.

Sincerely,
Martin P.
[1 Jan 2010 9:59] Andrew Hutchings
Hello Martin,

In most cases it is a tuning effort, which depends greatly on the application and the hardware used.

Unfortunately this is beyond the scope of a bug report, but I would be happy to discuss it if you contact me directly or use the cluster mailing list / forum; alternatively, our Professional Services team will be able to help you out.

Happy new year to you too! :)