MySQL Bugs: #42450: ndbmtd crash with error 2341 Internal program error

Bug #42450	ndbmtd crash with error 2341 Internal program error
Submitted:	29 Jan 2009 13:52	Modified:	31 Jan 2009 23:28
Reporter:	Guido Ostkamp
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	6.4.2	OS:	Solaris
Assigned to:		CPU Architecture:	Any

Description:
Hello,

during tests with MySQL Cluster, ndbmtd crashes with:

2009-01-29 14:23:56 [ndbd] INFO     -- LocalProxy.cpp
2009-01-29 14:23:56 [ndbd] INFO     -- DBLQH (Line: 558) 0x0000000a
2009-01-29 14:23:56 [ndbd] INFO     -- Error handler shutting down system
2009-01-29 14:23:56 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-01-29 14:24:01 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(
Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

We are using revid jonathan.miller@sun.com-20090128150430-4gcp8g81gky8bgkg dated 2009-09-21 from repo mysql-5.1-telco-6.4 compiled with Solaris Workshop 12 as follows:

CC=cc CXX=CC CFLAGS="-xO5 -fast -g -mt -m64" CXXFLAGS="-xO5 -fast -g -mt -m64" ./configure --prefix=<somepath> --with-plugins=all --without-docs --without-man --with-ndbmtd

Regards

Guido Ostkamp

How to repeat:
The test is repeatable as follows:

* Run 2 node system with ndbmtd
* kill -9 ndbmtd on node 2
* Run insert/delete requests in a loop on node 1
* While executing requests, restart node 2
* During restart again kill ndbmtd on node 2
* Now ndbmtd crashes on node 1

Though it crashes, we do not get any coredumps.

Can you please upload the config and logs/tracefiles:
- cluster log
- error-logs
- tracefiles
- config.ini

/Jonas

Hi,

thanks for your bug report. Can you send us all log, trace and config files, ideally retrieved and packed with the ndb_error_reporter tool?

Out of curiosity / for your information: --with-ndbmtd is not really a supported option and will simply be ignored. We build both ndbd and ndbmtd automatically.

Bernd

Requested debug data has been uploaded to bug-data-42450.tar.gz.

Please note that ndb_error_reporter failed to collect that data.
It was called with 'ndb_error_reporter config.ini root' on the
management console and seemed to be copying the files, but then
ended up creating a 0 byte tarball (14 bytes in bzipped form).

Regards

Guido Ostkamp

Just to avoid confusion:

When I talk of node 1 in the error description, this means NDB node 2, and node 2 means NDB node 3. NDB node 1 is the cluster console running the NDB management daemon.

Looks like a simple fix.
LocalProxy.cpp tries to keep track of alive nodes,
but fails to handle the case where NODE_FAILREP is sent without a preceding INCL_NODEREQ
(which can happen if a node fails shortly after sp1).

The reason for LocalProxy.cpp to maintain an alive-node list is unknown
(maybe LCP_FRAG_REP).

Regardless, it should not be linked this way to NODE_FAILREP
(it would be better to change LCP_FRAG_REP, e.g. send it to the local DIH, which
 can proxy it to the rest of the cluster).
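
To illustrate the failure mode described above, here is a minimal toy model in Python. It is not the LocalProxy.cpp code and not the committed patch; the class and method names (StrictAliveTracker, TolerantAliveTracker, incl_nodereq, node_failrep) are hypothetical and only mirror the signal names mentioned in this report. The sketch assumes the crash comes from an assertion that a failed node must already be in the alive-node list.

```python
class StrictAliveTracker:
    """Hypothetical model: assumes every NODE_FAILREP follows an INCL_NODEREQ."""

    def __init__(self):
        self.alive = set()

    def incl_nodereq(self, node_id):
        # A node is added to the alive list when it is included.
        self.alive.add(node_id)

    def node_failrep(self, failed_nodes):
        for node_id in failed_nodes:
            if node_id not in self.alive:
                # Stands in for a failed ndbrequire (error 2341 in this report).
                raise RuntimeError(f"node {node_id} failed but was never included")
            self.alive.remove(node_id)


class TolerantAliveTracker(StrictAliveTracker):
    """Hypothetical model: a failure report for an unknown node is a no-op."""

    def node_failrep(self, failed_nodes):
        for node_id in failed_nodes:
            # discard() does nothing if the node was never included.
            self.alive.discard(node_id)


def restarting_node_fails_early(tracker):
    """Replay the sequence from this report: NDB node 3 dies during restart,
    before node 2 has processed INCL_NODEREQ for it."""
    tracker.incl_nodereq(2)    # the surviving data node
    tracker.node_failrep([3])  # node 3 fails before being included
    return tracker.alive
```

In this model, restarting_node_fails_early(StrictAliveTracker()) raises, while the same call with TolerantAliveTracker() leaves only node 2 in the alive list. Whether the real fix tolerates the unexpected report or removes the dependency entirely is not stated here; the commit message only says the assumptions were fixed and NF_COMPLETE handling was rewritten.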

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64624

3240 Jonas Oreland	2009-01-30
      ndb - bug#42450 - fix incorrect assumptions about NODE_FAILREP/INCL_NODEREQ and rewrite NF_COMPLETE handling in LocalProxy. Note: more work is needed cause testNodeRestart -n MNF fails consistently in mt-lqh

pushed to 6.4.3

Pushed into 5.1.31-ndb-6.4.3 (revid:jonas@mysql.com-20090130143059-nbhc491rq3v6mdph) (version source revid:jonas@mysql.com-20090130130748-ft8bghfjufj3lp9q) (merge vers: 5.1.31-ndb-6.4.3) (pib:6)

Documented bugfix in the NDB-6.4.3 changelog as follows:

        When using ndbmtd for all data nodes, repeated failures of one
        data node during DML operations caused other data nodes to fail.

Also noted in docs that no special configure or compiler options are required to build ndbmtd binaries.