Bug #42450 ndbmtd crash with error 2341 Internal program error
Submitted: 29 Jan 2009 13:52
Modified: 31 Jan 2009 23:28
Reporter: Guido Ostkamp
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 6.4.2
OS: Solaris
Assigned to:
CPU Architecture: Any

[29 Jan 2009 13:52] Guido Ostkamp
Description:
Hello,

during tests with MySQL Cluster, ndbmtd crashes with:

2009-01-29 14:23:56 [ndbd] INFO     -- LocalProxy.cpp
2009-01-29 14:23:56 [ndbd] INFO     -- DBLQH (Line: 558) 0x0000000a
2009-01-29 14:23:56 [ndbd] INFO     -- Error handler shutting down system
2009-01-29 14:23:56 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-01-29 14:24:01 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(
Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

We are using revid jonathan.miller@sun.com-20090128150430-4gcp8g81gky8bgkg dated 2009-09-21 from repo mysql-5.1-telco-6.4 compiled with Solaris Workshop 12 as follows:

CC=cc CXX=CC CFLAGS="-xO5 -fast -g -mt -m64" CXXFLAGS="-xO5 -fast -g -mt -m64" ./configure --prefix=<somepath> --with-plugins=all --without-docs --without-man --with-ndbmtd

Regards

Guido Ostkamp

How to repeat:
The test is repeatable as follows:

* Run 2 node system with ndbmtd
* kill -9 ndbmtd on node 2
* Run insert/delete requests in a loop on node 1
* While executing requests, restart node 2
* During restart again kill ndbmtd on node 2
* Now ndbmtd crashes on node 1

Though it crashes, we do not get any coredumps.
[29 Jan 2009 14:00] Jonas Oreland
can you please upload config and logs/tracefiles
- cluster log
- error-logs
- tracefiles
- config.ini

/Jonas
[29 Jan 2009 14:06] Bernd Ocklin
Hi,

thanks for your bug report. Can you send us all log, trace, and config files, ideally retrieved and packed with the ndb_error_reporter tool?

Out of curiosity / for your information: --with-ndbmtd is not really a supported option and will simply be ignored. We build both ndbd and ndbmtd automatically.

Bernd
[29 Jan 2009 15:14] Guido Ostkamp
Requested debug data has been uploaded to bug-data-42450.tar.gz.

Please note that ndb_error_reporter failed to collect that data.
It was called with 'ndb_error_reporter config.ini root' on the
management console and appeared to copy the files, but then
produced a 0-byte tarball (14 bytes in bzipped form).

Regards

Guido Ostkamp
[29 Jan 2009 15:21] Guido Ostkamp
Just to avoid confusion:

When I refer to node 1 in the error description, I mean NDB node 2; node 2 means NDB node 3. NDB node 1 is the cluster console running the NDB management daemon.
[29 Jan 2009 17:41] Jonas Oreland
looks like a simple fix.
LocalProxy.cpp tries to keep track of alive nodes,
but fails to handle the case where NODE_FAILREP is sent without a preceding INCL_NODEREQ
(which can happen if a node fails shortly after sp1).

The reason LocalProxy.cpp tries to maintain an alive-node list is unknown
(maybe LCP_FRAG_REP).

Regardless, it should not be linked to NODE_FAILREP this way
(it would be better to change LCP_FRAG_REP, e.g. send it to the local DIH, which
 can proxy it to the rest of the cluster).
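
For readers unfamiliar with the signal ordering, here is a minimal toy model of the failure mode described above. It is not the NDB kernel code and not the committed patch: the class and method names are hypothetical, and only the signal names NODE_FAILREP and INCL_NODEREQ come from this report. It contrasts a tracker that assumes every failed node was included first with one that tolerates a failure report for a node it never saw.

```python
class AliveNodeTracker:
    """Toy model of an alive-node list (hypothetical, not NDB code)."""

    def __init__(self):
        self.alive = set()

    def on_incl_nodereq(self, node_id):
        # A starting node is being included: record it as alive.
        self.alive.add(node_id)

    def on_node_failrep_strict(self, failed_nodes):
        # Strict variant: assumes INCL_NODEREQ always precedes NODE_FAILREP.
        # Raising here stands in for a failed internal consistency check.
        for node_id in failed_nodes:
            if node_id not in self.alive:
                raise RuntimeError(
                    f"node {node_id} reported failed but was never included"
                )
            self.alive.remove(node_id)

    def on_node_failrep_tolerant(self, failed_nodes):
        # Tolerant variant: set.discard() ignores nodes that were never
        # included, e.g. a node that died early in its restart.
        for node_id in failed_nodes:
            self.alive.discard(node_id)


# Sequence loosely following the report: node 2 stays up, node 3 is
# killed during its restart, before it was ever included.
tracker = AliveNodeTracker()
tracker.on_incl_nodereq(2)
tracker.on_node_failrep_tolerant([3])  # no error; node 2 remains alive
```

With the strict variant, the same sequence raises on the failure report for node 3, which is roughly the shape of the crash seen on the surviving node.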
[30 Jan 2009 13:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64624

3240 Jonas Oreland	2009-01-30
      ndb - bug#42450 - fix incorrect assumptions about NODE_FAILREP/INCL_NODEREQ and rewrite NF_COMPLETE handling in LocalProxy. Note: more work is needed cause testNodeRestart -n MNF fails consistently in mt-lqh
[30 Jan 2009 14:47] Jonas Oreland
pushed to 6.4.3
[30 Jan 2009 15:16] Bugs System
Pushed into 5.1.31-ndb-6.4.3 (revid:jonas@mysql.com-20090130143059-nbhc491rq3v6mdph) (version source revid:jonas@mysql.com-20090130130748-ft8bghfjufj3lp9q) (merge vers: 5.1.31-ndb-6.4.3) (pib:6)
[31 Jan 2009 23:28] Jon Stephens
Documented bugfix in the NDB-6.4.3 changelog as follows:

        When using ndbmtd for all data nodes, repeated failures of one
        data node during DML operations caused other data nodes to fail.

Also noted in docs that no special configure or compiler options are required to build ndbmtd binaries.