Bug #42450 | ndbmtd crash with error 2341 Internal program error | ||
---|---|---|---|
Submitted: | 29 Jan 2009 13:52 | Modified: | 31 Jan 2009 23:28 |
Reporter: | Guido Ostkamp | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 6.4.2 | OS: | Solaris |
Assigned to: | | CPU Architecture: | Any |
[29 Jan 2009 13:52]
Guido Ostkamp
[29 Jan 2009 14:00]
Jonas Oreland
Can you please upload the config and logs/tracefiles: cluster log, error logs, tracefiles, config.ini? /Jonas
[29 Jan 2009 14:06]
Bernd Ocklin
Hi, thanks for your bug report. Can you send us all log, trace and config files, ideally retrieved and packed with the ndb_error_reporter tool? Out of curiosity / for your information: --with-ndbmtd is not really a supported option and will simply be ignored. We build both ndbd and ndbmtd automatically. Bernd
[29 Jan 2009 15:14]
Guido Ostkamp
The requested debug data has been uploaded as bug-data-42450.tar.gz. Please note that ndb_error_reporter failed to collect the data. It was called with 'ndb_error_reporter config.ini root' on the management console and seemed to be copying the files, but then ended up creating a 0-byte tarball (14 bytes in bzipped form). Regards Guido Ostkamp
[29 Jan 2009 15:21]
Guido Ostkamp
Just to avoid confusion: When I talk of node 1 in error description this means NDB node 2; and node 2 means NDB node 3. NDB node 1 is cluster console with NDB management daemon.
[29 Jan 2009 17:41]
Jonas Oreland
Looks like a simple fix. LocalProxy.cpp tries to keep track of alive nodes, but fails to handle the case where NODE_FAILREP is sent without a preceding INCL_NODEREQ (which can happen if a node fails shortly after sp1). The reason LocalProxy.cpp tries to maintain an alive-node list is unknown (maybe LCP_FRAG_REP), but regardless, it should not be linked to NODE_FAILREP in this way. It would be better to change LCP_FRAG_REP, e.g. send it to the local DIH, which can proxy it to the rest of the cluster.
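To illustrate the failure mode described above, here is a minimal sketch of an alive-node list that tolerates a failure report for a node that was never included. This is not the LocalProxy code and not the committed patch: `AliveNodeTracker`, its method names and the `kMaxNodes` limit are hypothetical and exist only for illustration. The two handlers stand in for the INCL_NODEREQ and NODE_FAILREP signals named in the comment.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a per-block alive-node list; not the NDB source.
constexpr std::size_t kMaxNodes = 256;  // illustrative limit (assumption)

class AliveNodeTracker {
public:
    // Analogue of receiving INCL_NODEREQ: the node joins the alive list.
    void onNodeIncluded(std::size_t nodeId) {
        assert(nodeId < kMaxNodes);
        alive_.set(nodeId);
    }

    // Analogue of receiving NODE_FAILREP. Returns true if the node had
    // been included before, false if the failure report arrived without
    // a preceding inclusion. Neither case is treated as an internal error.
    bool onNodeFailed(std::size_t nodeId) {
        assert(nodeId < kMaxNodes);
        const bool wasAlive = alive_.test(nodeId);
        alive_.reset(nodeId);  // clearing an unset bit is a harmless no-op
        return wasAlive;
    }

    bool isAlive(std::size_t nodeId) const {
        assert(nodeId < kMaxNodes);
        return alive_.test(nodeId);
    }

    std::size_t aliveCount() const { return alive_.count(); }

private:
    std::bitset<kMaxNodes> alive_;
};
```

With this shape, a caller can log or ignore a `false` return from `onNodeFailed` instead of aborting, which is the kind of assumption the comment says should not be hard-wired into failure handling.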
[30 Jan 2009 13:07]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/64624
3240 Jonas Oreland 2009-01-30
ndb - bug#42450 - fix incorrect assumptions about NODE_FAILREP/INCL_NODEREQ and rewrite NF_COMPLETE handling in LocalProxy.
Note: more work is needed cause testNodeRestart -n MNF fails consistently in mt-lqh
[30 Jan 2009 14:47]
Jonas Oreland
pushed to 6.4.3
[30 Jan 2009 15:16]
Bugs System
Pushed into 5.1.31-ndb-6.4.3 (revid:jonas@mysql.com-20090130143059-nbhc491rq3v6mdph) (version source revid:jonas@mysql.com-20090130130748-ft8bghfjufj3lp9q) (merge vers: 5.1.31-ndb-6.4.3) (pib:6)
[31 Jan 2009 23:28]
Jon Stephens
Documented bugfix in the NDB-6.4.3 changelog as follows: When using ndbmtd for all data nodes, repeated failures of one data node during DML operations caused other data nodes to fail. Also noted in docs that no special configure or compiler options are required to build ndbmtd binaries.