Bug #14051 NDBD randomly crashes with a pointer too large
Submitted: 15 Oct 2005 20:36 Modified: 26 Oct 2005 17:17
Reporter: Timothy Pearson
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 4.1.14 OS: Linux (Fedora Core 4)
Assigned to: CPU Architecture: Any

[15 Oct 2005 20:36] Timothy Pearson
Description:
The NDBD process randomly crashes, disconnecting the storage node from the cluster.  The crash spreads across the entire cluster, crashing all of the NDBD processes on all the servers.

The log output:

Date/Time: Friday 14 October 2005 - 08:45:03
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupIndex.cpp
Object of reference: DBTUP (Line: 136) 0x0000000a
ProgramName: ndbd
ProcessID: 20573
TraceFile: /tmp/mysql-cluster/ndb_2_trace.log.1
Version 4.1.14
***EOM***

There are several of these in the log.
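
For anyone triaging a similar report, a minimal sketch of how entries like the one above could be tallied. It assumes every entry is a block of "Key: value" lines ending with ***EOM***, as in the sample. The function names parse_entries and count_by_fault are hypothetical, and the sample entry is repeated three times only to stand in for "several" entries, not because the real log held exactly three.

```python
from collections import Counter

# One entry copied from the report above; a real run would read the
# node's error log file instead (its path is not given in the report).
SAMPLE_ENTRY = """\
Date/Time: Friday 14 October 2005 - 08:45:03
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupIndex.cpp
Object of reference: DBTUP (Line: 136) 0x0000000a
ProgramName: ndbd
ProcessID: 20573
TraceFile: /tmp/mysql-cluster/ndb_2_trace.log.1
Version 4.1.14
***EOM***
"""


def parse_entries(text):
    """Split log text into entries terminated by ***EOM***; return a list of dicts."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":
            if current:
                entries.append(current)
            current = {}
        elif ": " in line:
            # Split on the first ": " only, so values may contain colons.
            key, value = line.split(": ", 1)
            current[key] = value
        elif line.startswith("Version "):
            # The version line has no colon in the sample entry.
            current["Version"] = line[len("Version "):]
    return entries


def count_by_fault(entries):
    """Count entries per (Fault ID, Message) pair."""
    return Counter((e.get("Fault ID"), e.get("Message")) for e in entries)


# Illustration only: repeat the sample to simulate several entries.
entries = parse_entries(SAMPLE_ENTRY * 3)
for (fault_id, message), n in count_by_fault(entries).items():
    print(f"{n} x fault {fault_id}: {message}")
```

Grouping by fault ID and message makes it easy to see whether every crash is the same "Pointer too large" failure or whether several distinct errors are mixed together.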

How to repeat:
Set up a 4.1.14 cluster with a couple of storage nodes and let it run for a day or two with applications adding and removing data now and then.

Suggested fix:
Make it so that it doesn't crash :-)
[16 Oct 2005 6:08] Timothy Pearson
I have upgraded this to S1, as now I'm losing data because of it.

This bug has made it so that I can't even use 4.1.14... :-(

I'm going to try a couple of other versions here to see if they'll work better.
[16 Oct 2005 6:38] Jonas Oreland
The error log indicates a problem in index handling.
Could you
1) supply queries and some test data.
2) See if you can find a somewhat reproducible way of
  getting the bug using test data from 1)
[16 Oct 2005 17:39] Timothy Pearson
> The error log indicates a problem in index handling.
> Could you
> 1) supply queries and some test data.
Unfortunately, I do not have access to the queries being run, as the only "users" of the databases are the applications Bacula (www.bacula.org) and VPopMail.
> 2) See if you can find a somewhat reproducible way of
>   getting the bug using test data from 1)
Not sure what you mean by "using test data from 1)".  Could you please clarify?

As of right now, the easiest and fastest way to crash the cluster is simply to let it run!  Even with NO data access (read or write), and only one storage node online, it still crashes within 12 hours of being brought online with error 4009.

If you need any other files, such as my.cnf etc, I will be happy to post them privately.
[26 Oct 2005 17:17] Timothy Pearson
This bug appears to be fixed in 5.0.15.