Bug #24476 ndbd crashed because of failed ndbrequire in dbacc/DbaccMain.cpp.
Submitted: 21 Nov 2006 16:30 Modified: 6 Dec 2006 12:01
Reporter: Anatoly Pidruchny (Candidate Quality Contributor)
Status: Closed
Category:Server: Cluster Severity:S3 (Non-critical)
Version:5.1.12 OS:Linux (Linux x86_64)
Assigned to: Jonas Oreland Target Version:
Tags: Line 3194, DBACC, error 2341, ndbrequire

[21 Nov 2006 16:30] Anatoly Pidruchny
Description:
ndbd node crashed because of an internal program error (failed ndbrequire). The error log
(ndb_3_error.log) is copy-pasted below. Please see the attached out log, trace log,
cluster log and information from mysqlbug for more details.

The NDB Cluster was doing fine under our test load for several hours. Then it looked like
ndbd processes started to take high CPU and operations started to become very slow.
Finally node 3 crashed. The other data node 2 stayed up, CPU utilization lowered and DB
operations started to work fast again.

******************** ndb_3_error.log ************************************
Current byte-offset of file-pointer is: 568                       

Time: Tuesday 21 November 2006 - 09:08:00
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000a
Program: ndbd
Pid: 4986
Trace: /sm/mysql/data/ndb_3_trace.log.1
Version: Version 5.1.12 (beta)
***EOM***
******************** ndb_3_error.log ************************************

How to repeat:
It is unknown if the problem is easily reproducible. NDB Cluster has to be running under
load for several hours.
[21 Nov 2006 16:32] Anatoly Pidruchny
Changed category to Cluster.
[21 Nov 2006 16:38] Anatoly Pidruchny
Mysqlbug info

Attachment: mysqlbug.txt (text/plain), 3.46 KiB.

[21 Nov 2006 16:44] Anatoly Pidruchny
Error log of the ndbd node

Attachment: ndb_3_error.log (application/octet-stream, text), 568 bytes.

[21 Nov 2006 16:44] Anatoly Pidruchny
Out log of the ndbd node.

Attachment: ndb_3_out.log (application/octet-stream, text), 2.69 KiB.

[21 Nov 2006 16:45] Anatoly Pidruchny
gzipped trace log of the ndbd node

Attachment: ndb_3_trace.log.1.gz (application/x-gzip, text), 44.57 KiB.

[21 Nov 2006 16:49] Anatoly Pidruchny
Cluster log file

Attachment: ndb_1_cluster.log (application/octet-stream, text), 64.57 KiB.

[21 Nov 2006 16:51] Anatoly Pidruchny
One more thing. After the data node crashed, it was not restarted by the angel process.
May be it is OK, because I am not sure that the angel is supposed to restart it in the
case like this.
[21 Nov 2006 16:54] Jonas Oreland
Hi,

Could you test bug fix for http://bugs.mysql.com/bug.php?id=24039
if you do index scan + delete, this is bug is quite easy to trigger..

/Jonas
[21 Nov 2006 17:46] Anatoly Pidruchny
Hi, Jonas,

I do not think it is the same problem. But I will definitely try the bug fix for
http://bugs.mysql.com/bug.php?id=24039.

Thanks,
Anatoly.
[22 Nov 2006 14:07] Valeriy Kravchuk
Please, try, and inform about the results.
[22 Nov 2006 16:42] Anatoly Pidruchny
Yesterday I applied the patch for bug http://bugs.mysql.com/bug.php?id=24039 and then we
ran our testing over night. This morning one of the nodes crashed again with the same
error messages. Below is the contents of the ndb_3_error.log. Please let me know if you
need more information.

Time: Wednesday 22 November 2006 - 10:20:55
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000a
Program: ndbd
Pid: 7709
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.2
Version: Version 5.1.12 (beta)
***EOM***
                                                                                     
Time: Wednesday 22 November 2006 - 10:26:28
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 1154) 0x0000000a
Program: ndbd
Pid: 9822
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.3
Version: Version 5.1.12 (beta)
***EOM***
[22 Nov 2006 16:49] Jonas Oreland
Hi,

Thx for the help,

1) Can you supply sm/mysql/ndb_data/ndb_3_trace.log.3
   (error during RESTORE)

2) Could you test running a debug compiled ndb version
   (or compile ndb with "export NDB_EXTRA_FLAGS=-DACC_SAFE_QUEUE" 
    before compiling acc)

I'v never seen this error before,
basically it finds a uncommited record wo/ lock, which ofcourse is incorrect.

Could you also say which transactions are running when node fails.
(i.e for all the transactions that will get aborted on this error, list what they've done
sofar and what they were trying to at failure)

/Jonas
[22 Nov 2006 17:10] Anatoly Pidruchny
gzipped ndb_3_trace.log.3 (error during RESTORE)

Attachment: ndb_3_trace.log.3.gz (application/x-gzip, text), 30.42 KiB.

[22 Nov 2006 17:59] Anatoly Pidruchny
Jonas,

As you asked, I have just submitted the sm/mysql/ndb_data/ndb_3_trace.log.3 (error during
RESTORE).

Next thing, I will compile debug version of MySQL/NDB.

Do you want to know what transactions were running during the first crash (failed
ndbrequire in DBACC) or the second crash (failed ndbrequire in RESTORE)? 
We are using our own server application (two of them on two nodes) that access NDB
directly using NDB API. All the transactions are very small. Looking at our traces, the
following transactions/operations were running during the first crash:

============== First node =====================

1. 4 transactions doing:
myOperation->nextResult(true, FORCE_EXEC != 0);
Here myOperation is an NdbIndexScanOperation on a primary key of the table. The
readTuples line was:

myOperation->readTuples(NdbOperation::LM_CommittedRead,
                                NdbScanOperation::SF_OrderBy | (lowToHigh ? 0 :
NdbScanOperation::SF_Descending))

The transaction execute line was:

myTransaction->execute(NdbTransaction::NoCommit, NdbTransaction::AbortOnError,
FORCE_EXEC)

2. 2 transactions deleting a single record using a primary key access.
3. 6 transactions reading a single record using a primary key access.
4. 4 transactions updating a single record using a primary key access.

============== Second node =====================

1. 2 transactions doing nextResult, as described above.
2. 3 transactions deleting a single record using a primary key access.
3. 2 transactions updating a single record using interpretedUpdateTuple(). The
interpreted program does a simple check before the update and can choose not to update
the record if the condition is not satisfied.
4. 1 transaction reading a single record using a primary key access.
5. 1 transaction updating a single record using a primary key access.

During the second crash, no transactions failed in our application. But you can see what
kinds of transactions we are running all the time from the information above.
[23 Nov 2006 22:40] Jonas Oreland
Hi again,

looking forward to hearing if compiling with ACC_SAFE_QUEUE produces an "earlier" crash

the RESTORE crash is very very likely a "follup" bug from the ACC problem.

/Jonas
[24 Nov 2006 1:50] Anatoly Pidruchny
gzipped ndb_3_signal.log for the 11/23/06 crash

Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.

[24 Nov 2006 1:52] Anatoly Pidruchny
gzipped trace log for the 11/23/06 crash

Attachment: ndb_3_trace.log.4.gz (application/x-gzip, text), 73.00 KiB.

[24 Nov 2006 1:53] Anatoly Pidruchny
gzipped out log for the 11/23/06 crash

Attachment: ndb_3_out.log.gz (application/x-gzip, text), 182.80 KiB.

[24 Nov 2006 2:10] Anatoly Pidruchny
Hi, Jonas.

Yesterday I compiled the debug version of MySQL/NDB. I did not use the ACC_SAFE_QUEUE.
Please let me know if you need one more round of testing with ACC_SAFE_QUEUE. After
several hours of testing the ndbd process crashed again. I think this time it crashed
earlier, looks like when the "start backup" was initiated by a cron job at 1 a.m. Here is
what was written in the error log:

======================== ndb_3_error.log ===============================
Time: Thursday 23 November 2006 - 01:00:25
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error
message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: main.cpp
Program: ndbd
Pid: 5677
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.4
Version: Version 5.1.12 (beta)
***EOM***
======================== ndb_3_error.log ===============================

I have just submitted gzipped ndb_3_signal.log, ndb_3_trace.log.4 and ndb_3_out.log.
There is also a core file (core.5677) this time, but its size is 103624704. Please let me
know if you need it.

As I said, this time the process crashed when the database backup operation was started.
Another complication was that this data node (node 3) was the only data node at the time
when it crashed (01:00:29). We have a hardware issue with the second server. The server
with data node 2 hung earlier (00:54:13), but NDB handled that very well.

/Anatoly
[24 Nov 2006 14:27] Anatoly Pidruchny
Jonas,

I want to give you some of the information from GDB inspecting the core file of the
aborted ndbd process. Please see the submitted file gdb.out.log. Please let me know if
you need any additional info.

/Anatoly
[24 Nov 2006 14:28] Anatoly Pidruchny
Some info about the aborted ndbd process from GDB

Attachment: gdb.out.log (application/octet-stream, text), 8.62 KiB.

[24 Nov 2006 22:12] Jonas Oreland
hi,

that last crash is already fixed.
http://bugs.mysql.com/bug.php?id=23502 or 
http://bugs.mysql.com/bug.php?id=23499

can you apply those patches and rerun your test.

i'm very interested in finding acc-error
  (which is cause of subsequent restore-error)

and a debug build will set ACC_SAFE_QUEUE

/Jonas
[26 Nov 2006 5:36] Anatoly Pidruchny
gzipped out log for the 11/25/06 crash

Attachment: ndb_3_out.log.gz (application/x-gzip, text), 6.36 KiB.

[26 Nov 2006 5:37] Anatoly Pidruchny
gzipped signal log for the 11/25/06 crash

Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.

[26 Nov 2006 5:37] Anatoly Pidruchny
gzipped trace log for the 11/25/06 crash

Attachment: ndb_3_trace.log.5.gz (application/x-gzip, text), 52.57 KiB.

[26 Nov 2006 5:48] Anatoly Pidruchny
Hi again.

As requested, I have applied the patches for the bug
http://bugs.mysql.com/bug.php?id=23502, as well as the one requested earlier:
http://bugs.mysql.com/bug.php?id=24039. Then I compiled the debug version of MySQL/NDB,
installed, started data nodes with --initial and restored the database from the backup.
After several minutes the node 3 crashed again, even not under any serious load. Looks
like the debug version did not catch any problem earlier. The error log looks the same:

================ ndb_3_error.log ================
Current byte-offset of file-pointer is: 568                       

Time: Saturday 25 November 2006 - 18:26:04
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000e
Program: ndbd
Pid: 12012
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.5
Version: Version 5.1.12 (beta)
***EOM***
================ ndb_3_error.log ================

I have also submitted the gzipped signal, trace and out logs of the crashed data node 3.

/Anatoly
[26 Nov 2006 20:49] Jonas Oreland
Hi again...

Again thx for the help.

In the debug compiled version, can you start ndbd using "--core"
(and set ulimit)

Can you with gdb go to interesting frame
and look at value of op->m_op_bits
and also all of *op

/Jonas
[27 Nov 2006 17:06] Jonas Oreland
Any news ?
[27 Nov 2006 17:51] Anatoly Pidruchny
Jonas,

as you asked, I restarted ndbd nodes with --core. The test has been running from last
night, but ndbd nodes just do not crash this time. Hopefully I will be able to reproduce
the crash later today.

/Anatoly
[28 Nov 2006 16:35] Jonas Oreland
Any news ?
[28 Nov 2006 16:46] Anatoly Pidruchny
Jonas,

the test has been running for about 36 hours now and data nodes have not crashed! I want
to stop the test, restart the nodes with --initial, restore the database from the backup
and start the test again. The backup was done with version 5.1.11 and may be it has
something to do with the crash.

Another question, is it possible that this crash in DBACC was really fixed by the patch
for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during backup)? Is it possible
that the data was somehow corrupted during backup operation and this corruption later
caused crash in DBACC?

/Anatoly
[28 Nov 2006 19:49] Jonas Oreland
>the test has been running for about 36 hours now and data nodes have not
>crashed! I want to stop the test, restart the nodes with --initial, restore the
>database from the backup and start the test again. The backup was done with
>version 5.1.11 and may be it has something to do with the crash.

5.1.11 connection sounds unlikely.
but restarting "from scratch" sounds like a good idea.
it might increase likelyhood of bug.

>Another question, is it possible that this crash in DBACC was really fixed by
>the patch for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during
>backup)? Is it possible that the data was somehow corrupted during backup
>operation and this corruption later caused crash in DBACC?

No that's impossible...

But, http://bugs.mysql.com/bug.php?id=24039 could cause crash much later...

Thx again for helping out...

/Jonas
[28 Nov 2006 20:18] Anatoly Pidruchny
Jonas,

good news! After I restarted ndbd nodes with "--initial --core" and restored the database
from a backup, then continues the test, node 3 crashed again. I am attaching the
information from gdb that you asked.

/Anatoly
[28 Nov 2006 20:19] Anatoly Pidruchny
The value of *op in Dbacc::readTablePk from gdb

Attachment: gdb.out.log (application/octet-stream, text), 3.22 KiB.

[28 Nov 2006 20:45] Jonas Oreland
can you also attach the tracefile.

i'll analyse this later tonight or early tomorrow...

/Jonas
[28 Nov 2006 20:52] Anatoly Pidruchny
gzipped trace log from 11/28/06 crash

Attachment: ndb_3_trace.log.8.gz (application/x-gzip, text), 49.35 KiB.

[28 Nov 2006 20:52] Anatoly Pidruchny
> can you also attach the tracefile.

Just did.
[28 Nov 2006 22:21] Jonas Oreland
FYI: new traces has given some hints to what is happening.

Will now get some sleep, is unfortunatly quite busy tomorrow...but will try to 
continue on it...
[28 Nov 2006 22:29] Anatoly Pidruchny
Jonas,

thank you very much for working hard on this. I did some more rounds of testing. I can
confirm now that this kind of ndbd crash happens only after the data is restored from a
backup. If I do a rolling restart of data nodes with --initial then ndbd processes do not
crash any more. May be this information will help you to find the root cause of the
crash.

/Anatoly
[30 Nov 2006 15:16] Jonas Oreland
Potential fix

Attachment: patch (application/octet-stream, text), 702 bytes.

[30 Nov 2006 15:18] Jonas Oreland
Hi,

I attached patch, it looks like it's a "simple" typo

Could you apply it (it's on DbaccMain.cpp only)
and retest ?

Thx again...for brilliant help

/Jonas
[30 Nov 2006 20:46] Anatoly Pidruchny
Hi, Jonas,

I have applied the patch and our testing is in progress now. So far so good! I will try
more testing. But so far there are no crashes and I think the patch really fixes the
problem. Is there any chance the fix can go into 5.1.13? I heard that 5.1.13 is going to
be released very soon.

Thank you very much!

/Anatoly
[1 Dec 2006 8:54] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16274

ChangeSet@1.2333, 2006-12-01 08:54:28+01:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#24476
    Fix typo that yeilded rare crash in ACC
[6 Dec 2006 12:01] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of
that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available
version, including the bug fix. More information about accessing the source trees is
available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.14 changelog.