Bug #24476 ndbd crashed because of failed ndbrequire in dbacc/DbaccMain.cpp.
Submitted: 21 Nov 2006 15:30 Modified: 6 Dec 2006 11:01
Reporter: Anatoly Pidruchny (Candidate Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.1.12 OS:Linux (Linux x86_64)
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: DBACC, error 2341, Line 3194, ndbrequire

[21 Nov 2006 15:30] Anatoly Pidruchny
Description:
ndbd node crashed because of an internal program error (failed ndbrequire). The error log (ndb_3_error.log) is copy-pasted below. Please see the attached out log, trace log, cluster log and information from mysqlbug for more details.

The NDB Cluster was doing fine under our test load for several hours. Then it looked like ndbd processes started to take high CPU and operations started to become very slow. Finally node 3 crashed. The other data node 2 stayed up, CPU utilization lowered and DB operations started to work fast again.

******************** ndb_3_error.log ************************************
Current byte-offset of file-pointer is: 568                       

Time: Tuesday 21 November 2006 - 09:08:00
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000a
Program: ndbd
Pid: 4986
Trace: /sm/mysql/data/ndb_3_trace.log.1
Version: Version 5.1.12 (beta)
***EOM***
******************** ndb_3_error.log ************************************

How to repeat:
It is unknown if the problem is easily reproducible. NDB Cluster has to be running under load for several hours.
[21 Nov 2006 15:32] Anatoly Pidruchny
Changed category to Cluster.
[21 Nov 2006 15:38] Anatoly Pidruchny
Mysqlbug info

Attachment: mysqlbug.txt (text/plain), 3.46 KiB.

[21 Nov 2006 15:44] Anatoly Pidruchny
Error log of the ndbd node

Attachment: ndb_3_error.log (application/octet-stream, text), 568 bytes.

[21 Nov 2006 15:44] Anatoly Pidruchny
Out log of the ndbd node.

Attachment: ndb_3_out.log (application/octet-stream, text), 2.69 KiB.

[21 Nov 2006 15:45] Anatoly Pidruchny
gzipped trace log of the ndbd node

Attachment: ndb_3_trace.log.1.gz (application/x-gzip, text), 44.57 KiB.

[21 Nov 2006 15:49] Anatoly Pidruchny
Cluster log file

Attachment: ndb_1_cluster.log (application/octet-stream, text), 64.57 KiB.

[21 Nov 2006 15:51] Anatoly Pidruchny
One more thing. After the data node crashed, it was not restarted by the angel process. May be it is OK, because I am not sure that the angel is supposed to restart it in the case like this.
[21 Nov 2006 15:54] Jonas Oreland
Hi,

Could you test bug fix for http://bugs.mysql.com/bug.php?id=24039
if you do index scan + delete, this is bug is quite easy to trigger..

/Jonas
[21 Nov 2006 16:46] Anatoly Pidruchny
Hi, Jonas,

I do not think it is the same problem. But I will definitely try the bug fix for http://bugs.mysql.com/bug.php?id=24039.

Thanks,
Anatoly.
[22 Nov 2006 13:07] Valeriy Kravchuk
Please, try, and inform about the results.
[22 Nov 2006 15:42] Anatoly Pidruchny
Yesterday I applied the patch for bug http://bugs.mysql.com/bug.php?id=24039 and then we ran our testing over night. This morning one of the nodes crashed again with the same error messages. Below is the contents of the ndb_3_error.log. Please let me know if you need more information.

Time: Wednesday 22 November 2006 - 10:20:55
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000a
Program: ndbd
Pid: 7709
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.2
Version: Version 5.1.12 (beta)
***EOM***
                                                                                     
Time: Wednesday 22 November 2006 - 10:26:28
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 1154) 0x0000000a
Program: ndbd
Pid: 9822
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.3
Version: Version 5.1.12 (beta)
***EOM***
[22 Nov 2006 15:49] Jonas Oreland
Hi,

Thx for the help,

1) Can you supply sm/mysql/ndb_data/ndb_3_trace.log.3
   (error during RESTORE)

2) Could you test running a debug compiled ndb version
   (or compile ndb with "export NDB_EXTRA_FLAGS=-DACC_SAFE_QUEUE" 
    before compiling acc)

I'v never seen this error before,
basically it finds a uncommited record wo/ lock, which ofcourse is incorrect.

Could you also say which transactions are running when node fails.
(i.e for all the transactions that will get aborted on this error, list what they've done sofar and what they were trying to at failure)

/Jonas
[22 Nov 2006 16:10] Anatoly Pidruchny
gzipped ndb_3_trace.log.3 (error during RESTORE)

Attachment: ndb_3_trace.log.3.gz (application/x-gzip, text), 30.42 KiB.

[22 Nov 2006 16:59] Anatoly Pidruchny
Jonas,

As you asked, I have just submitted the sm/mysql/ndb_data/ndb_3_trace.log.3 (error during RESTORE).

Next thing, I will compile debug version of MySQL/NDB.

Do you want to know what transactions were running during the first crash (failed ndbrequire in DBACC) or the second crash (failed ndbrequire in RESTORE)? 
We are using our own server application (two of them on two nodes) that access NDB directly using NDB API. All the transactions are very small. Looking at our traces, the following transactions/operations were running during the first crash:

============== First node =====================

1. 4 transactions doing:
myOperation->nextResult(true, FORCE_EXEC != 0);
Here myOperation is an NdbIndexScanOperation on a primary key of the table. The readTuples line was:

myOperation->readTuples(NdbOperation::LM_CommittedRead,
                                NdbScanOperation::SF_OrderBy | (lowToHigh ? 0 : NdbScanOperation::SF_Descending))

The transaction execute line was:

myTransaction->execute(NdbTransaction::NoCommit, NdbTransaction::AbortOnError, FORCE_EXEC)

2. 2 transactions deleting a single record using a primary key access.
3. 6 transactions reading a single record using a primary key access.
4. 4 transactions updating a single record using a primary key access.

============== Second node =====================

1. 2 transactions doing nextResult, as described above.
2. 3 transactions deleting a single record using a primary key access.
3. 2 transactions updating a single record using interpretedUpdateTuple(). The interpreted program does a simple check before the update and can choose not to update the record if the condition is not satisfied.
4. 1 transaction reading a single record using a primary key access.
5. 1 transaction updating a single record using a primary key access.

During the second crash, no transactions failed in our application. But you can see what kinds of transactions we are running all the time from the information above.
[23 Nov 2006 21:40] Jonas Oreland
Hi again,

looking forward to hearing if compiling with ACC_SAFE_QUEUE produces an "earlier" crash

the RESTORE crash is very very likely a "follup" bug from the ACC problem.

/Jonas
[24 Nov 2006 0:50] Anatoly Pidruchny
gzipped ndb_3_signal.log for the 11/23/06 crash

Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.

[24 Nov 2006 0:52] Anatoly Pidruchny
gzipped trace log for the 11/23/06 crash

Attachment: ndb_3_trace.log.4.gz (application/x-gzip, text), 73.00 KiB.

[24 Nov 2006 0:53] Anatoly Pidruchny
gzipped out log for the 11/23/06 crash

Attachment: ndb_3_out.log.gz (application/x-gzip, text), 182.80 KiB.

[24 Nov 2006 1:10] Anatoly Pidruchny
Hi, Jonas.

Yesterday I compiled the debug version of MySQL/NDB. I did not use the ACC_SAFE_QUEUE. Please let me know if you need one more round of testing with ACC_SAFE_QUEUE. After several hours of testing the ndbd process crashed again. I think this time it crashed earlier, looks like when the "start backup" was initiated by a cron job at 1 a.m. Here is what was written in the error log:

======================== ndb_3_error.log ===============================
Time: Thursday 23 November 2006 - 01:00:25
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: main.cpp
Program: ndbd
Pid: 5677
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.4
Version: Version 5.1.12 (beta)
***EOM***
======================== ndb_3_error.log ===============================

I have just submitted gzipped ndb_3_signal.log, ndb_3_trace.log.4 and ndb_3_out.log. There is also a core file (core.5677) this time, but its size is 103624704. Please let me know if you need it.

As I said, this time the process crashed when the database backup operation was started. Another complication was that this data node (node 3) was the only data node at the time when it crashed (01:00:29). We have a hardware issue with the second server. The server with data node 2 hung earlier (00:54:13), but NDB handled that very well.

/Anatoly
[24 Nov 2006 13:27] Anatoly Pidruchny
Jonas,

I want to give you some of the information from GDB inspecting the core file of the aborted ndbd process. Please see the submitted file gdb.out.log. Please let me know if you need any additional info.

/Anatoly
[24 Nov 2006 13:28] Anatoly Pidruchny
Some info about the aborted ndbd process from GDB

Attachment: gdb.out.log (application/octet-stream, text), 8.62 KiB.

[24 Nov 2006 21:12] Jonas Oreland
hi,

that last crash is already fixed.
http://bugs.mysql.com/bug.php?id=23502 or 
http://bugs.mysql.com/bug.php?id=23499

can you apply those patches and rerun your test.

i'm very interested in finding acc-error
  (which is cause of subsequent restore-error)

and a debug build will set ACC_SAFE_QUEUE

/Jonas
[26 Nov 2006 4:36] Anatoly Pidruchny
gzipped out log for the 11/25/06 crash

Attachment: ndb_3_out.log.gz (application/x-gzip, text), 6.36 KiB.

[26 Nov 2006 4:37] Anatoly Pidruchny
gzipped signal log for the 11/25/06 crash

Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.

[26 Nov 2006 4:37] Anatoly Pidruchny
gzipped trace log for the 11/25/06 crash

Attachment: ndb_3_trace.log.5.gz (application/x-gzip, text), 52.57 KiB.

[26 Nov 2006 4:48] Anatoly Pidruchny
Hi again.

As requested, I have applied the patches for the bug http://bugs.mysql.com/bug.php?id=23502, as well as the one requested earlier: http://bugs.mysql.com/bug.php?id=24039. Then I compiled the debug version of MySQL/NDB, installed, started data nodes with --initial and restored the database from the backup. After several minutes the node 3 crashed again, even not under any serious load. Looks like the debug version did not catch any problem earlier. The error log looks the same:

================ ndb_3_error.log ================
Current byte-offset of file-pointer is: 568                       

Time: Saturday 25 November 2006 - 18:26:04
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbacc/DbaccMain.cpp
Error object: DBACC (Line: 3194) 0x0000000e
Program: ndbd
Pid: 12012
Trace: /sm/mysql/ndb_data/ndb_3_trace.log.5
Version: Version 5.1.12 (beta)
***EOM***
================ ndb_3_error.log ================

I have also submitted the gzipped signal, trace and out logs of the crashed data node 3.

/Anatoly
[26 Nov 2006 19:49] Jonas Oreland
Hi again...

Again thx for the help.

In the debug compiled version, can you start ndbd using "--core"
(and set ulimit)

Can you with gdb go to interesting frame
and look at value of op->m_op_bits
and also all of *op

/Jonas
[27 Nov 2006 16:06] Jonas Oreland
Any news ?
[27 Nov 2006 16:51] Anatoly Pidruchny
Jonas,

as you asked, I restarted ndbd nodes with --core. The test has been running from last night, but ndbd nodes just do not crash this time. Hopefully I will be able to reproduce the crash later today.

/Anatoly
[28 Nov 2006 15:35] Jonas Oreland
Any news ?
[28 Nov 2006 15:46] Anatoly Pidruchny
Jonas,

the test has been running for about 36 hours now and data nodes have not crashed! I want to stop the test, restart the nodes with --initial, restore the database from the backup and start the test again. The backup was done with version 5.1.11 and may be it has something to do with the crash.

Another question, is it possible that this crash in DBACC was really fixed by the patch for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during backup)? Is it possible that the data was somehow corrupted during backup operation and this corruption later caused crash in DBACC?

/Anatoly
[28 Nov 2006 18:49] Jonas Oreland
>the test has been running for about 36 hours now and data nodes have not
>crashed! I want to stop the test, restart the nodes with --initial, restore the
>database from the backup and start the test again. The backup was done with
>version 5.1.11 and may be it has something to do with the crash.

5.1.11 connection sounds unlikely.
but restarting "from scratch" sounds like a good idea.
it might increase likelyhood of bug.

>Another question, is it possible that this crash in DBACC was really fixed by
>the patch for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during
>backup)? Is it possible that the data was somehow corrupted during backup
>operation and this corruption later caused crash in DBACC?

No that's impossible...

But, http://bugs.mysql.com/bug.php?id=24039 could cause crash much later...

Thx again for helping out...

/Jonas
[28 Nov 2006 19:18] Anatoly Pidruchny
Jonas,

good news! After I restarted ndbd nodes with "--initial --core" and restored the database from a backup, then continues the test, node 3 crashed again. I am attaching the information from gdb that you asked.

/Anatoly
[28 Nov 2006 19:19] Anatoly Pidruchny
The value of *op in Dbacc::readTablePk from gdb

Attachment: gdb.out.log (application/octet-stream, text), 3.22 KiB.

[28 Nov 2006 19:45] Jonas Oreland
can you also attach the tracefile.

i'll analyse this later tonight or early tomorrow...

/Jonas
[28 Nov 2006 19:52] Anatoly Pidruchny
gzipped trace log from 11/28/06 crash

Attachment: ndb_3_trace.log.8.gz (application/x-gzip, text), 49.35 KiB.

[28 Nov 2006 19:52] Anatoly Pidruchny
> can you also attach the tracefile.

Just did.
[28 Nov 2006 21:21] Jonas Oreland
FYI: new traces has given some hints to what is happening.

Will now get some sleep, is unfortunatly quite busy tomorrow...but will try to 
continue on it...
[28 Nov 2006 21:29] Anatoly Pidruchny
Jonas,

thank you very much for working hard on this. I did some more rounds of testing. I can confirm now that this kind of ndbd crash happens only after the data is restored from a backup. If I do a rolling restart of data nodes with --initial then ndbd processes do not crash any more. May be this information will help you to find the root cause of the crash.

/Anatoly
[30 Nov 2006 14:16] Jonas Oreland
Potential fix

Attachment: patch (application/octet-stream, text), 702 bytes.

[30 Nov 2006 14:18] Jonas Oreland
Hi,

I attached patch, it looks like it's a "simple" typo

Could you apply it (it's on DbaccMain.cpp only)
and retest ?

Thx again...for brilliant help

/Jonas
[30 Nov 2006 19:46] Anatoly Pidruchny
Hi, Jonas,

I have applied the patch and our testing is in progress now. So far so good! I will try more testing. But so far there are no crashes and I think the patch really fixes the problem. Is there any chance the fix can go into 5.1.13? I heard that 5.1.13 is going to be released very soon.

Thank you very much!

/Anatoly
[1 Dec 2006 7:54] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16274

ChangeSet@1.2333, 2006-12-01 08:54:28+01:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#24476
    Fix typo that yeilded rare crash in ACC
[6 Dec 2006 11:01] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.14 changelog.