Bug #24476 | ndbd crashed because of failed ndbrequire in dbacc/DbaccMain.cpp. | ||
---|---|---|---|
Submitted: | 21 Nov 2006 15:30 | Modified: | 6 Dec 2006 11:01 |
Reporter: | Anatoly Pidruchny (Candidate Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 5.1.12 | OS: | Linux (Linux x86_64) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | DBACC, error 2341, Line 3194, ndbrequire |
[21 Nov 2006 15:30]
Anatoly Pidruchny
[21 Nov 2006 15:32]
Anatoly Pidruchny
Changed category to Cluster.
[21 Nov 2006 15:38]
Anatoly Pidruchny
Mysqlbug info
Attachment: mysqlbug.txt (text/plain), 3.46 KiB.
[21 Nov 2006 15:44]
Anatoly Pidruchny
Error log of the ndbd node
Attachment: ndb_3_error.log (application/octet-stream, text), 568 bytes.
[21 Nov 2006 15:44]
Anatoly Pidruchny
Out log of the ndbd node.
Attachment: ndb_3_out.log (application/octet-stream, text), 2.69 KiB.
[21 Nov 2006 15:45]
Anatoly Pidruchny
gzipped trace log of the ndbd node
Attachment: ndb_3_trace.log.1.gz (application/x-gzip, text), 44.57 KiB.
[21 Nov 2006 15:49]
Anatoly Pidruchny
Cluster log file
Attachment: ndb_1_cluster.log (application/octet-stream, text), 64.57 KiB.
[21 Nov 2006 15:51]
Anatoly Pidruchny
One more thing. After the data node crashed, it was not restarted by the angel process. May be it is OK, because I am not sure that the angel is supposed to restart it in the case like this.
[21 Nov 2006 15:54]
Jonas Oreland
Hi, Could you test bug fix for http://bugs.mysql.com/bug.php?id=24039 if you do index scan + delete, this is bug is quite easy to trigger.. /Jonas
[21 Nov 2006 16:46]
Anatoly Pidruchny
Hi, Jonas, I do not think it is the same problem. But I will definitely try the bug fix for http://bugs.mysql.com/bug.php?id=24039. Thanks, Anatoly.
[22 Nov 2006 13:07]
Valeriy Kravchuk
Please, try, and inform about the results.
[22 Nov 2006 15:42]
Anatoly Pidruchny
Yesterday I applied the patch for bug http://bugs.mysql.com/bug.php?id=24039 and then we ran our testing over night. This morning one of the nodes crashed again with the same error messages. Below is the contents of the ndb_3_error.log. Please let me know if you need more information. Time: Wednesday 22 November 2006 - 10:20:55 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbacc/DbaccMain.cpp Error object: DBACC (Line: 3194) 0x0000000a Program: ndbd Pid: 7709 Trace: /sm/mysql/ndb_data/ndb_3_trace.log.2 Version: Version 5.1.12 (beta) ***EOM*** Time: Wednesday 22 November 2006 - 10:26:28 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: restore.cpp Error object: RESTORE (Line: 1154) 0x0000000a Program: ndbd Pid: 9822 Trace: /sm/mysql/ndb_data/ndb_3_trace.log.3 Version: Version 5.1.12 (beta) ***EOM***
[22 Nov 2006 15:49]
Jonas Oreland
Hi, Thx for the help, 1) Can you supply sm/mysql/ndb_data/ndb_3_trace.log.3 (error during RESTORE) 2) Could you test running a debug compiled ndb version (or compile ndb with "export NDB_EXTRA_FLAGS=-DACC_SAFE_QUEUE" before compiling acc) I'v never seen this error before, basically it finds a uncommited record wo/ lock, which ofcourse is incorrect. Could you also say which transactions are running when node fails. (i.e for all the transactions that will get aborted on this error, list what they've done sofar and what they were trying to at failure) /Jonas
[22 Nov 2006 16:10]
Anatoly Pidruchny
gzipped ndb_3_trace.log.3 (error during RESTORE)
Attachment: ndb_3_trace.log.3.gz (application/x-gzip, text), 30.42 KiB.
[22 Nov 2006 16:59]
Anatoly Pidruchny
Jonas, As you asked, I have just submitted the sm/mysql/ndb_data/ndb_3_trace.log.3 (error during RESTORE). Next thing, I will compile debug version of MySQL/NDB. Do you want to know what transactions were running during the first crash (failed ndbrequire in DBACC) or the second crash (failed ndbrequire in RESTORE)? We are using our own server application (two of them on two nodes) that access NDB directly using NDB API. All the transactions are very small. Looking at our traces, the following transactions/operations were running during the first crash: ============== First node ===================== 1. 4 transactions doing: myOperation->nextResult(true, FORCE_EXEC != 0); Here myOperation is an NdbIndexScanOperation on a primary key of the table. The readTuples line was: myOperation->readTuples(NdbOperation::LM_CommittedRead, NdbScanOperation::SF_OrderBy | (lowToHigh ? 0 : NdbScanOperation::SF_Descending)) The transaction execute line was: myTransaction->execute(NdbTransaction::NoCommit, NdbTransaction::AbortOnError, FORCE_EXEC) 2. 2 transactions deleting a single record using a primary key access. 3. 6 transactions reading a single record using a primary key access. 4. 4 transactions updating a single record using a primary key access. ============== Second node ===================== 1. 2 transactions doing nextResult, as described above. 2. 3 transactions deleting a single record using a primary key access. 3. 2 transactions updating a single record using interpretedUpdateTuple(). The interpreted program does a simple check before the update and can choose not to update the record if the condition is not satisfied. 4. 1 transaction reading a single record using a primary key access. 5. 1 transaction updating a single record using a primary key access. During the second crash, no transactions failed in our application. But you can see what kinds of transactions we are running all the time from the information above.
[23 Nov 2006 21:40]
Jonas Oreland
Hi again, looking forward to hearing if compiling with ACC_SAFE_QUEUE produces an "earlier" crash the RESTORE crash is very very likely a "follup" bug from the ACC problem. /Jonas
[24 Nov 2006 0:50]
Anatoly Pidruchny
gzipped ndb_3_signal.log for the 11/23/06 crash
Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.
[24 Nov 2006 0:52]
Anatoly Pidruchny
gzipped trace log for the 11/23/06 crash
Attachment: ndb_3_trace.log.4.gz (application/x-gzip, text), 73.00 KiB.
[24 Nov 2006 0:53]
Anatoly Pidruchny
gzipped out log for the 11/23/06 crash
Attachment: ndb_3_out.log.gz (application/x-gzip, text), 182.80 KiB.
[24 Nov 2006 1:10]
Anatoly Pidruchny
Hi, Jonas. Yesterday I compiled the debug version of MySQL/NDB. I did not use the ACC_SAFE_QUEUE. Please let me know if you need one more round of testing with ACC_SAFE_QUEUE. After several hours of testing the ndbd process crashed again. I think this time it crashed earlier, looks like when the "start backup" was initiated by a cron job at 1 a.m. Here is what was written in the error log: ======================== ndb_3_error.log =============================== Time: Thursday 23 November 2006 - 01:00:25 Status: Temporary error, restart node Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug) Error: 6000 Error data: Signal 6 received; Aborted Error object: main.cpp Program: ndbd Pid: 5677 Trace: /sm/mysql/ndb_data/ndb_3_trace.log.4 Version: Version 5.1.12 (beta) ***EOM*** ======================== ndb_3_error.log =============================== I have just submitted gzipped ndb_3_signal.log, ndb_3_trace.log.4 and ndb_3_out.log. There is also a core file (core.5677) this time, but its size is 103624704. Please let me know if you need it. As I said, this time the process crashed when the database backup operation was started. Another complication was that this data node (node 3) was the only data node at the time when it crashed (01:00:29). We have a hardware issue with the second server. The server with data node 2 hung earlier (00:54:13), but NDB handled that very well. /Anatoly
[24 Nov 2006 13:27]
Anatoly Pidruchny
Jonas, I want to give you some of the information from GDB inspecting the core file of the aborted ndbd process. Please see the submitted file gdb.out.log. Please let me know if you need any additional info. /Anatoly
[24 Nov 2006 13:28]
Anatoly Pidruchny
Some info about the aborted ndbd process from GDB
Attachment: gdb.out.log (application/octet-stream, text), 8.62 KiB.
[24 Nov 2006 21:12]
Jonas Oreland
hi, that last crash is already fixed. http://bugs.mysql.com/bug.php?id=23502 or http://bugs.mysql.com/bug.php?id=23499 can you apply those patches and rerun your test. i'm very interested in finding acc-error (which is cause of subsequent restore-error) and a debug build will set ACC_SAFE_QUEUE /Jonas
[26 Nov 2006 4:36]
Anatoly Pidruchny
gzipped out log for the 11/25/06 crash
Attachment: ndb_3_out.log.gz (application/x-gzip, text), 6.36 KiB.
[26 Nov 2006 4:37]
Anatoly Pidruchny
gzipped signal log for the 11/25/06 crash
Attachment: ndb_3_signal.log.gz (application/x-gzip, text), 18.11 KiB.
[26 Nov 2006 4:37]
Anatoly Pidruchny
gzipped trace log for the 11/25/06 crash
Attachment: ndb_3_trace.log.5.gz (application/x-gzip, text), 52.57 KiB.
[26 Nov 2006 4:48]
Anatoly Pidruchny
Hi again. As requested, I have applied the patches for the bug http://bugs.mysql.com/bug.php?id=23502, as well as the one requested earlier: http://bugs.mysql.com/bug.php?id=24039. Then I compiled the debug version of MySQL/NDB, installed, started data nodes with --initial and restored the database from the backup. After several minutes the node 3 crashed again, even not under any serious load. Looks like the debug version did not catch any problem earlier. The error log looks the same: ================ ndb_3_error.log ================ Current byte-offset of file-pointer is: 568 Time: Saturday 25 November 2006 - 18:26:04 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: dbacc/DbaccMain.cpp Error object: DBACC (Line: 3194) 0x0000000e Program: ndbd Pid: 12012 Trace: /sm/mysql/ndb_data/ndb_3_trace.log.5 Version: Version 5.1.12 (beta) ***EOM*** ================ ndb_3_error.log ================ I have also submitted the gzipped signal, trace and out logs of the crashed data node 3. /Anatoly
[26 Nov 2006 19:49]
Jonas Oreland
Hi again... Again thx for the help. In the debug compiled version, can you start ndbd using "--core" (and set ulimit) Can you with gdb go to interesting frame and look at value of op->m_op_bits and also all of *op /Jonas
[27 Nov 2006 16:06]
Jonas Oreland
Any news ?
[27 Nov 2006 16:51]
Anatoly Pidruchny
Jonas, as you asked, I restarted ndbd nodes with --core. The test has been running from last night, but ndbd nodes just do not crash this time. Hopefully I will be able to reproduce the crash later today. /Anatoly
[28 Nov 2006 15:35]
Jonas Oreland
Any news ?
[28 Nov 2006 15:46]
Anatoly Pidruchny
Jonas, the test has been running for about 36 hours now and data nodes have not crashed! I want to stop the test, restart the nodes with --initial, restore the database from the backup and start the test again. The backup was done with version 5.1.11 and may be it has something to do with the crash. Another question, is it possible that this crash in DBACC was really fixed by the patch for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during backup)? Is it possible that the data was somehow corrupted during backup operation and this corruption later caused crash in DBACC? /Anatoly
[28 Nov 2006 18:49]
Jonas Oreland
>the test has been running for about 36 hours now and data nodes have not >crashed! I want to stop the test, restart the nodes with --initial, restore the >database from the backup and start the test again. The backup was done with >version 5.1.11 and may be it has something to do with the crash. 5.1.11 connection sounds unlikely. but restarting "from scratch" sounds like a good idea. it might increase likelyhood of bug. >Another question, is it possible that this crash in DBACC was really fixed by >the patch for the bug http://bugs.mysql.com/bug.php?id=23502 (crash during >backup)? Is it possible that the data was somehow corrupted during backup >operation and this corruption later caused crash in DBACC? No that's impossible... But, http://bugs.mysql.com/bug.php?id=24039 could cause crash much later... Thx again for helping out... /Jonas
[28 Nov 2006 19:18]
Anatoly Pidruchny
Jonas, good news! After I restarted ndbd nodes with "--initial --core" and restored the database from a backup, then continues the test, node 3 crashed again. I am attaching the information from gdb that you asked. /Anatoly
[28 Nov 2006 19:19]
Anatoly Pidruchny
The value of *op in Dbacc::readTablePk from gdb
Attachment: gdb.out.log (application/octet-stream, text), 3.22 KiB.
[28 Nov 2006 19:45]
Jonas Oreland
can you also attach the tracefile. i'll analyse this later tonight or early tomorrow... /Jonas
[28 Nov 2006 19:52]
Anatoly Pidruchny
gzipped trace log from 11/28/06 crash
Attachment: ndb_3_trace.log.8.gz (application/x-gzip, text), 49.35 KiB.
[28 Nov 2006 19:52]
Anatoly Pidruchny
> can you also attach the tracefile. Just did.
[28 Nov 2006 21:21]
Jonas Oreland
FYI: new traces has given some hints to what is happening. Will now get some sleep, is unfortunatly quite busy tomorrow...but will try to continue on it...
[28 Nov 2006 21:29]
Anatoly Pidruchny
Jonas, thank you very much for working hard on this. I did some more rounds of testing. I can confirm now that this kind of ndbd crash happens only after the data is restored from a backup. If I do a rolling restart of data nodes with --initial then ndbd processes do not crash any more. May be this information will help you to find the root cause of the crash. /Anatoly
[30 Nov 2006 14:16]
Jonas Oreland
Potential fix
Attachment: patch (application/octet-stream, text), 702 bytes.
[30 Nov 2006 14:18]
Jonas Oreland
Hi, I attached patch, it looks like it's a "simple" typo Could you apply it (it's on DbaccMain.cpp only) and retest ? Thx again...for brilliant help /Jonas
[30 Nov 2006 19:46]
Anatoly Pidruchny
Hi, Jonas, I have applied the patch and our testing is in progress now. So far so good! I will try more testing. But so far there are no crashes and I think the patch really fixes the problem. Is there any chance the fix can go into 5.1.13? I heard that 5.1.13 is going to be released very soon. Thank you very much! /Anatoly
[1 Dec 2006 7:54]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/16274 ChangeSet@1.2333, 2006-12-01 08:54:28+01:00, jonas@perch.ndb.mysql.com +1 -0 ndb - bug#24476 Fix typo that yeilded rare crash in ACC
[6 Dec 2006 11:01]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html Documented bugfix in 5.1.14 changelog.