Bug #28077 | ndbrequire failure in LGMAN | ||
---|---|---|---|
Submitted: | 24 Apr 2007 15:29 | Modified: | 7 Jan 2009 16:28 |
Reporter: | Hartmut Holzgraefe | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Disk Data | Severity: | S3 (Non-critical) |
Version: | 5.1.16 | OS: | Linux |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[24 Apr 2007 15:29]
Hartmut Holzgraefe
[10 Mar 2008 13:10]
Tomas Ulin
We have tried to reproduce this bug have not been able to. We would need more information on how to reproduce to bea ble to continue working on this bug. BR, Tomas
[10 Apr 2008 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[16 Apr 2008 10:22]
Susanne Ebrecht
Hartmut, we still need Feedback from you here.
[16 May 2008 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[9 Jul 2008 9:54]
Chris Picton
Trace log
Attachment: ndb_3_trace.log.1.bz2 (application/x-bzip, text), 89.65 KiB.
[9 Jul 2008 9:56]
Chris Picton
The trace file for this is attached above. ------------------------------------------ Time: Wednesday 9 July 2008 - 10:45:21 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: lgman.cpp Error object: LGMAN (Line: 1462) 0x0000000a Program: /usr/sbin/ndbd Pid: 8982 Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1 Version: mysql-5.1.23 ndb-6.2.15 ***EOM*** ----------------------------------------- Both my ndb nodes crashed at the same time. The queries being run were: mysql> update Raw_cdr set updated = 0 limit 100000; Query OK, 100000 rows affected (34.67 sec) Rows matched: 100000 Changed: 100000 Warnings: 0 mysql> update Raw_cdr set updated = 0 limit 200000; Query OK, 200000 rows affected (31.57 sec) Rows matched: 200000 Changed: 200000 Warnings: 0 mysql> update Raw_cdr set updated = 0 limit 500000; Query OK, 500000 rows affected (3 min 34.57 sec) Rows matched: 500000 Changed: 500000 Warnings: 0 mysql> update Raw_cdr set updated = 0 limit 1000000; ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER mysql> update Raw_cdr set updated = 0 limit 5; ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER
[9 Jul 2008 10:11]
Pekka Nousiainen
change to OPEN so more info can be added by others(?)
[18 Jul 2008 17:26]
Chris Picton
After truncating, dropping and recreating the Raw_cdr table, the bug has been triggered again: Time: Friday 18 July 2008 - 18:58:14 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: lgman.cpp Error object: LGMAN (Line: 1462) 0x0000000a Program: /usr/sbin/ndbd Pid: 6525 Trace: /var/lib/mysql-cluster/ndb_4_trace.log.5 Version: mysql-5.1.23 ndb-6.2.15 ***EOM*** The log file has the following: --------- changing file from -2145329152 to -2145329140 changing file from -2145329140 to -2145329152 miss matched writes ------------ And then after a while -------------- 2008-07-18 18:56:12 [ndbd] WARNING -- Node 5 failed in report_sub_start_conf 2008-07-18 18:56:12 [ndbd] WARNING -- Node 5 failed in report_sub_start_conf 2008-07-18 18:58:14 [ndbd] INFO -- lgman.cpp 2008-07-18 18:58:14 [ndbd] INFO -- LGMAN (Line: 1462) 0x0000000a 2008-07-18 18:58:14 [ndbd] INFO -- Error handler shutting down system 2008-07-18 18:58:14 [ndbd] INFO -- Error handler shutdown completed - exiting 2008-07-18 18:58:15 [ndbd] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. -------------- On one of the SQL nodes, I see the following in the logs: 080718 18:58:14 [ERROR] Got error 270 when reading table './CMS/Raw_cdr' Any ideas?
[20 Dec 2008 19:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62187 2784 Jonas Oreland 2008-12-20 ndb - bug#28077 - remove incorrect ndbrequire in lgman, plus fix accounting of free_buffer_words
[20 Dec 2008 19:52]
Bugs System
Pushed into 5.1.30-ndb-6.2.17 (revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (version source revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (merge vers: 5.1.30-ndb-6.2.17) (pib:6)
[20 Dec 2008 19:53]
Bugs System
Pushed into 5.1.30-ndb-6.3.21 (revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (version source revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (merge vers: 5.1.30-ndb-6.3.21) (pib:6)
[20 Dec 2008 19:54]
Bugs System
Pushed into 5.1.30-ndb-6.4.0 (revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (version source revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (merge vers: 5.1.30-ndb-6.4.0) (pib:6)
[20 Dec 2008 19:57]
Jonas Oreland
Problem was incorrect ndbrequire, that "fired" when undo-buffer was set "low" and starting high-load so that undo-buffer was almost full. also fixed in same commit, a related "accounting" bug that caused lgman to get incorrect notion of free memory in undo buffer, which could theoretically lead to buffer-overrun (i.e new log-records was written to memory that had not yet been flushed), which could cause subsequent restart problems
[7 Jan 2009 16:28]
Jon Stephens
Documented bugfix in the NDB-6.2.17 and NDB-6.3.21 changelogs, as follows: Starting a cluster under load such that Disk Data tables used most of the undo buffer could cause data node failures. The fix for this bug also corrected an issue in the LGMAN kernel block where the amount of free space left in the undo buffer was miscalculated, causing buffer overruns. This could cause records in the buffer to be overwritten, leading to problems when restarting data nodes.
[10 Feb 2009 20:13]
Bugs System
Pushed into 6.0.10-alpha (revid:alik@sun.com-20090210194937-s7xshv5l3m1v7wi9) (version source revid:tomas.ulin@sun.com-20090108115759-b4yhuwkm6w8tg7j3) (merge vers: 6.0.10-alpha) (pib:6)
[12 Feb 2009 10:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/66011 2814 He Zhenxing 2009-02-12 [merge] Auto merge 6.0 -> 6.0-rpl