Bug #28077 ndbrequire failure in LGMAN
Submitted: 24 Apr 2007 15:29 Modified: 7 Jan 2009 16:28
Reporter: Hartmut Holzgraefe Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S3 (Non-critical)
Version:5.1.16 OS:Linux
Assigned to: Jonas Oreland CPU Architecture:Any

[24 Apr 2007 15:29] Hartmut Holzgraefe
Description:
both nodes of a cluster show ndbrequire errors in LGMAN line 1427

   1426         // Only 1 chunk
   1427         ndbrequire(ptr.p->m_buffer_pages.getSize() == 2);
   1428         Uint32 tmp= consumer.m_current_page.m_idx + 1;
   1429         cnt= write_log_pages(signal, ptr, page, tmp);
   1430         assert(cnt <= tmp);

the errors might be related to send buffer overflows but are not necessarily
connected to these

How to repeat:
.
[10 Mar 2008 13:10] Tomas Ulin
We have tried to reproduce this bug have not been able to.

We would need more information on how to reproduce to bea ble to continue working on this bug.

BR,

Tomas
[10 Apr 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[16 Apr 2008 10:22] Susanne Ebrecht
Hartmut,

we still need Feedback from you here.
[16 May 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[9 Jul 2008 9:54] Chris Picton
Trace log

Attachment: ndb_3_trace.log.1.bz2 (application/x-bzip, text), 89.65 KiB.

[9 Jul 2008 9:56] Chris Picton
The trace file for this is attached above.
------------------------------------------
Time: Wednesday 9 July 2008 - 10:45:21
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 1462) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 8982
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1
Version: mysql-5.1.23 ndb-6.2.15
***EOM***
-----------------------------------------

Both my ndb nodes crashed at the same time.

The queries being run were:

mysql> update Raw_cdr set updated = 0 limit 100000;
Query OK, 100000 rows affected (34.67 sec)
Rows matched: 100000  Changed: 100000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 200000;
Query OK, 200000 rows affected (31.57 sec)
Rows matched: 200000  Changed: 200000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 500000;
Query OK, 500000 rows affected (3 min 34.57 sec)
Rows matched: 500000  Changed: 500000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 1000000;
ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort 
of transaction' from NDBCLUSTER

mysql> update Raw_cdr set updated = 0 limit 5;    
ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER
[9 Jul 2008 10:11] Pekka Nousiainen
change to OPEN so more info can be added by others(?)
[18 Jul 2008 17:26] Chris Picton
After truncating, dropping and recreating the Raw_cdr table, the bug has been triggered again:

Time: Friday 18 July 2008 - 18:58:14
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 1462) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 6525
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.5
Version: mysql-5.1.23 ndb-6.2.15
***EOM***

The log file has the following:
---------
changing file from -2145329152 to -2145329140
changing file from -2145329140 to -2145329152
miss matched writes
------------
And then after a while
--------------
2008-07-18 18:56:12 [ndbd] WARNING  -- Node 5 failed in report_sub_start_conf
2008-07-18 18:56:12 [ndbd] WARNING  -- Node 5 failed in report_sub_start_conf
2008-07-18 18:58:14 [ndbd] INFO     -- lgman.cpp
2008-07-18 18:58:14 [ndbd] INFO     -- LGMAN (Line: 1462) 0x0000000a
2008-07-18 18:58:14 [ndbd] INFO     -- Error handler shutting down system
2008-07-18 18:58:14 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-18 18:58:15 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
--------------

On one of the SQL nodes, I see the following in the logs:

080718 18:58:14 [ERROR] Got error 270 when reading table './CMS/Raw_cdr'

Any ideas?
[20 Dec 2008 19:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/62187

2784 Jonas Oreland	2008-12-20
      ndb - bug#28077 - remove incorrect ndbrequire in lgman, plus fix accounting of free_buffer_words
[20 Dec 2008 19:52] Bugs System
Pushed into 5.1.30-ndb-6.2.17 (revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (version source revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (merge vers: 5.1.30-ndb-6.2.17) (pib:6)
[20 Dec 2008 19:53] Bugs System
Pushed into 5.1.30-ndb-6.3.21 (revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (version source revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (merge vers: 5.1.30-ndb-6.3.21) (pib:6)
[20 Dec 2008 19:54] Bugs System
Pushed into 5.1.30-ndb-6.4.0 (revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (version source revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (merge vers: 5.1.30-ndb-6.4.0) (pib:6)
[20 Dec 2008 19:57] Jonas Oreland
Problem was incorrect ndbrequire, that "fired" when
undo-buffer was set "low" and starting high-load
so that undo-buffer was almost full.

also fixed in same commit, a related "accounting" bug
that caused lgman to get incorrect notion of free memory in
undo buffer, which could theoretically lead to buffer-overrun
(i.e new log-records was written to memory that had not yet been
 flushed), which could cause subsequent restart problems
[7 Jan 2009 16:28] Jon Stephens
Documented bugfix in the NDB-6.2.17 and NDB-6.3.21 changelogs, as follows:

        Starting a cluster under load such that Disk Data tables used
        most of the undo buffer could cause data node failures.

        The fix for this bug also corrected an issue in the LGMAN kernel
        block where the amount of free space left in the undo buffer was
        miscalculated, causing buffer overruns. This could cause records
        in the buffer to be overwritten, leading to problems when
        restarting data nodes.
[10 Feb 2009 20:13] Bugs System
Pushed into 6.0.10-alpha (revid:alik@sun.com-20090210194937-s7xshv5l3m1v7wi9) (version source revid:tomas.ulin@sun.com-20090108115759-b4yhuwkm6w8tg7j3) (merge vers: 6.0.10-alpha) (pib:6)
[12 Feb 2009 10:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66011

2814 He Zhenxing	2009-02-12 [merge]
      Auto merge 6.0 -> 6.0-rpl