MySQL Bugs: #28077: ndbrequire failure in LGMAN

Bug #28077	ndbrequire failure in LGMAN
Submitted:	24 Apr 2007 15:29	Modified:	7 Jan 2009 16:28
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Disk Data	Severity:	S3 (Non-critical)
Version:	5.1.16	OS:	Linux
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
both nodes of a cluster show ndbrequire errors in LGMAN line 1427

   1426         // Only 1 chunk
   1427         ndbrequire(ptr.p->m_buffer_pages.getSize() == 2);
   1428         Uint32 tmp= consumer.m_current_page.m_idx + 1;
   1429         cnt= write_log_pages(signal, ptr, page, tmp);
   1430         assert(cnt <= tmp);

the errors might be related to send buffer overflows but are not necessarily
connected to these

How to repeat:
.

We have tried to reproduce this bug have not been able to.

We would need more information on how to reproduce to bea ble to continue working on this bug.

BR,

Tomas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Hartmut,

we still need Feedback from you here.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Trace log

Attachment: ndb_3_trace.log.1.bz2 (application/x-bzip, text), 89.65 KiB.

The trace file for this is attached above.
------------------------------------------
Time: Wednesday 9 July 2008 - 10:45:21
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 1462) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 8982
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1
Version: mysql-5.1.23 ndb-6.2.15
***EOM***
-----------------------------------------

Both my ndb nodes crashed at the same time.

The queries being run were:

mysql> update Raw_cdr set updated = 0 limit 100000;
Query OK, 100000 rows affected (34.67 sec)
Rows matched: 100000  Changed: 100000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 200000;
Query OK, 200000 rows affected (31.57 sec)
Rows matched: 200000  Changed: 200000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 500000;
Query OK, 500000 rows affected (3 min 34.57 sec)
Rows matched: 500000  Changed: 500000  Warnings: 0

mysql> update Raw_cdr set updated = 0 limit 1000000;
ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort 
of transaction' from NDBCLUSTER

mysql> update Raw_cdr set updated = 0 limit 5;    
ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER

change to OPEN so more info can be added by others(?)

After truncating, dropping and recreating the Raw_cdr table, the bug has been triggered again:

Time: Friday 18 July 2008 - 18:58:14
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 1462) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 6525
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.5
Version: mysql-5.1.23 ndb-6.2.15
***EOM***

The log file has the following:
---------
changing file from -2145329152 to -2145329140
changing file from -2145329140 to -2145329152
miss matched writes
------------
And then after a while
--------------
2008-07-18 18:56:12 [ndbd] WARNING  -- Node 5 failed in report_sub_start_conf
2008-07-18 18:56:12 [ndbd] WARNING  -- Node 5 failed in report_sub_start_conf
2008-07-18 18:58:14 [ndbd] INFO     -- lgman.cpp
2008-07-18 18:58:14 [ndbd] INFO     -- LGMAN (Line: 1462) 0x0000000a
2008-07-18 18:58:14 [ndbd] INFO     -- Error handler shutting down system
2008-07-18 18:58:14 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-18 18:58:15 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
--------------

On one of the SQL nodes, I see the following in the logs:

080718 18:58:14 [ERROR] Got error 270 when reading table './CMS/Raw_cdr'

Any ideas?

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/62187

2784 Jonas Oreland	2008-12-20
      ndb - bug#28077 - remove incorrect ndbrequire in lgman, plus fix accounting of free_buffer_words

Pushed into 5.1.30-ndb-6.2.17 (revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (version source revid:jonas@mysql.com-20081220194844-k181w30wbbeprkaj) (merge vers: 5.1.30-ndb-6.2.17) (pib:6)

Pushed into 5.1.30-ndb-6.3.21 (revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (version source revid:jonas@mysql.com-20081220195109-tjimr7u6gkp210np) (merge vers: 5.1.30-ndb-6.3.21) (pib:6)

Pushed into 5.1.30-ndb-6.4.0 (revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (version source revid:jonas@mysql.com-20081220195224-ql5g2lhjlpxthg69) (merge vers: 5.1.30-ndb-6.4.0) (pib:6)

Problem was incorrect ndbrequire, that "fired" when
undo-buffer was set "low" and starting high-load
so that undo-buffer was almost full.

also fixed in same commit, a related "accounting" bug
that caused lgman to get incorrect notion of free memory in
undo buffer, which could theoretically lead to buffer-overrun
(i.e new log-records was written to memory that had not yet been
 flushed), which could cause subsequent restart problems

Documented bugfix in the NDB-6.2.17 and NDB-6.3.21 changelogs, as follows:

        Starting a cluster under load such that Disk Data tables used
        most of the undo buffer could cause data node failures.

        The fix for this bug also corrected an issue in the LGMAN kernel
        block where the amount of free space left in the undo buffer was
        miscalculated, causing buffer overruns. This could cause records
        in the buffer to be overwritten, leading to problems when
        restarting data nodes.

Pushed into 6.0.10-alpha (revid:alik@sun.com-20090210194937-s7xshv5l3m1v7wi9) (version source revid:tomas.ulin@sun.com-20090108115759-b4yhuwkm6w8tg7j3) (merge vers: 6.0.10-alpha) (pib:6)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66011

2814 He Zhenxing	2009-02-12 [merge]
      Auto merge 6.0 -> 6.0-rpl