Bug #66840 Repeatable ndb node crashes with error 2341
Submitted: 17 Sep 2012 4:47
Modified: 14 Jul 2016 8:57
Reporter: vladysla chrn
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 7.2.6
OS: Linux (Red Hat Enterprise Linux Server release 6.1)
Assigned to: Bogdan Kecman
CPU Architecture: Any
Tags: 1309, DBSPJ, failed ndbrequire, ndbmtd, SimulatedBlock.cpp

[17 Sep 2012 4:47] vladysla chrn
Description:
We configured a cluster with 2 data nodes + 2 management nodes in production. It worked well for 2 months, but over the last 3 days we have experienced repeated ndb data node crashes with these messages:

Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1309) 0x00000002
Program: ndbmtd
Pid: 20778 thr: 0
Version: mysql-5.5.22 ndb-7.2.6
Trace: /NDB-GC/data/ndb_30_trace.log.6 [t1..t7]

Sometimes it affects only 1 node, sometimes all data nodes at the same time.
We managed to get rid of this error by shutting down 1 node and leaving only 1 data node in the working set. But after bringing up the second node, the crashes appeared again...

How to repeat:
Can't identify how to repeat it.
[17 Sep 2012 4:57] vladysla chrn
Ndb error report for this issue

Attachment: ndb_error_report_20120916234857.tar.bz2 (application/octet-stream, text), 300.74 KiB.

[13 Nov 2012 15:42] Russell Knighton
I think I have now encountered this error twice. Here are the relevant snips from the error log.

-- First instance --
Time: Tuesday 4 September 2012 - 16:31:08
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 99090 thr: 19
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.5 [t1..t29]

-- latest instance --
Time: Tuesday 13 November 2012 - 14:38:56
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 122615 thr: 22
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.8 [t1..t29]

I will attach the log files if they will be of any use - but they will be quite large of course.
[15 Nov 2012 10:41] Russell Knighton
Okay, this has suddenly become a major issue and will now prevent us going live.

It has happened again:
Time: Wednesday 14 November 2012 - 19:41:54
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 27148 thr: 24
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.9 [t1..t29]
***EOM***

Has anyone looked into this?

Would whoever is investigating this bug like all/any of the log files to help pin-point the problem?
[15 Nov 2012 13:11] Russell Knighton
And now it appears I may not be able to restart the node. It's happened again, this time in multiple threads:
Time: Thursday 15 November 2012 - 12:49:16
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 58906 thr: 23
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.10 [t1..t29]
***EOM***
                                                       
Time: Thursday 15 November 2012 - 12:49:16
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 58906 thr: 18
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.10 [t1..t29]
***EOM***
                                                       
Time: Thursday 15 November 2012 - 12:49:16
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 58906 thr: 22
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.10 [t1..t29]
***EOM***
                                                       
Time: Thursday 15 November 2012 - 12:49:16
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 58906 thr: 24
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.10 [t1..t29]
***EOM***

Could someone please give some pointers where I should be looking to diagnose the cause of this?
[15 Nov 2012 14:18] Russell Knighton
And again...

My suspicions are correct. I am now unable to restart my cluster node:

Time: Thursday 15 November 2012 - 14:16:29
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 123993 thr: 18
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.11 [t1..t29]
***EOM***
                                                      
Time: Thursday 15 November 2012 - 14:16:29
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 123993 thr: 19
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.11 [t1..t29]
***EOM***
                                                      
Time: Thursday 15 November 2012 - 14:16:29
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 123993 thr: 25
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.11 [t1..t29]
***EOM***
                                                      
Time: Thursday 15 November 2012 - 14:16:29
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 123993 thr: 20
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.11 [t1..t29]
***EOM***
[16 Nov 2012 9:42] Russell Knighton
And Again.

This is Node1:
=======================================================================
Time: Friday 16 November 2012 - 06:02:10
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 21826 thr: 22
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.12 [t1..t29]
***EOM***
                                                         
Time: Friday 16 November 2012 - 06:02:10
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 21826 thr: 24
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_11_trace.log.12 [t1..t29]
***EOM***
=======================================================================

Node 2:
=======================================================================
Time: Friday 16 November 2012 - 06:01:27
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000006
Program: ndbmtd
Pid: 42789 thr: 22
Version: mysql-5.5.25 ndb-7.2.7
Trace: /srv/data/cluster/ndb_data/ndb_12_trace.log.5 [t1..t29]
***EOM***
=======================================================================

Can someone please comment on this bug?
[16 Nov 2012 14:27] Russell Knighton
File uploaded to FTP with ndb_error_report output.

File-name: bug-data-66840.tar.bz2
MD5: d9c43059ae5ce4d05d848c6ad3120f11
[5 Dec 2012 13:07] Russell Knighton
Just to confirm that this issue is not resolved in 7.2.9:

Time: Tuesday 4 December 2012 - 19:13:36
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1263) 0x00000002
Program: ndbmtd
Pid: 30474 thr: 24
Version: mysql-5.5.28 ndb-7.2.9
Trace: /srv/data/cluster/ndb_data/ndb_12_trace.log.6 [t1..t29]
***EOM***
[27 May 2013 6:59] Alexey Asemov
Confirming the same issue:

Current byte-offset of file-pointer is: 1067

Time: Monday 20 May 2013 - 11:18:52
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1297) 0x00000000
Program: ndbmtd
Pid: 15796 thr: 0
Version: mysql-5.5.30 ndb-7.2.12
Trace: /db/cluster/ndbd/ndb_1_trace.log.1 [t1..t4]
***EOM***

Time: Monday 27 May 2013 - 10:42:37
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBSPJ (Line: 1297) 0x00000000
Program: ndbmtd
Pid: 29516 thr: 0
Version: mysql-5.5.30 ndb-7.2.12
Trace: /db/cluster/ndbd/ndb_1_trace.log.2 [t1..t4]
***EOM***
[23 Sep 2014 13:10] Hartmut Holzgraefe
I'm currently looking at a post mortem of a 7.2.15 cluster that was killed by the same ndbrequire(). Three nodes hit the same ndbrequire within a second, and the fourth node failed because it could not continue working on its own.

The line number in SimulatedBlock.cpp is 1299 with 7.2.15 but it is the same

    ndbrequire(ss == SEND_OK || ss == SEND_BLOCKED || ss == SEND_DISCONNECTED);

though.

From a quick look at the prepareSend() method that returned the 'ss' value I can see that it can also return

  SEND_BUFFER_FULL, SEND_MESSAGE_TOO_BIG, SEND_UNKNOWN_NODE

which could all trigger the ndbrequire() we're seeing here.

I *think* we can rule out SEND_UNKNOWN_NODE ...?

So it could be either SEND_BUFFER_FULL or SEND_MESSAGE_TOO_BIG ...
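
To make the reasoning above concrete, here is a minimal sketch of the check. It is not the NDB code: the enum is a local stand-in (the real definition lives in the NDB sources, and the ordering used here is an assumption), and passesSendCheck() is a hypothetical helper that only mirrors the ndbrequire() condition quoted above.

```cpp
// Local stand-in for the send status values named in this report.
// Assumption: the real enum in the NDB sources may order or number these differently.
enum class SendStatus {
  SEND_OK,
  SEND_BLOCKED,
  SEND_DISCONNECTED,
  SEND_BUFFER_FULL,
  SEND_MESSAGE_TOO_BIG,
  SEND_UNKNOWN_NODE
};

// Hypothetical helper mirroring the condition inside the failing ndbrequire():
// only three of the six values are accepted; any other value would crash the node.
bool passesSendCheck(SendStatus ss) {
  return ss == SendStatus::SEND_OK ||
         ss == SendStatus::SEND_BLOCKED ||
         ss == SendStatus::SEND_DISCONNECTED;
}
```

In this sketch, passesSendCheck(SendStatus::SEND_BUFFER_FULL) and passesSendCheck(SendStatus::SEND_MESSAGE_TOO_BIG) both return false, which corresponds to the two candidates left after ruling out SEND_UNKNOWN_NODE.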

Now looking again at prepareSend() in TransporterRegistry.cpp
I can see:

        [...]
        WARNING("Signal to " << nodeId << " lost(buffer)");
        report_error(nodeId, TE_SIGNAL_LOST_SEND_BUFFER_FULL);
        return SEND_BUFFER_FULL;
      } else {
        return SEND_MESSAGE_TOO_BIG;
      }

Unfortunately WARNING() would only be active in debug builds AFAICT,
so even in the absence of a "Signal to ... lost(buffer)" message in
the output log we can't simply conclude that we've been hitting
SEND_MESSAGE_TOO_BIG and not SEND_BUFFER_FULL ... :/
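
A rough sketch of why the log cannot settle this, assuming WARNING() really does compile to nothing outside debug builds. Every name below is hypothetical (SKETCH_DEBUG_BUILD, SKETCH_WARNING, SketchStatus, sketchSendFailure), and the sizeWithinLimit branch condition is an assumption about the elided part of prepareSend():

```cpp
#include <sstream>

// SKETCH_DEBUG_BUILD is a hypothetical flag, not the real NDB build switch.
#ifdef SKETCH_DEBUG_BUILD
#define SKETCH_WARNING(out, msg) do { (out) << msg << '\n'; } while (0)
#else
#define SKETCH_WARNING(out, msg) do { } while (0)  // compiled out: nothing is logged
#endif

enum class SketchStatus { BufferFull, MessageTooBig };

// Hypothetical stand-in for the tail of prepareSend() quoted above.
// Assumption: the buffer-full path is taken when the message size is within the limit.
SketchStatus sketchSendFailure(std::ostringstream& log, int nodeId, bool sizeWithinLimit) {
  (void)log;     // unused when the macro is compiled out
  (void)nodeId;
  if (sizeWithinLimit) {
    SKETCH_WARNING(log, "Signal to " << nodeId << " lost(buffer)");
    return SketchStatus::BufferFull;
  }
  return SketchStatus::MessageTooBig;
}
```

Built without the flag, both paths leave the log empty, so a missing "lost(buffer)" line does not distinguish the two return values.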

As all nodes failed on the same ndbrequire() at about the same time
my educated guess would be that SEND_MESSAGE_TOO_BIG was the reason 
for this, but I can't rule out SEND_BUFFER_FULL either ...
[23 Sep 2014 13:54] Hartmut Holzgraefe
I can provide ndb_error_reporter files if needed, but I don't think there's any more to see in those than what I already wrote ...
[14 Jul 2016 8:57] Bogdan Kecman
With the provided config I can reproduce this bug on 7.2.6, but I cannot reproduce it on 7.2.24. Also, on 7.2.6, increasing the TCP_DEFAULT values made the bug harder to reproduce (I could still reproduce it with 16M send/receive buffer memory, but not easily).
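
For readers wanting to try the same mitigation, a config.ini fragment along these lines is probably what "increasing TCP_DEFAULT values" refers to. This is an assumption: the comment does not show the actual settings, and the 16M value is only the one mentioned above, at which the crash could still be reproduced on 7.2.6.

```ini
# Assumed illustration, not the reporter's or Bogdan's actual config.
[tcp default]
SendBufferMemory=16M
ReceiveBufferMemory=16M
```

Treat this as a way to reduce the frequency of the crash on affected versions rather than a fix. Per the comment above, the bug could not be reproduced on 7.2.24.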