Bug #28443 transporter gets stuck when >1024 signals received at once
Submitted: 15 May 2007 15:23
Modified: 30 May 2007 19:42
Reporter: Kristian Nielsen
Status: Closed
Category: Server: Cluster
Severity: S2 (Serious)
Version: 4.1, 5.0, 5.1
OS: Any
Assigned to: Jonas Oreland
Target Version:

[15 May 2007 15:23] Kristian Nielsen
Description:
Found this while tracking down an occasional rpl_ndb_row_001 test failure.

A tcp dump shows the API node (mysqld) sending > 1024 signals to ndbd in one batch. But a
signal trace from ndbrequire() shows that only the first 1024 signals are processed; then
the ndbd kernel hangs waiting for an ATTRINFO belonging to the last TCKEYREQ received,
and eventually times out the API connection.

The value 1024 is the value of MAX_RECEIVED_SIGNALS in
storage/ndb/src/common/transporter/Packer.cpp.

From a quick look, it appears the problem is the following:

TransporterRegistry::performReceive() is called when select() says data on the socket is
ready.

It reads >1024 signals into the TCP transporter read buffer, then calls
TransporterRegistry::unpack().

unpack() unpacks and executes the first MAX_RECEIVED_SIGNALS signals from the buffer, then
returns without handling the rest of the buffer.

Now the transporter goes to select() on the socket again, waiting for new data to arrive,
even though there is still unhandled signal data in the buffer. It does not call unpack()
again until new data is read (which could be forever), as far as I can tell from the
code:

	  const int receiveSize = t->doReceive();
	  if(receiveSize > 0)
	  {
	    Uint32 * ptr;
	    Uint32 sz = t->getReceiveData(&ptr);
	    transporter_recv_from(callbackObj, nodeId);
	    Uint32 szUsed = unpack(ptr, sz, nodeId, ioStates[nodeId]);
	    t->updateReceiveDataPtr(szUsed);
	  }
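
To make the stall concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: ToyTransporter, performReceiveOnce() and demoStall() are hypothetical stand-ins, not the real NDB classes, and the cap of 1024 simply mirrors MAX_RECEIVED_SIGNALS. The buffer is modelled as a plain signal count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for MAX_RECEIVED_SIGNALS in Packer.cpp.
constexpr std::size_t kMaxReceivedSignals = 1024;

// Toy model of a transporter; the receive buffer is just a signal count.
struct ToyTransporter {
  std::size_t buffered = 0;  // signals still sitting in the receive buffer
  std::size_t executed = 0;  // signals already unpacked and executed

  // Models doReceive(): n new signals arrive from the socket.
  void receive(std::size_t n) { buffered += n; }

  // Models unpack(): handles at most kMaxReceivedSignals per call.
  std::size_t unpack() {
    const std::size_t n = std::min(buffered, kMaxReceivedSignals);
    assert(n <= buffered);
    buffered -= n;
    executed += n;
    return n;
  }
};

// Models the receive path as described in this report: unpack() is only
// called when new data was actually read from the socket.
void performReceiveOnce(ToyTransporter& t, std::size_t arrived) {
  if (arrived > 0) {
    t.receive(arrived);
    t.unpack();
  }
}

// One batch of 1500 signals, then a wake-up with no new data.
// Returns the number of signals left stranded in the buffer.
std::size_t demoStall() {
  ToyTransporter t;
  performReceiveOnce(t, 1500);  // batch larger than the cap
  performReceiveOnce(t, 0);     // nothing new arrived: unpack() is skipped
  return t.buffered;
}
```

In this model demoStall() returns 476: the first 1024 signals are executed and the remaining 476 stay in the buffer until more data happens to arrive on the socket.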

How to repeat:
A simple way to repeat is to reduce the value of MAX_RECEIVED_SIGNALS in
storage/ndb/src/common/transporter/Packer.cpp. For example, when I set it to 16, mysql-test-run.pl
fails to even start up the cluster.
[15 May 2007 17:01] Jonas Oreland
HeartbeatIntervalDbApi=30000
ReceiveBufferMemory=5M

start 1 node
create_tab T1
hugoLoad -b 1000 -r 25000 T1
[18 May 2007 11:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26947

ChangeSet@1.2562, 2007-05-18 09:48:52+02:00, jonas@perch.ndb.mysql.com +6 -0
  ndb - bug#28443
    Make sure that data can not be left lingering in receive buffer
[18 May 2007 14:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26978

ChangeSet@1.2564, 2007-05-18 11:34:57+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#28443
    review comment 2, at least 1 signal needed for test prg
[18 May 2007 14:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26980

ChangeSet@1.2563, 2007-05-18 11:06:03+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#28443
    review comment
    if some tcp-transporter has data, then do select with timeout 0
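
The idea in this review comment can be sketched as follows. This is not the committed patch, only a toy illustration under stated assumptions: ToyTransporter, chooseTimeoutMs(), performReceiveRound() and runUntilIdle() are hypothetical names, the receive buffer is modelled as a signal count, and the cap of 1024 mirrors MAX_RECEIVED_SIGNALS.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for MAX_RECEIVED_SIGNALS in Packer.cpp.
constexpr std::size_t kMaxReceivedSignals = 1024;

// Toy model of a TCP transporter; the receive buffer is a signal count.
struct ToyTransporter {
  std::size_t buffered;  // signals still sitting in the receive buffer
  std::size_t executed;  // signals already unpacked and executed
};

// True if any transporter still holds unprocessed signal data.
bool anyHasData(const std::vector<ToyTransporter>& ts) {
  return std::any_of(ts.begin(), ts.end(),
                     [](const ToyTransporter& t) { return t.buffered > 0; });
}

// Timeout to pass to select(): 0 (do not block) while buffered data
// remains, otherwise the normal idle timeout.
int chooseTimeoutMs(const std::vector<ToyTransporter>& ts, int idleTimeoutMs) {
  return anyHasData(ts) ? 0 : idleTimeoutMs;
}

// One receive round: unpack every transporter that has buffered data,
// whether or not new bytes arrived, at most kMaxReceivedSignals each.
void performReceiveRound(std::vector<ToyTransporter>& ts) {
  for (ToyTransporter& t : ts) {
    const std::size_t n = std::min(t.buffered, kMaxReceivedSignals);
    t.buffered -= n;
    t.executed += n;
  }
}

// Runs rounds until select() would be allowed to block again.
// Returns the number of rounds needed to drain all buffers.
std::size_t runUntilIdle(std::vector<ToyTransporter>& ts, int idleTimeoutMs) {
  std::size_t rounds = 0;
  while (chooseTimeoutMs(ts, idleTimeoutMs) == 0) {
    performReceiveRound(ts);
    ++rounds;
  }
  return rounds;
}
```

With a single transporter holding 1500 buffered signals, this model drains the buffer in two rounds (1024, then 476) before the idle timeout is used again, so no data is left waiting on a blocking select().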
[23 May 2007 10:23] Bugs System
Pushed into 4.1.23
[23 May 2007 10:23] Bugs System
Pushed into 5.1.19-beta
[23 May 2007 10:24] Bugs System
Pushed into 5.0.44
[29 May 2007 7:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/27527

ChangeSet@1.2147, 2007-05-29 07:35:04+02:00, jonas@perch.ndb.mysql.com +6 -0
  ndb - bug#28443 (wl2325-5.0)
      Make sure that data can not be left lingering in receive buffer
[30 May 2007 17:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/27687

ChangeSet@1.2408, 2007-05-30 17:25:22+02:00, tomas@whalegate.ndb.mysql.com +6 -0
  Bug #28443
  - correction of merge error
[30 May 2007 19:42] Jon Stephens
Thank you for your bug report. A fix for this issue has been committed to our source
repository for that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available
version, including the bug fix. More information about accessing the source trees is
available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 4.1.23/5.0.44/5.1.19 changelogs.
[11 Jun 2007 13:39] Bugs System
Pushed into 5.1.20-beta
[11 Jun 2007 13:41] Bugs System
Pushed into 5.0.44
[3 Jul 2007 8:42] Jon Stephens
Also documented for telco-6.2.3 release.