Bug #28443 transporter gets stuck when >1024 signals received at once
Submitted: 15 May 2007 13:23 Modified: 30 May 2007 17:42
Reporter: Kristian Nielsen
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 4.1, 5.0, 5.1 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[15 May 2007 13:23] Kristian Nielsen
Description:
Found this while tracking down occasional rpl_ndb_row_001 test failure.

A tcp dump shows the API node (mysqld) sending > 1024 signals to ndbd in one batch. But a signal trace from ndbrequire() shows that only the first 1024 signals are processed; then the ndbd kernel hangs waiting for an ATTRINFO belonging to the last TCKEYREQ received, and eventually times out the API connection.

The value 1024 is the value of MAX_RECEIVED_SIGNALS in storage/ndb/src/common/transporter/Packer.cpp.

From a quick look, it appears the problem is the following:

TransporterRegistry::performReceive() is called when select() says data on the socket is ready.

It reads >1024 signals into the TCP transporter read buffer, then calls TransporterRegistry::unpack().

unpack() unpacks and executes the first MAX_RECEIVED_SIGNALS signals from the buffer, then returns without handling the rest of the buffer.

Now the transporter goes to select() on the socket again, waiting for new data to arrive, even though there is still unhandled signal data in the buffer. It does not call unpack() again until new data is read (which could be forever), as far as I can tell from the code:

	  const int receiveSize = t->doReceive();
	  if(receiveSize > 0)
	  {
	    Uint32 * ptr;
	    Uint32 sz = t->getReceiveData(&ptr);
	    transporter_recv_from(callbackObj, nodeId);
	    Uint32 szUsed = unpack(ptr, sz, nodeId, ioStates[nodeId]);
	    t->updateReceiveDataPtr(szUsed);
	  }
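
To make the failure mode concrete, here is a minimal toy model of the behaviour described above. It is a sketch only, not the NDB code: ToyReceiveBuffer, toy_unpack, toy_receive_pass and toy_stuck_after_batch are hypothetical names, and signals are reduced to simple counts. The assumption is that unpack runs only in a pass where the socket delivered new data, and that it executes at most a fixed number of signals per call.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a transporter receive buffer; only signal counts are tracked.
struct ToyReceiveBuffer {
    std::size_t pending = 0;   // signals buffered but not yet executed
    std::size_t executed = 0;  // signals executed so far
};

// Models unpack(): executes at most max_per_call signals, leaves the rest.
std::size_t toy_unpack(ToyReceiveBuffer& buf, std::size_t max_per_call) {
    assert(max_per_call > 0);
    const std::size_t n = buf.pending < max_per_call ? buf.pending : max_per_call;
    buf.pending -= n;
    buf.executed += n;
    return n;
}

// Models one pass of the receive loop as described in the report:
// unpack runs only if the socket delivered new data in this pass.
void toy_receive_pass(ToyReceiveBuffer& buf, std::size_t arrived,
                      std::size_t max_per_call) {
    if (arrived > 0) {
        buf.pending += arrived;
        toy_unpack(buf, max_per_call);
    }
}

// Delivers one batch, then lets the socket stay idle for idle_passes passes.
// Returns the number of signals still stuck in the buffer afterwards.
std::size_t toy_stuck_after_batch(std::size_t batch, std::size_t max_per_call,
                                  int idle_passes) {
    ToyReceiveBuffer buf;
    toy_receive_pass(buf, batch, max_per_call);
    for (int i = 0; i < idle_passes; ++i) {
        toy_receive_pass(buf, 0, max_per_call);
    }
    return buf.pending;
}
```

In this model, toy_stuck_after_batch(1500, 1024, 10) leaves 476 signals in the buffer no matter how many idle passes follow, which matches the hang seen in the signal trace: the remaining signals are only executed if the peer happens to send more data.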

How to repeat:
A simple way to repeat is to reduce the value of MAX_RECEIVED_SIGNALS in storage/ndb/src/common/transporter/Packer.cpp. For example, when I set it to 16, mysql-test-run.pl fails to even start up the cluster.
[15 May 2007 15:01] Jonas Oreland
HeartbeatIntervalDbApi=30000
ReceiveBufferMemory=5M

start 1 node
create_tab T1
hugoLoad -b 1000 -r 25000 T1
[18 May 2007 9:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26947

ChangeSet@1.2562, 2007-05-18 09:48:52+02:00, jonas@perch.ndb.mysql.com +6 -0
  ndb - bug#28443
    Make sure that data can not be left lingering in receive buffer
[18 May 2007 12:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26978

ChangeSet@1.2564, 2007-05-18 11:34:57+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#28443
    review comment 2, at least 1 signal needed for test program
[18 May 2007 12:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26980

ChangeSet@1.2563, 2007-05-18 11:06:03+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#28443
    review comment
    if some tcp-transporter has data, then do select with timeout 0
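
The patch itself is not reproduced in this thread, so the following is only a sketch of the idea stated in the review comment, under assumptions: while any transporter still has buffered data, poll the sockets with a timeout of 0 instead of blocking, and unpack buffered data on every pass. ToyTransporter, any_has_buffered_data, choose_poll_timeout_ms, receive_pass and passes_to_drain are hypothetical names, not functions from the NDB source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-transporter state: count of buffered, unexecuted signals.
struct ToyTransporter {
    std::size_t pending = 0;
};

// True if any transporter still holds unprocessed signals.
bool any_has_buffered_data(const std::vector<ToyTransporter>& ts) {
    for (const ToyTransporter& t : ts) {
        if (t.pending > 0) return true;
    }
    return false;
}

// Timeout for the next socket poll: 0 (do not block) while buffered data
// remains, otherwise the caller's requested timeout.
int choose_poll_timeout_ms(const std::vector<ToyTransporter>& ts,
                           int requested_ms) {
    return any_has_buffered_data(ts) ? 0 : requested_ms;
}

// One receive pass: every transporter with buffered data is unpacked,
// whether or not new bytes arrived, up to max_per_call signals each.
void receive_pass(std::vector<ToyTransporter>& ts, std::size_t max_per_call) {
    assert(max_per_call > 0);
    for (ToyTransporter& t : ts) {
        const std::size_t n = t.pending < max_per_call ? t.pending : max_per_call;
        t.pending -= n;
    }
}

// Counts the passes needed to empty all buffers with no further socket input.
int passes_to_drain(std::vector<ToyTransporter> ts, std::size_t max_per_call) {
    int passes = 0;
    while (choose_poll_timeout_ms(ts, 1000) == 0) {
        receive_pass(ts, max_per_call);
        ++passes;
    }
    return passes;
}
```

With one transporter holding 1500 buffered signals and a per-call limit of 1024, this model drains the buffer in two passes without waiting for new socket data, and falls back to the normal blocking timeout once every buffer is empty.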
[23 May 2007 8:23] Bugs System
Pushed into 4.1.23
[23 May 2007 8:23] Bugs System
Pushed into 5.1.19-beta
[23 May 2007 8:24] Bugs System
Pushed into 5.0.44
[29 May 2007 5:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/27527

ChangeSet@1.2147, 2007-05-29 07:35:04+02:00, jonas@perch.ndb.mysql.com +6 -0
  ndb - bug#28443 (wl2325-5.0)
      Make sure that data can not be left lingering in receive buffer
[30 May 2007 15:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/27687

ChangeSet@1.2408, 2007-05-30 17:25:22+02:00, tomas@whalegate.ndb.mysql.com +6 -0
  Bug #28443
  - correction of merge error
[30 May 2007 17:42] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 4.1.23/5.0.44/5.1.19 changelogs.
[11 Jun 2007 11:39] Bugs System
Pushed into 5.1.20-beta
[11 Jun 2007 11:41] Bugs System
Pushed into 5.0.44
[3 Jul 2007 6:42] Jon Stephens
Also documented for telco-6.2.3 release.