Bug #39879 NDBAPI : Wrong error reported for SendBuffer overload
Submitted: 6 Oct 2008 11:57
Modified: 12 Nov 2008 13:16
Reporter: Frazer Clement
Status: Closed
Category: MySQL Cluster: NDB API
Severity: S3 (Non-critical)
Version: 5.1-telco-6.2.15
OS: Any
Assigned to: Frazer Clement
CPU Architecture: Any

[6 Oct 2008 11:57] Frazer Clement
Bug 39867 occurs when messages due to Blob part operations are discarded within the Ndb kernel, and the handler's ActiveHook callback does not correctly indicate the error to the upper layers.

Modifying the ActiveHook to notify the upper layers of an error from the NdbTransaction shows that the error on the transaction is 1297, Time-out in NDB, probably caused by deadlock.

This message indicates that TC (the transaction coordinator) has timed out waiting to hear back from the API.

Looking at debug trace output from mysqld, it appears that mysqld is still waiting for the results of all of its submitted operations from TC when it receives a TCROLLBACKREP, which probably carries the timeout error code.

The timeout error is confusing for users: it suggests that a locking issue may be at fault, when in reality the cause is a SendBuffer configuration problem.

The system should indicate the true source of the problem in this case.

How to repeat:
Run the example program from Bug#39867 against mysql-5.1-telco-6.2.15 with the default SendBuffer size.

Modify ha_ndbcluster.cc to use the error code from the transaction when readData() fails and the Blob object has no error (see below).

In cases where SendBuffer overload occurs, a timeout is given as the transaction failure reason.

=== modified file 'sql/ha_ndbcluster.cc'
--- sql/ha_ndbcluster.cc        2008-02-13 13:42:22 +0000
+++ sql/ha_ndbcluster.cc        2008-10-06 09:07:30 +0000
@@ -806,6 +806,13 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
     ha->m_blobs_buffer_size= ha->m_blob_total_size;

+  if (unlikely(ha->m_thd_ndb == NULL))
+  {
+    DBUG_RETURN(-1);
+  }
+  NdbTransaction* trans= ha->m_thd_ndb->trans;
     Now read all blob data.
     If we know the destination mysqld row, we also set the blob null bit and
@@ -836,7 +843,9 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
       uchar *buf= ha->m_blobs_buffer + offset;
       uint32 len= ha->m_blobs_buffer_size - offset;
       if (ndb_blob->readData(buf, len) != 0)
-          ERR_RETURN(ndb_blob->getNdbError());
+        ERR_RETURN((ndb_blob->getNdbError().code == 0)?
+                   trans->getNdbError():
+                   ndb_blob->getNdbError());
       DBUG_PRINT("info", ("[%u] offset: %u  buf: 0x%lx  len=%u",
                           i, offset, (long) buf, len));
       DBUG_ASSERT(len == len64);
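
The fallback in the patch above can be sketched in isolation. This is only an illustration, not the handler code: MockError, MockBlob, MockTransaction and pick_error are hypothetical stand-ins introduced here so that the snippet compiles without the NDB API headers. The only behaviour taken from the patch is the rule "use the Blob's error unless its code is 0, otherwise use the transaction's error".

```cpp
#include <string>

// Hypothetical stand-in for NdbError; only the fields used in this sketch.
struct MockError {
  int code;             // 0 means "no error recorded", as assumed by the patch
  std::string message;
};

// Hypothetical stand-in for NdbBlob.
struct MockBlob {
  MockError err;
  const MockError& getNdbError() const { return err; }
};

// Hypothetical stand-in for NdbTransaction.
struct MockTransaction {
  MockError err;
  const MockError& getNdbError() const { return err; }
};

// Prefer the Blob's own error; fall back to the transaction's error when
// the Blob has none recorded (code == 0). Mirrors the ERR_RETURN change.
inline const MockError& pick_error(const MockBlob& blob,
                                   const MockTransaction& trans) {
  return (blob.getNdbError().code == 0) ? trans.getNdbError()
                                        : blob.getNdbError();
}
```

With a Blob whose error code is 0 and a transaction carrying 1297, pick_error returns the transaction's error, which is how the timeout becomes visible to the upper layers in this report. If the Blob has a nonzero code of its own, that error is returned unchanged.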

Suggested fix:
Determine why the API is still waiting to hear from the kernel (were not all TCKEYREFs sent or received?)
[6 Oct 2008 11:57] Frazer Clement
MySQLD trace of blob error handling

Attachment: blob_bug_trace.txt (text/plain), 39.82 KiB.

[3 Nov 2008 16:20] Frazer Clement
Bug#39867 contains a proposed patch to fix this bug.
[12 Nov 2008 13:16] Jon Stephens
See Bug#39867 for changelog entry for this fix.