Description:
Bug 39867 occurs when messages due to Blob part operations are discarded within the Ndb kernel, and the handler's ActiveHook callback does not correctly indicate the error to the upper layers.
Modifying the ActiveHook to notify the upper layers of an error from the NdbTransaction shows that the error on the transaction is 1297, Time-out in NDB, probably caused by deadlock.
This message indicates that TC has timed out waiting to hear back from the API.
Debug trace output from mysqld suggests that mysqld is still waiting to hear the results of all of its submitted operations from TC when it receives a TCROLLBACKREP, probably carrying the timeout error code.
This is confusing for users, as it indicates that some locking issue may be at fault, when in reality it is a buffer configuration problem.
The system should indicate the true source of the problem in this case.
How to repeat:
Run the example program from bug#39867 against mysql-5.1-telco-6.2.15 with the default SendBuffer size.
Modify ha_ndbcluster.cc to use the error code from the transaction when readData() fails and the Blob object has no error (see below).
In cases where SendBuffer overload occurs, a timeout is given as the transaction failure reason.
=== modified file 'sql/ha_ndbcluster.cc'
--- sql/ha_ndbcluster.cc 2008-02-13 13:42:22 +0000
+++ sql/ha_ndbcluster.cc 2008-10-06 09:07:30 +0000
@@ -806,6 +806,13 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
     ha->m_blobs_buffer_size= ha->m_blob_total_size;
   }
 
+  if (unlikely(ha->m_thd_ndb == NULL))
+  {
+    DBUG_ASSERT(FALSE);
+    DBUG_RETURN(-1);
+  }
+  NdbTransaction* trans= ha->m_thd_ndb->trans;
+
   /*
     Now read all blob data.
     If we know the destination mysqld row, we also set the blob null bit and
@@ -836,7 +843,9 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
         uchar *buf= ha->m_blobs_buffer + offset;
         uint32 len= ha->m_blobs_buffer_size - offset;
         if (ndb_blob->readData(buf, len) != 0)
-          ERR_RETURN(ndb_blob->getNdbError());
+          ERR_RETURN((ndb_blob->getNdbError().code == 0)?
+                     trans->getNdbError():
+                     ndb_blob->getNdbError());
         DBUG_PRINT("info", ("[%u] offset: %u buf: 0x%lx len=%u",
                             i, offset, (long) buf, len));
         DBUG_ASSERT(len == len64);
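
The fallback rule used in the patch above can be looked at in isolation. The following is only a sketch: SketchError, SketchBlob, SketchTransaction and select_read_error() are hypothetical stand-ins invented for illustration, not NDB API classes. Only the selection rule (use the Blob's error unless its code is 0, in which case use the transaction's error) mirrors the patch.

```cpp
#include <string>

// Hypothetical stand-in for an NDB error object (not the real NdbError type).
struct SketchError {
  int code;             // 0 means "no error recorded on this object"
  std::string message;
};

// Hypothetical stand-ins for the transaction and the Blob handle.
struct SketchTransaction { SketchError error; };
struct SketchBlob { SketchError error; };

// Called only after a read has already failed.
// Prefer the Blob's own error; if the Blob recorded none (code 0),
// fall back to the error stored on the owning transaction.
inline const SketchError& select_read_error(const SketchBlob& blob,
                                            const SketchTransaction& trans) {
  return (blob.error.code == 0) ? trans.error : blob.error;
}
```

With this rule, a failed read on a Blob whose error code is 0 reports the transaction's error (1297 in the case described here) instead of returning an empty error to the caller.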
Suggested fix:
Determine why the API is still waiting to hear from the kernel (not all TCKEYREFs sent/received?)
Fix.