Description:
Bug 39867 occurs when messages due to Blob part operations are discarded within the Ndb
kernel, and the handler's ActiveHook callback does not correctly indicate the error to
the upper layers.
Modifying the ActiveHook to notify the upper layers of an error from the NdbTransaction
shows that the error on the transaction is 1297, "Time-out in NDB, probably caused by
deadlock".
This message indicates that TC has timed out waiting to hear back from the API.
Debug trace output from the MySQLD suggests that the MySQLD is still waiting for the
results of all of its submitted operations from TC when it receives a TCROLLBACKREP,
probably carrying the timeout error code.
This is confusing for users, as it indicates that some locking issue may be at fault,
when in reality it is a buffer configuration problem.
The system should indicate the true source of the problem in this case.
How to repeat:
Run the example program from bug#39867 against mysql-5.1-telco-6.2.15 with the default
SendBuffer size.
Modify ha_ndbcluster.cc to use the error code from the transaction when readData() fails
and the Blob object has no error (see below).
In cases where SendBuffer overload occurs, a timeout is given as the transaction failure
reason.
=== modified file 'sql/ha_ndbcluster.cc'
--- sql/ha_ndbcluster.cc 2008-02-13 13:42:22 +0000
+++ sql/ha_ndbcluster.cc 2008-10-06 09:07:30 +0000
@@ -806,6 +806,13 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
     ha->m_blobs_buffer_size= ha->m_blob_total_size;
   }
 
+  if (unlikely(ha->m_thd_ndb == NULL))
+  {
+    DBUG_ASSERT(FALSE);
+    DBUG_RETURN(-1);
+  }
+  NdbTransaction* trans= ha->m_thd_ndb->trans;
+
   /*
     Now read all blob data.
     If we know the destination mysqld row, we also set the blob null bit and
@@ -836,7 +843,9 @@ int g_get_ndb_blobs_value(NdbBlob *ndb_b
         uchar *buf= ha->m_blobs_buffer + offset;
         uint32 len= ha->m_blobs_buffer_size - offset;
         if (ndb_blob->readData(buf, len) != 0)
-          ERR_RETURN(ndb_blob->getNdbError());
+          ERR_RETURN((ndb_blob->getNdbError().code == 0)?
+                     trans->getNdbError():
+                     ndb_blob->getNdbError());
         DBUG_PRINT("info", ("[%u] offset: %u buf: 0x%lx len=%u",
                             i, offset, (long) buf, len));
         DBUG_ASSERT(len == len64);
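
The error selection made by the patched ERR_RETURN can be tried in isolation with the
sketch below. It is only an illustration: SketchError, SketchBlob, SketchTransaction and
effectiveError() are hypothetical stand-ins invented for this example, not NDB API types,
and the sketch assumes that an error code of 0 means "no error recorded", as the patch does.

```cpp
#include <string>

// Hypothetical stand-in for NdbError: only the fields this sketch needs.
struct SketchError {
  int code;             // assumed: 0 means "no error recorded"
  std::string message;
};

// Hypothetical stand-in for NdbBlob.
struct SketchBlob {
  SketchError error;
  const SketchError& getNdbError() const { return error; }
};

// Hypothetical stand-in for NdbTransaction.
struct SketchTransaction {
  SketchError error;
  const SketchError& getNdbError() const { return error; }
};

// Prefer the Blob's own error. Fall back to the transaction's error when the
// Blob recorded none (code == 0), mirroring the ternary in the patch.
inline const SketchError& effectiveError(const SketchBlob& blob,
                                         const SketchTransaction& trans) {
  return (blob.getNdbError().code == 0) ? trans.getNdbError()
                                        : blob.getNdbError();
}
```

With a Blob whose error code is 0 and a transaction carrying 1297, effectiveError()
returns the transaction's error, which is how the timeout became visible in this report.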
Suggested fix:
Determine why the API is still waiting to hear from the kernel (are some TCKEYREFs not
being sent or received?), then fix the underlying cause.