Bug #39867 MySQL Cluster : Failures during Blob part operations not always detected
Submitted: 5 Oct 2008 23:19
Modified: 12 Nov 2008 12:10
Reporter: Frazer Clement
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-6.2.15
OS: Any
Assigned to: Frazer Clement
CPU Architecture: Any

[5 Oct 2008 23:19] Frazer Clement
Description:
NDBAPI operations generated to read Blob parts can fail due to (for example) SendBuffer memory exhaustion in the kernel.  In telco-6.2.15, it is possible for some of these failures to go unnoticed, resulting in incorrect null values being returned to the user.

Detection and handling of errors on Blob part operations should be improved, as should the detection of these conditions in the handler.

How to repeat:
Run the attached Perl program against telco-6.2.15.

Running mysqld with --debug shows that the part operations fail with error 1218, which is recorded against the transaction.

However, this does not appear to result in the blob's 'setActiveHook' callback reporting a failure to the server.

Suggested fix:
Fix broken parts of the error reporting chain.
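
To illustrate the broken link in the chain, here is a minimal sketch in Python rather than the NDB API's own C++. FakeTransaction and FakeBlob are hypothetical stand-ins invented for this example, not NDB API classes, and the model is far simpler than the real handler. It only shows why a caller that consults the blob's own error state can return an incorrect NULL, while one that also consults the transaction sees error 1218.

```python
# Illustrative model only: FakeTransaction and FakeBlob are hypothetical
# stand-ins, not NDB API classes.

ERROR_1218 = 1218  # error code seen on the part operations in this report


class FakeTransaction:
    """Records the first error raised by any operation in the transaction."""

    def __init__(self):
        self.error_code = 0  # 0 means "no error"

    def mark_error(self, code):
        if self.error_code == 0:
            self.error_code = code


class FakeBlob:
    """A blob whose part reads may fail without the blob itself noticing."""

    def __init__(self, trans, parts, failing_part=None):
        self.trans = trans
        self.parts = parts
        self.failing_part = failing_part
        self.error_code = 0  # never set by a part failure: the broken link

    def read_data(self):
        chunks = []
        for index, part in enumerate(self.parts):
            if index == self.failing_part:
                # The part operation fails; only the transaction is marked.
                self.trans.mark_error(ERROR_1218)
                return None
            chunks.append(part)
        return b"".join(chunks)


def read_checking_blob_only(blob):
    """Trust only the blob's own error state."""
    data = blob.read_data()
    if blob.error_code != 0:
        raise RuntimeError(f"blob error {blob.error_code}")
    return data  # may be None, which surfaces as an incorrect NULL


def read_checking_transaction(blob):
    """Consult the transaction's error state as well."""
    data = blob.read_data()
    code = blob.error_code or blob.trans.error_code
    if code != 0:
        raise RuntimeError(f"read failed with error {code}")
    return data
```

With a failing part, read_checking_blob_only() silently returns None, whereas read_checking_transaction() raises with the transaction's error code.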
[5 Oct 2008 23:20] Frazer Clement
Modified example to create blob part op failure due to Send exhaustion

Attachment: blobtest-1.pl (application/x-perl, text), 1.79 KiB.

[3 Nov 2008 15:41] Frazer Clement
Proposed patch

Attachment: bug39867+39879.patch (text/x-patch), 192.77 KiB.

[3 Nov 2008 15:53] Frazer Clement
Proposed patch, which:
 1) Improves ha_ndbcluster error handling to check the operation and transaction objects when readData() returns a bad return code.
 2) Modifies LQH to send LQHKEYREF to TC (rather than TCKEYREF to the API) when a simple read fails.  This allows TC to implement AbortOnError behaviour for simple reads.  A dirty (committed) read still sends TCKEYREF directly and cannot use AbortOnError.
 3) Adds a test to MTR ndb_blob which attempts to overload the API connection with Blob reads and verifies that:
    a) If no error is reported, the data is correct.
    b) If an error is reported, it is of the correct type.
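
The two conditions checked by the test in 3) can be written as a single predicate. This is a sketch in Python, not the MTR test itself; the function name and the idea of passing the acceptable error codes as a parameter are assumptions made for illustration, and the actual set of acceptable errors is not stated here.

```python
def outcome_is_acceptable(expected, data, error_code, allowed_errors):
    """Return True if a blob read ended in one of the two permitted ways.

    error_code is None when no error was reported to the client.
    allowed_errors is the (hypothetical) set of error codes the test accepts.
    """
    if error_code is None:
        # a) No error reported: the data must be exactly what was written.
        return data == expected
    # b) An error was reported: it must be one of the expected types.
    return error_code in allowed_errors
```

The case this bug produced, no error together with a NULL or wrong value, is the one the predicate rejects: outcome_is_acceptable(b"abcd", None, None, {1218}) is False.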
[3 Nov 2008 16:17] Frazer Clement
Note that the patch relies on the 'parent object accessors' fix to Bug#40242.

Note also that this patch fixes Bug#39879.
[5 Nov 2008 9:54] Frazer Clement
Patch with extra simple read tests re-enabled

Attachment: bug39867-extratests.patch (text/x-patch), 202.81 KiB.

[5 Nov 2008 10:17] Frazer Clement
New patch with extra testcases.  
Existing Simple Read testcases re-enabled.
New testcase for a failing AbortOnError Simple Read followed by a successful normal read.  Before the fix, this testcase results in Transaction Abort due to timeout.  After the fix, it results in Transaction Abort due to Transporter overload.
An error insert is necessary for this testcase because the bug was on the early 'noFreeRecord' error handling path rather than on the later error handling paths.
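
The before and after outcomes of this testcase can be reproduced with a toy model. This is a sketch in Python under a loud assumption: it supposes that when the failure bypasses TC, TC simply keeps counting the failed read as outstanding, which is a simplification and not a description of the real kernel state. FakeCoordinator is a hypothetical stand-in for the TC block.

```python
# Toy model only: FakeCoordinator is a hypothetical stand-in for the TC block.

class FakeCoordinator:
    def __init__(self):
        self.outstanding = 0      # operations still awaiting a reply
        self.abort_reason = None  # set when an AbortOnError operation fails

    def start_operation(self):
        self.outstanding += 1

    def on_lqhkeyconf(self):
        # LQH reports success for one operation.
        self.outstanding -= 1

    def on_lqhkeyref(self, reason):
        # LQH reports failure; AbortOnError aborts the whole transaction.
        self.outstanding -= 1
        if self.abort_reason is None:
            self.abort_reason = reason

    def finish(self):
        if self.abort_reason is not None:
            return f"aborted: {self.abort_reason}"
        if self.outstanding > 0:
            # Assumed behaviour: an unanswered operation ends in a timeout.
            return "aborted: timeout"
        return "committed"


def run_testcase(failure_routed_via_tc):
    tc = FakeCoordinator()
    tc.start_operation()  # AbortOnError simple read, fails early in LQH
    if failure_routed_via_tc:
        tc.on_lqhkeyref("transporter overload")
    # Otherwise the failure goes straight to the API and TC is never told.
    tc.start_operation()  # normal read, succeeds
    tc.on_lqhkeyconf()
    return tc.finish()
```

In this model run_testcase(False) ends in a timeout abort and run_testcase(True) ends in an abort carrying the real reason, matching the two outcomes described for the testcase.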
[5 Nov 2008 19:52] Jonas Oreland
One comment: error insert 5047 is "self-cleaning" when encountered.
If this is run on a 4-node cluster, there will be lingering 5047s set on the nodes that did not encounter it,
so an insertErrorAllNodes(0) after the testcase is a good idea.

After that, OK to push.
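
A small sketch of why the cleanup matters, in Python with a hypothetical FakeCluster class (not part of the NDB test framework). The method name insert_error_all_nodes mirrors the call named in the comment above; everything else is assumed for illustration.

```python
# Hypothetical model of per-node error inserts; not the NDB test framework.

class FakeCluster:
    def __init__(self, node_count):
        # Error insert currently set on each data node (0 means none).
        self.error_inserts = {node: 0 for node in range(1, node_count + 1)}

    def insert_error_all_nodes(self, code):
        for node in self.error_inserts:
            self.error_inserts[node] = code

    def hit_error(self, node):
        """The node encounters its error insert, which then clears itself."""
        code = self.error_inserts[node]
        self.error_inserts[node] = 0
        return code

    def lingering(self):
        """Nodes that still have an error insert set."""
        return sorted(n for n, c in self.error_inserts.items() if c != 0)


cluster = FakeCluster(4)
cluster.insert_error_all_nodes(5047)
cluster.hit_error(1)                  # only node 1 runs into the error
leftover = cluster.lingering()        # nodes 2, 3 and 4 still carry 5047
cluster.insert_error_all_nodes(0)     # explicit cleanup after the testcase
```

Without the final call, later testcases in the same run would start with 5047 still armed on the nodes that never hit it.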
[6 Nov 2008 17:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58086

2719 Frazer Clement	2008-11-06
      Bug#39867 MySQL Cluster : Failures during Blob part operations not always detected
      
      2 parts : 
        1) The Ndb SQL handler (ha_ndbcluster) reported the error from the NdbBlob object
           rather than from the NdbTransaction object.  This results in inconsistent error
           messages in some cases 
        2) The NDB kernel bypassed the TC block when reporting primary key 'simple read' 
           failure in some scenarios.  This resulted in the API node not detecting 
           operation failures in some scenarios, and eventual transaction timeouts.
      
      Fixes :
        Change NDB kernel to send LQHKEYREF to TC for early simple read failure.  Direct send
        of TCKEYREF to API remains for 'dirty' read.
        Change ha_ndbcluster to obtain error information from the NdbTransaction
        object rather than the Blob object.
      
      Tests : 
        Re-enable simple read testing in testOperations and testTransactions.
        Extend testing to include Simple Reads in testNdbApi.
        Add Blob read transporter overload testcase to MTR test_blob testcase.
[6 Nov 2008 21:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58115

2722 Frazer Clement	2008-11-06
      Bug#39867 MySQL Cluster : Failures during Blob part operations not always detected
      
      Add new testcase to Daily Basic
[8 Nov 2008 20:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58252

2724 Frazer Clement	2008-11-08
      Bug#39867 MySQL Cluster : Failures during Blob part operations not always detected
      
      2 parts : 
        1) The Ndb SQL handler (ha_ndbcluster) reported the error from the NdbBlob object
           rather than from the NdbTransaction object.  This results in inconsistent error
           messages in some cases 
        2) The NDB kernel bypassed the TC block when reporting primary key 'simple read' 
           failure in some scenarios.  This resulted in the API node not detecting 
           operation failures in some scenarios, and eventual transaction timeouts.
      
      Fixes :
        Change NDB kernel to send LQHKEYREF to TC for early simple read failure.  Direct send
        of TCKEYREF to API remains for 'dirty' read.
        Change ha_ndbcluster to obtain error information from the NdbTransaction
        object rather than the Blob object.
      
      Tests : 
        Re-enable simple read testing in testOperations and testTransactions.
        Extend testing to include Simple Reads in testNdbApi.
        Add Blob read transporter overload testcase to MTR test_blob testcase.
[8 Nov 2008 21:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58255

2727 Frazer Clement	2008-11-08
      Bug#39867 MySQL Cluster : Failures during Blob part operations not always detected
      
      Add new testcase to Daily Basic
[8 Nov 2008 22:19] Bugs System
Pushed into 5.1.29-ndb-6.4.0  (revid:frazer@mysql.com-20081108210806-iiu8s4ytv8gvbo98) (version source revid:frazer@mysql.com-20081108214303-z8nr2z5c1yccxac8) (pib:5)
[8 Nov 2008 22:43] Bugs System
Pushed into 5.1.29-ndb-6.2.17  (revid:frazer@mysql.com-20081108210806-iiu8s4ytv8gvbo98) (version source revid:frazer@mysql.com-20081108210806-iiu8s4ytv8gvbo98) (pib:5)
[8 Nov 2008 22:45] Bugs System
Pushed into 5.1.29-ndb-6.3.19  (revid:frazer@mysql.com-20081108210806-iiu8s4ytv8gvbo98) (version source revid:frazer@mysql.com-20081108212257-xppq7h6xmg3wduzp) (pib:5)
[12 Nov 2008 12:10] Jon Stephens
Documented in the ndb-6.2.17 and ndb-6.3.19 changelogs as follows:

        Failed operations on BLOB and TEXT columns were not always reported
        correctly to the originating SQL node.
[12 Nov 2008 13:18] Jon Stephens
Combined changelog entry with entry for Bug#39879, updated entry to read as follows:

        Failed operations on BLOB and TEXT columns were not always
        reported correctly to the originating SQL node. Such errors
        were sometimes reported as being due to timeouts, when the
        actual problem was a transporter overload due to insufficient
        buffer space.
[12 Dec 2008 23:28] Bugs System
Pushed into 6.0.9-alpha  (revid:frazer@mysql.com-20081108210806-iiu8s4ytv8gvbo98) (version source revid:tomas.ulin@sun.com-20081209185954-9svcixh2p5hsfi6w) (pib:5)