MySQL Bugs: #74829: ndb.ndb_restore_conv

Bug #74829	ndb.ndb_restore_conv_more fails on POWER8
Submitted:	13 Nov 2014 3:47	Modified:	18 Jan 2016 20:59
Reporter:	Stewart Smith	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.3.7	OS:	Linux
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	ndb, POWER8, PowerPC

Description:
On big endian ppc64, the ndb.ndb_restore_conv_more test fails with ndb_restore aborting.

tail of var/log/ndb.ndb_restore_conv_more/tmp/ndb_testrun.log:

Processing data in table: test/def/NDB$BLOB_14_14(24) fragment 6
Opening file '/home/stewart/mysql-cluster-gpl-7.3.7/build/mysql-test/var/mysql_cluster.1//BACKUP/BACKUP-11/BACKUP-11.1.log'
File size 1024 bytes
Restored 155 tuples and 0 log entries
bh1->readData -8
/home/stewart/mysql-cluster-gpl-7.3.7/storage/ndb/tools/ndb_lib_move_data.cpp:1054: require((code != 0)) failed

The -8 there is from me modifying the error codes to find out exactly where it fails.

It asserts on ndb_lib_move_data.cpp:728, which is this:

      CHK2(bh1->readData(data, bytes) == 0, (bh1->getNdbError()));

Which means the blob is not null but we failed to read some data.

Adding in my marvellous debug code to find out where, I can see we're failing in NdbBlob::readDataPrivate, reading the last block:

  if (len > 0) {
    // partial last block                                                        
    DBUG_PRINT("info", ("partial last block pos=%llu len=%u", pos, len));
    assert((pos - theInlineSize) % thePartSize == 0 && len < thePartSize);
    Uint32 part = getPartNumber(pos);
    Uint16 sz = 0;
    if (readPart(thePartBuf.data, part, sz) == -1)
      DBUG_RETURN(-7);
    // need result now                                                           
    if (executePendingBlobReads() == -1)
      DBUG_RETURN(-8);

(I replaced return -1 with numbers to trace it), so (for some reason) we're failing executePendingBlobReads().

adding in:
fprintf(stderr, "part %u partsz %u pos %u sz %u len %u\n", part, thePart\
Size, pos, sz, len);
there, I get this just before the assert:
part 2 partsz 2000 pos 4256 sz 0 len 744

for the record, gdb "p *bh1" is:
$2 = {theBlobVersion = 2, theFixedDataFlag = false, theHeadSize = 16, 
  theVarsizeBytes = 2, theState = NdbBlob::Active, theNdbRecordFlag = true, 
  theEventBlobVersion = -1, theBtColumnNo = {-1, -1, 1, 2, 3}, 
  theNdb = 0x10037403b10, theNdbCon = 0x100374b5d10, theNdbOp = 0x1003746e0a0, 
  theEventOp = 0x0, theBlobEventOp = 0x0, theBlobEventPkRecAttr = 0x0, 
  theBlobEventDistRecAttr = 0x0, theBlobEventPartRecAttr = 0x0, 
  theBlobEventPkidRecAttr = 0x0, theBlobEventDataRecAttr = 0x0, theTable = 
    0x1003744d640, theAccessTable = 0x1003744d640, 
  theBlobTable = 0x100374622f0, theColumn = 0x1003744d030, 
  theFillChar = 255 '\377', theInlineSize = 256, thePartSize = 2000, 
  theStripeSize = 0, theGetFlag = false, theGetBuf = 0x0, theSetFlag = false, 
  theSetValueInPreExecFlag = false, theSetBuf = 0x0, theGetSetBytes = 0, 
  thePendingBlobOps = 1 '\001', theActiveHook = 0x0, theActiveHookArg = 0x0, 
  theKeyBuf = {data = 0x1003744bdc0 "\003", size = 4, maxsize = 8}, 
  theAccessKeyBuf = {data = 0x1003745dc60 "", size = 4, maxsize = 8}, 
  thePackKeyBuf = {data = 0x1003744d8a0 "\003", size = 4, maxsize = 8}, 
  theHeadInlineBuf = {data = 0x1003746eab0 "\016\001", size = 272, 
    maxsize = 272}, theHeadInlineCopyBuf = {data = 0x1003746f3b0 "", 
    size = 272, maxsize = 272}, thePartBuf = {data = 0x1003746ebd0 "", 
    size = 2000, maxsize = 2000}, thePartLen = 0, theBlobEventDataBuf = {
    data = 0x0, size = 0, maxsize = 0}, theBlobEventDistValue = 16842752, 
  theBlobEventPartValue = 927304144, theBlobEventPkidValue = 4294967295, 
  theHead = {varsize = 270, reserved = 0, pkid = 0, length = 5000, 
    headsize = 16}, 
  theInlineData = 0x1003746eac0 " -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.;"..., 
  theHeadInlineRecAttr = 0x1003746f4d0, theHeadInlineReadOp = 0x0, 
  theHeadInlineUpdateFlag = false, userDefinedPartitioning = false, 
  thePartitionId = 4294967295, thePartitionIdRecAttr = 0x0, theNullFlag = 0, 
  theLength = 5000, thePos = 0, theError = {status = NdbError::Success, 
    classification = NdbError::NoError, code = 0, mysql_code = 0, 
    message = 0x10235d30 "No error", details = 0x0}, theNext = 0x1003746f530}

How to repeat:
run ndb.ndb_restore_conv_more on POWER8 (maybe others? i haven't tried)

Suggested fix:
Find the bug and fix it :)

This test passes with 7.3.8 and on *LITTLE* endian ppc64.

Unless there is a plat to support NDB on any big endian platform (are any current platforms big endian?) it may be better to either close this or just add a configure check to no longer build on big endian platforms.

Pretty sure SPARC is still big endian.

Hi Stewart,

Using 7.3.7 I've verified this bug almost as described. System used was qemu-system-ppc64 running Debian Jessie. Due to this I also had to update some timeout variables, both for ndbd and mtr. I also had to copy storage/ndb/src/kernel/vm/mt-asm.h from 7.4.9 to get it to compile at all.

$ ./mysql-test/mtr ndb.ndb_restore_conv_more |& tee ndb.ndb_restore_conv_more.log

.
.
ndb.ndb_restore_conv_more                [ fail ]
        Test ended at 2015-12-18 16:29:44

CURRENT_TEST: ndb.ndb_restore_conv_more
/home/gustaf/src/mysql-cluster-gpl-7.3.7/storage/ndb/tools/ndb_lib_move_data.cpp:1048: require((code != 0)) failed

   1043 void
   1044 Ndb_move_data::set_error_code(int code, const char* fmt, ...)
   1045 {
   1046   va_list ap;
   1047   va_start(ap, fmt);
   1048   require(code != 0);
   1049   Error& e = m_error;
   1050   e.code = code;
   1051   my_vsnprintf(e.message, sizeof(e.message), fmt, ap);
   1052   va_end(ap);
   1053 }

$ tail mysql-test/var/log/ndb.ndb_restore_conv_more/tmp/ndb_testrun.log
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 1
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 3
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 5
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 7
Opening file '/home/gustaf/src/mysql-cluster-gpl-7.3.7/build/mysql-test/var/mysql_cluster.1//BACKUP/BACKUP-8/BACKUP-8.2.log'
File size 1024 bytes
Restored 590 tuples and 0 log entries

$ cat /proc/cpuinfo 
processor	: 0
cpu		: POWER7 (raw), altivec supported
clock		: 1000.000000MHz
revision	: 2.3 (pvr 003f 0203)

timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)

If I instead try the current version 7.4.9 I cannot reproduce the bug:
ndb.ndb_restore_conv_more                [ pass ]  7769970
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 7769.970 of 8369 seconds executing testcases

Completed: All 1 tests were successful.

real	139m29.313s
user	46m59.335s
sys	91m31.128s

Could you try the current GA too?

/Gustaf

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".