Bug #74829 ndb.ndb_restore_conv_more fails on POWER8
Submitted: 13 Nov 2014 3:47 Modified: 18 Jan 2016 20:59
Reporter: Stewart Smith Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:7.3.7 OS:Linux
Assigned to: Assigned Account CPU Architecture:Any
Tags: ndb, POWER8, PowerPC

[13 Nov 2014 3:47] Stewart Smith
Description:
On big endian ppc64, the ndb.ndb_restore_conv_more test fails with ndb_restore aborting.

tail of var/log/ndb.ndb_restore_conv_more/tmp/ndb_testrun.log:

Processing data in table: test/def/NDB$BLOB_14_14(24) fragment 6
Opening file '/home/stewart/mysql-cluster-gpl-7.3.7/build/mysql-test/var/mysql_cluster.1//BACKUP/BACKUP-11/BACKUP-11.1.log'
File size 1024 bytes
Restored 155 tuples and 0 log entries
bh1->readData -8
/home/stewart/mysql-cluster-gpl-7.3.7/storage/ndb/tools/ndb_lib_move_data.cpp:1054: require((code != 0)) failed

The -8 there is from me modifying the error codes to find out exactly where it fails.

It asserts on ndb_lib_move_data.cpp:728, which is this:

      CHK2(bh1->readData(data, bytes) == 0, (bh1->getNdbError()));

Which means the blob is not null but we failed to read some data.

Adding in my marvellous debug code to find out where, I can see we're failing in NdbBlob::readDataPrivate, reading the last block:

  if (len > 0) {
    // partial last block                                                        
    DBUG_PRINT("info", ("partial last block pos=%llu len=%u", pos, len));
    assert((pos - theInlineSize) % thePartSize == 0 && len < thePartSize);
    Uint32 part = getPartNumber(pos);
    Uint16 sz = 0;
    if (readPart(thePartBuf.data, part, sz) == -1)
      DBUG_RETURN(-7);
    // need result now                                                           
    if (executePendingBlobReads() == -1)
      DBUG_RETURN(-8);

(I replaced return -1 with numbers to trace it), so (for some reason) we're failing executePendingBlobReads().

adding in:
fprintf(stderr, "part %u partsz %u pos %u sz %u len %u\n", part, thePart\
Size, pos, sz, len);
there, I get this just before the assert:
part 2 partsz 2000 pos 4256 sz 0 len 744

for the record, gdb "p *bh1" is:
$2 = {theBlobVersion = 2, theFixedDataFlag = false, theHeadSize = 16, 
  theVarsizeBytes = 2, theState = NdbBlob::Active, theNdbRecordFlag = true, 
  theEventBlobVersion = -1, theBtColumnNo = {-1, -1, 1, 2, 3}, 
  theNdb = 0x10037403b10, theNdbCon = 0x100374b5d10, theNdbOp = 0x1003746e0a0, 
  theEventOp = 0x0, theBlobEventOp = 0x0, theBlobEventPkRecAttr = 0x0, 
  theBlobEventDistRecAttr = 0x0, theBlobEventPartRecAttr = 0x0, 
  theBlobEventPkidRecAttr = 0x0, theBlobEventDataRecAttr = 0x0, theTable = 
    0x1003744d640, theAccessTable = 0x1003744d640, 
  theBlobTable = 0x100374622f0, theColumn = 0x1003744d030, 
  theFillChar = 255 '\377', theInlineSize = 256, thePartSize = 2000, 
  theStripeSize = 0, theGetFlag = false, theGetBuf = 0x0, theSetFlag = false, 
  theSetValueInPreExecFlag = false, theSetBuf = 0x0, theGetSetBytes = 0, 
  thePendingBlobOps = 1 '\001', theActiveHook = 0x0, theActiveHookArg = 0x0, 
  theKeyBuf = {data = 0x1003744bdc0 "\003", size = 4, maxsize = 8}, 
  theAccessKeyBuf = {data = 0x1003745dc60 "", size = 4, maxsize = 8}, 
  thePackKeyBuf = {data = 0x1003744d8a0 "\003", size = 4, maxsize = 8}, 
  theHeadInlineBuf = {data = 0x1003746eab0 "\016\001", size = 272, 
    maxsize = 272}, theHeadInlineCopyBuf = {data = 0x1003746f3b0 "", 
    size = 272, maxsize = 272}, thePartBuf = {data = 0x1003746ebd0 "", 
    size = 2000, maxsize = 2000}, thePartLen = 0, theBlobEventDataBuf = {
    data = 0x0, size = 0, maxsize = 0}, theBlobEventDistValue = 16842752, 
  theBlobEventPartValue = 927304144, theBlobEventPkidValue = 4294967295, 
  theHead = {varsize = 270, reserved = 0, pkid = 0, length = 5000, 
    headsize = 16}, 
  theInlineData = 0x1003746eac0 " -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.; -,.;"..., 
  theHeadInlineRecAttr = 0x1003746f4d0, theHeadInlineReadOp = 0x0, 
  theHeadInlineUpdateFlag = false, userDefinedPartitioning = false, 
  thePartitionId = 4294967295, thePartitionIdRecAttr = 0x0, theNullFlag = 0, 
  theLength = 5000, thePos = 0, theError = {status = NdbError::Success, 
    classification = NdbError::NoError, code = 0, mysql_code = 0, 
    message = 0x10235d30 "No error", details = 0x0}, theNext = 0x1003746f530}

How to repeat:
run ndb.ndb_restore_conv_more on POWER8 (maybe others? i haven't tried)

Suggested fix:
Find the bug and fix it :)
[2 Feb 2015 7:32] Stewart Smith
This test passes with 7.3.8 and on *LITTLE* endian ppc64.

Unless there is a plat to support NDB on any big endian platform (are any current platforms big endian?) it may be better to either close this or just add a configure check to no longer build on big endian platforms.
[5 Mar 2015 11:29] Mikael Ronström
Pretty sure SPARC is still big endian.
[18 Dec 2015 20:59] Gustaf Thorslund
Hi Stewart,

Using 7.3.7 I've verified this bug almost as described. System used was qemu-system-ppc64 running Debian Jessie. Due to this I also had to update some timeout variables, both for ndbd and mtr. I also had to copy storage/ndb/src/kernel/vm/mt-asm.h from 7.4.9 to get it to compile at all.

$ ./mysql-test/mtr ndb.ndb_restore_conv_more |& tee ndb.ndb_restore_conv_more.log

.
.
ndb.ndb_restore_conv_more                [ fail ]
        Test ended at 2015-12-18 16:29:44

CURRENT_TEST: ndb.ndb_restore_conv_more
/home/gustaf/src/mysql-cluster-gpl-7.3.7/storage/ndb/tools/ndb_lib_move_data.cpp:1048: require((code != 0)) failed

   1043 void
   1044 Ndb_move_data::set_error_code(int code, const char* fmt, ...)
   1045 {
   1046   va_list ap;
   1047   va_start(ap, fmt);
   1048   require(code != 0);
   1049   Error& e = m_error;
   1050   e.code = code;
   1051   my_vsnprintf(e.message, sizeof(e.message), fmt, ap);
   1052   va_end(ap);
   1053 }

$ tail mysql-test/var/log/ndb.ndb_restore_conv_more/tmp/ndb_testrun.log
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 1
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 3
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 5
_____________________________________________________
Processing data in table: test/def/NDB$BLOB_18_3(20) fragment 7
Opening file '/home/gustaf/src/mysql-cluster-gpl-7.3.7/build/mysql-test/var/mysql_cluster.1//BACKUP/BACKUP-8/BACKUP-8.2.log'
File size 1024 bytes
Restored 590 tuples and 0 log entries

$ cat /proc/cpuinfo 
processor	: 0
cpu		: POWER7 (raw), altivec supported
clock		: 1000.000000MHz
revision	: 2.3 (pvr 003f 0203)

timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)

If I instead try the current version 7.4.9 I cannot reproduce the bug:
ndb.ndb_restore_conv_more                [ pass ]  7769970
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 7769.970 of 8369 seconds executing testcases

Completed: All 1 tests were successful.

real	139m29.313s
user	46m59.335s
sys	91m31.128s

Could you try the current GA too?

/Gustaf
[19 Jan 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".