Bug #80096 ndb_print_file core dump on Solaris
Submitted: 21 Jan 2016 10:11 Modified: 30 Mar 2016 10:21
Reporter: Magnus Blåudd
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.5.1
OS: Any
CPU Architecture: Any
Assigned to:

[21 Jan 2016 10:11] Magnus Blåudd
Description:
The ndb_print_file test fails consistently when running on Solaris 11.

ndb.ndb_print_file                       w3 [ fail ]
Test ended at 2016-01-21 05:59:59

CURRENT_TEST: ndb.ndb_print_file
mysqltest: At line 39: command "$NDB_PRINT_FILE $MYSQLD_DATADIR/../../ndbd.2/ndb_2_fs/data_1.dat > $NDB_TOOLS_OUTPUT " failed

Output from before failure:
exec of '/export/home/pb2/test/sb_2-17673790-1453350399.72/mysql-cluster-gpl-7.5.1-solaris11-sparc-64bit/bin/ndb_print_file /tmp/mtr-29342/var-n_mix/3/mysql_cluster.1/mysqld.1/data//../../ndbd.2/ndb_2_fs/data_1.dat > /tmp/mtr-29342/var-n_mix/3/tmp/ndb_testrun.log ' failed, error: 138, status: 138, errno: 29

How to repeat:
Fails all the time in PB. It could be worth first double-checking that errno 29 does not mean disk full or something similar.

Suggested fix:
.
[22 Jan 2016 14:05] Magnus Blåudd
Posted by developer:
 
Caused by a SIGBUS when calling localtime. An unaligned pointer is passed in. The code probably needs to be rewritten to use an aligned buffer on the stack.

t@1 (l@1) program terminated by signal BUS (invalid address alignment)
0xffffffff7e581900: localtime+0x0030:   ldx      [%i0], %i0
current thread: t@1
=>[1] localtime(0x10048c0e4, 0xffffffff7e200280, 0x0, 0xffffffff7e782000, 0x20072c, 0xffffffff7e000198), at 0xffffffff7e581900 
  [2] ctime(0x10048c0e4, 0xffffffff7e200240, 0x5, 0xffffffff7fff9430, 0x100, 0x70501), at 0xffffffff7e570310 
  [3] _ZlsR6NdbOutRKN12File_formats16Zero_page_headerE(0x10048c180, 0x10048c0c8, 0xffffffff7e787ba4, 0x10048c180, 0x100309ba8, 0x10048c180), at 0x10001c7ec 
  [4] _ZlsR6NdbOutRKN12File_formats8Datafile9Zero_pageE(0x10048c180, 0x10048c0c8, 0x0, 0x4, 0x2000000080, 0x0), at 0x10001c824 
  [5] _ZL15print_zero_pageiPvj(0x0, 0x10049c3f0, 0x8000, 0xffffffff7e787c80, 0x10048c180, 0x10048c0c8), at 0x10001bd50 
  [6] main(0xffffffff7e787c80, 0x8000, 0x1004176a8, 0x0, 0xffffffff7fff9890, 0x8000), at 0x100057504 
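
The alignment of the faulting address can be checked by hand. The sketch below takes the first argument passed to localtime in frame [1] (0x10048c0e4) and computes its offset from a 4-byte and an 8-byte boundary. The helper name misalignment is made up for this illustration and is not part of the NDB sources. The faulting instruction, ldx, is a 64-bit load on SPARC, so it needs an 8-byte aligned address.

```cpp
#include <cstdint>

// Hypothetical helper: offset of addr from the previous multiple of
// boundary. boundary must be a power of two. A result of 0 means aligned.
constexpr std::uint64_t misalignment(std::uint64_t addr, std::uint64_t boundary)
{
  return addr & (boundary - 1);
}

// Address taken from the localtime() frame in the trace above.
constexpr std::uint64_t faulting_addr = 0x10048c0e4ULL;

// 4-byte aligned: fine for a 32-bit load.
static_assert(misalignment(faulting_addr, 4) == 0, "expected 4-byte alignment");

// Not 8-byte aligned: a 64-bit load from this address can raise SIGBUS.
static_assert(misalignment(faulting_addr, 8) == 4, "expected offset 4 from 8-byte boundary");
```

The address ends in 0xe4, which is a multiple of 4 but leaves a remainder of 4 when divided by 8, matching the "invalid address alignment" reported by the debugger.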
[22 Jan 2016 14:09] Magnus Blåudd
Posted by developer:
 

NdbOut&
operator<<(NdbOut& out, const File_formats::Zero_page_header& obj)
{
  char buf[256];
  out << "page size:   " << obj.m_page_size << endl;
  out << "ndb version: " << obj.m_ndb_version << ", " <<
    ndbGetVersionString(obj.m_ndb_version, 0, 0, buf, sizeof(buf)) << endl;
  out << "ndb node id: " << obj.m_node_id << endl;
  out << "file type:   " << obj.m_file_type << endl;
  out << "time:        " << obj.m_time << ", " 
      << ctime((time_t*)&obj.m_time)<< endl;
         ^^^^
  return out;
}
[24 Mar 2016 10:37] Mauritz Sundell
Posted by developer:
 
Crash is in 
NdbOut&
operator<<(NdbOut& out, const File_formats::Zero_page_header& obj)
{
  char buf[256];
  out << "page size:   " << obj.m_page_size << endl;
  out << "ndb version: " << obj.m_ndb_version << ", " <<
    ndbGetVersionString(obj.m_ndb_version, 0, 0, buf, sizeof(buf)) << endl;
  out << "ndb node id: " << obj.m_node_id << endl;
  out << "file type:   " << obj.m_file_type << endl;
  out << "time:        " << obj.m_time << ", "
      << ctime((time_t*)&obj.m_time)<< endl;
               ^^^^^^^^^^^^ m_time is a 32-bit word aligned on 4 bytes but not on 8 bytes, and time_t is 8 bytes!
  return out;
}
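
One possible way to avoid the unaligned access, along the lines of the "aligned buffer on the stack" idea mentioned earlier, is to widen the 32-bit value into a local time_t and pass the address of that local to ctime. This is only a sketch and not necessarily the change that was committed. Zero_page_header_sketch, aligned_time and format_time are hypothetical names, and the assumption that m_time is an unsigned 32-bit field comes from the comment above rather than from the header definition.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical stand-in for File_formats::Zero_page_header; only the
// 32-bit time field matters for this sketch.
struct Zero_page_header_sketch
{
  std::uint32_t m_page_size;
  std::uint32_t m_time; // 4-byte aligned, not necessarily 8-byte aligned
};

// Widen the 32-bit timestamp by value. The returned time_t lives in a
// register or on the stack, so it is correctly aligned for its type.
inline std::time_t aligned_time(const Zero_page_header_sketch& obj)
{
  return static_cast<std::time_t>(obj.m_time);
}

// Format the timestamp without casting &obj.m_time to time_t*.
inline std::string format_time(const Zero_page_header_sketch& obj)
{
  const std::time_t t = aligned_time(obj);
  const char* s = std::ctime(&t); // &t points at an aligned local
  return s ? std::string(s) : std::string("(invalid time)");
}
```

Copying by value also avoids reading 8 bytes from a field that only holds 4, which the original cast would do on any platform where time_t is 64-bit, whether or not the load traps.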

$ grep TIME_T ../CMakeCache.txt
HAVE_SIZEOF_TIME_T:INTERNAL=TRUE
SIZEOF_TIME_T:INTERNAL=8

[msundell@vimur09]~/build-7.5/mysql-test: /opt/SUNWspro/bin/dbx ../storage/ndb/src/kernel/blocks/ndb_print_file var/log/ndb.ndb_print_file/core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading ndb_print_file
core file header read successfully
Reading ld.so.1
Reading libnsl.so.1
Reading libsocket.so.1
Reading libstdc++.so.6.0.18
dbx: warning: unknown location expression code (0xe0)
dbx: warning: unknown location expression code (0xe0)
Reading libm.so.2
Reading libgcc_s.so.1
Reading libc.so.1
program terminated by signal BUS (invalid address alignment)
0xffffffff7e57f0d0: localtime+0x0030:   ldx      [%i0], %i0
(dbx) up
0xffffffff7e56de90: ctime+0x0024:       call     localtime      ! 0xffffffff7e57f0a0
(dbx) up
0x00000001000244a0: _ZlsR6NdbOutRKN12File_formats16Zero_page_headerE+0x0254:    call     ctime [PLT]    ! 0x100452cc0
(dbx) up
0x000000010002451c: _ZlsR6NdbOutRKN12File_formats8Datafile9Zero_pageE+0x0018:   call     _ZlsR6NdbOutRKN12File_formats16Zero_page_headerE       ! 0x10002424c
(dbx) up
0x00000001000223d4: _ZL15print_zero_pageiPvj+0x01dc:    call     _ZlsR6NdbOutRKN12File_formats8Datafile9Zero_pageE      ! 0x100024504
[30 Mar 2016 10:21] Jon Stephens
Fixed in NDB 7.4.11 and 7.5.2. Documented as follows:

    The ndb_print_file utility failed consistently on Solaris 9 for SPARC.

Closed.