Bug #39540 ndb_restore crash while restoring log from different endian
Submitted: 19 Sep 2008 16:31 Modified: 13 Apr 2009 16:16
Reporter: Joerg Bruehe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: Cluster 6.3.17 OS: Solaris (Sparc only)
Assigned to: Magnus Blåudd CPU Architecture: Any

[19 Sep 2008 16:31] Joerg Bruehe
Description:
This one occurred in the build of "Cluster-6.3.17".
I do not see a similar error in the Bugs DB, so this seems to be new.

Symptom:
=====
ndb.ndb_restore_compat         [ fail ]

Segmentation Fault - core dumped
mysqltest: At line NNN: command "$NDB_TOOLS_DIR/ndb_restore --no-defaults -b 1 -n 2 -p 1 -r $MYSQL_TEST_DIR/std_data/ndb_backup_packed >> $NDB_TOOLS_OUTPUT" failed

The result from queries just before the failure was:
< snip >
COUNT(*)
4056
SELECT * FROM SYSTEM_VALUES ORDER BY SYSTEM_VALUES_ID;
SYSTEM_VALUES_ID        VALUE
0       2297
1       5
SELECT * FROM mysql.ndb_apply_status WHERE server_id=0;
server_id       epoch   log_name        start_pos       end_pos
0       331             0       0
SELECT * FROM DESCRIPTION ORDER BY USERNAME;
USERNAME        ADDRESS
Guangbao Ni     Suite 503, 5F NCI Tower, A12 Jianguomenwai Avenue Chaoyang District, Beijing, 100022  PRC
USERNAME Varchar(255;latin1_swedish_ci) NULL AT=SHORT_VAR ST=MEMORY
ADDRESS Longvarchar(2002;latin1_swedish_ci) NULL AT=MEDIUM_VAR ST=MEMORY
DROP TABLE GL;
DROP TABLE ACCOUNT;
DROP TABLE TRANSACTION;
DROP TABLE SYSTEM_VALUES;
DROP TABLE ACCOUNT_TYPE;
exec of '/PATH/bin/ndb_restore --no-defaults -b 1 -n 2 -p 1 -r /PATH/mysql-test/std_data/ndb_backup_packed >> /PATH/mysql-test/var/log/ndb_testrun.log' failed, error: 35584, status: 139, errno: 29

More results from queries before failure can be found in /PATH/mysql-test/var/log/ndb_restore_compat.log
=====

Status 139 means a crash with core dump (128 + signal 11, SIGSEGV);
errno 29 is "ESPIPE".
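
The numbers in the failure line appear to be related. The following is a minimal sketch, assuming that the "error" value is a raw wait status carrying the exit status in its high byte, and that the shell reports a child killed by a signal as 128 + signal number. Both are common conventions, but neither is confirmed in this report. The helper `decode_error` and the struct `Decoded` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical helper, not part of mysqltest: splits the "error" number
// from the failure line into the "status" value and a signal number.
struct Decoded {
  int status;     // exit status as reported by the shell (assumed layout)
  int signal_no;  // terminating signal, or 0 if the status implies none
};

inline Decoded decode_error(int raw_error) {
  assert(raw_error >= 0);
  Decoded d;
  // Assumption: exit status is stored in bits 8..15 of the raw value.
  d.status = (raw_error >> 8) & 0xff;
  // Assumption: the shell maps "killed by signal N" to status 128 + N.
  d.signal_no = (d.status > 128) ? d.status - 128 : 0;
  return d;
}
```

Under these assumptions, `decode_error(35584)` yields status 139 and signal 11, which agrees with the "Segmentation Fault" line in the test output.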

It happened in all tests for Solaris (9 and 10) on Sparc (32 and 64 bit),
and only on these platforms.
(Failures of this test on other platforms were caused by more general problems there.)

How to repeat:
This occurred when running the test suite.
[15 Dec 2008 11:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61659

2787 Leonard Zhou	2008-12-15
      BUG#39540 Correct 'ndb_restore' tool
[9 Feb 2009 18:49] Joerg Bruehe
Same crash in the build of cluster 6.4.2, same platforms.
[9 Feb 2009 20:11] Joerg Bruehe
... and I suspect this might be related to the same basic problem,
even though it shows on 64-bit hosts only (build + test of 6.4.2):

=====
ndb.ndb_restore_undolog        [ fail ]

Bus Error - core dumped
mysqltest: At line 441: command "$NDB_RESTORE --no-defaults -b 1 -n 2 -r $MYSQL_TEST_DIR/std_data/ndb_backup51_undolog_le >> $NDB_TOOLS_OUTPUT" failed

The result from queries just before the failure was:
USE test;
DROP TABLE IF EXISTS t_num,t_datetime,t_string_1,t_string_2,t_gis,t_string_3,t_string_4,t_string_5;
exec of '/PATH/bin/ndb_restore --no-defaults -b 1 -n 2 -r /PATH/mysql-test/std_data/ndb_backup51_undolog_le >> /PATH/mysql-test/var/log/ndb_testrun.log' failed, error: 35328, status: 138, errno: 29

More results from queries before failure can be found in /PATH/mysql-test/var/log/ndb_restore_undolog.log

Warnings from just before the error:
Note 1051 Unknown table 't_num'
Note 1051 Unknown table 't_datetime'
Note 1051 Unknown table 't_string_1'
Note 1051 Unknown table 't_string_2'
Note 1051 Unknown table 't_gis'
Note 1051 Unknown table 't_string_3'
Note 1051 Unknown table 't_string_4'
=====
[1 Apr 2009 13:59] Jonas Oreland
Magnus,
why not take it one step further
and push the "null" check down into Twiddle?

Also, can you explain in more detail what the problem is?
Why does it fail only on SPARC?
Does Twiddle(0) work on Linux (or x86)?
[2 Apr 2009 8:29] Magnus Blåudd
mysqldev@sol10-sparc-a:~/magnus/mysql-5.1.32-ndb-7.0.5-pb558/mysql-test> ../libtool --mode=execute dbx ../storage/ndb/tools/ndb_restore core 
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading ndb_restore
core file header read successfully
Reading ld.so.1
Reading libmtmalloc.so.1
Reading libpthread.so.1
Reading libthread.so.1
Reading librt.so.1
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libm.so.2
Reading libCstd.so.1
Reading libCrun.so.1
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
Reading libc_psr.so.1
t@1 (l@1) terminated by signal SEGV (no mapping at the fault address)
Current function is BackupFile::Twiddle
   84         attr_data->u_int32_value[i] = Twiddle32(attr_data->u_int32_value[i]);
(dbx) p attr_data    
attr_data = 0x10043f320
(dbx) p *attr_data
*attr_data = {
    null          = true
    size          = 5365856U
    int8_value    = (nil)
    u_int8_value  = (nil)
    int16_value   = (nil)
    u_int16_value = (nil)
    int32_value   = (nil)
    u_int32_value = (nil)
    int64_value   = (nil)
    u_int64_value = (nil)
    string_value  = (nil)
    void_value    = (nil)
}
(dbx) up
Current function is RestoreLogIterator::getNextLogEntry
 1805       Twiddle(attr->Desc, &(attr->Data));
(dbx) p sz
sz = 0
(dbx) p attr
attr = 0x10043f318
(dbx) p * attr
*attr = {
    Desc = 0x10041b138
    Data = {
        null          = true
        size          = 5365856U
        int8_value    = (nil)
        u_int8_value  = (nil)
        int16_value   = (nil)
        u_int16_value = (nil)
        int32_value   = (nil)
        u_int32_value = (nil)
        int64_value   = (nil)
        u_int64_value = (nil)
        string_value  = (nil)
        void_value    = (nil)
    }
}
(dbx) p attr->Desc
attr->Desc = 0x10041b138
(dbx) p *attr->Desc
*attr->Desc = {
    size           = 32U
    arraySize      = 2U
    attrId         = 9U
    m_column       = 0x10049b418
    m_nullBitIndex = 0
    convertFunc    = (nil)
    parameter      = (nil)
}
[2 Apr 2009 14:30] Magnus Blåudd
The crash occurs in ndb_restore when it tries to restore the log part of the checked-in backup in mysql-test/std_data/ndb_backup_packed.

The problem does not show up on little-endian machines because that backup was taken on a little-endian machine: 'Twiddle' detects this by checking "m_hostByteOrder" and returns immediately without doing anything.

On a big-endian machine, 'Twiddle' tries to swap the byte order of the data in the "attr_data" union, whose pointers have previously been set to NULL through the "void_value" member.

I suggest that we don't call 'Twiddle' when the data is NULL and there is nothing to do, and that we add asserts in 'Twiddle' to detect the problem on any platform.
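
To illustrate the failure mode and the guard being discussed, here is a minimal sketch. It is not the committed patch and not the real BackupFile::Twiddle: the struct `AttrData`, the functions `swap32` and `twiddle32_array`, and the `same_byte_order` parameter are hypothetical stand-ins, reduced to the fields visible in the dbx output. It places the NULL check inside the swapping function, as suggested earlier in this thread, and keeps an assert for the pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the attribute data seen in the dbx output;
// only the fields needed for the illustration are kept.
struct AttrData {
  bool null;                     // true when the column value is NULL
  std::uint32_t* u_int32_value;  // nullptr when there is no data
};

// Reverse the byte order of one 32-bit word.
inline std::uint32_t swap32(std::uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

// Swap `count` 32-bit words in place. Returns true if words were swapped.
inline bool twiddle32_array(AttrData* d, std::size_t count,
                            bool same_byte_order) {
  assert(d != nullptr);
  if (same_byte_order) return false;  // backup matches host: nothing to do
  if (d->null) return false;          // NULL value: no data to swap
  assert(d->u_int32_value != nullptr);
  for (std::size_t i = 0; i < count; ++i) {
    d->u_int32_value[i] = swap32(d->u_int32_value[i]);
  }
  return true;
}
```

Without the `d->null` check, the loop would dereference a null pointer whenever the byte orders differ, which is consistent with the crash being seen only when a little-endian backup is restored on a big-endian host.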
[2 Apr 2009 15:08] Jonas Oreland
Super, Magnus,
OK to push (whatever you do), since the explanation is great.
[2 Apr 2009 18:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71232
[2 Apr 2009 18:44] Bugs System
Pushed into 5.1.32-ndb-6.3.24 (revid:magnus.blaudd@sun.com-20090402182917-iiiqfx7uf2g89y2a) (version source revid:magnus.blaudd@sun.com-20090402182917-iiiqfx7uf2g89y2a) (merge vers: 5.1.32-ndb-6.3.24) (pib:6)
[2 Apr 2009 19:05] Bugs System
Pushed into 5.1.32-ndb-7.0.5 (revid:magnus.blaudd@sun.com-20090402185252-ohj110mutolxd2x4) (version source revid:magnus.blaudd@sun.com-20090402185252-ohj110mutolxd2x4) (merge vers: 5.1.32-ndb-7.0.5) (pib:6)
[13 Apr 2009 16:16] Jon Stephens
Documented bugfix in the NDB-6.3.24 and 7.0.5 changelogs as follows:

        ndb_restore crashed when trying to restore a backup to a
        MySQL Cluster running on a platform whose endianness differed
        from that of the platform on which the backup was taken.