Bug #28123 rpl_ndb_mix_innodb.test casue slave to core on sol10-sparc-a
Submitted: 26 Apr 2007 14:26 Modified: 10 Jul 2007 7:17
Reporter: Jonathan Miller Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Cluster: Replication Severity:S1 (Critical)
Version:5.1-telco OS:Solaris (sol10-sparc-a)
Assigned to: Assigned Account CPU Architecture:Any

[26 Apr 2007 14:26] Jonathan Miller
Description:
After several correction to original test rpl_ndb_stm_innodb.test the test case was broken out into 2 test cases that use wrappers. A wrapper for SBR and one for MBR. Both of these test passed on all of the push build systems with the exception of rpl_ndb_mix_innodb.test that caused a crash on the slave of the sol10-sparc-a system:

CURRENT_TEST: rpl_ndb_mix_innodb
070426  2:31:25 [Warning] The syntax for replication startup options is deprecated and will be removed in MySQL 5.2. Please use 'CHANGE MASTER' instead.
070426  2:31:25 [Note] NDB: NodeID is 3, management server 'localhost:11106'
070426  2:31:30 [Note] NDB[0]: all storage nodes connected
070426  2:31:31 [Note] NDB[1]: all storage nodes connected
070426  2:31:31 [Note] NDB[2]: all storage nodes connected
070426  2:31:31 [Note] Starting MySQL Cluster Binlog Thread
070426  2:31:31 [Note] Event Scheduler: Loaded 0 events
070426  2:31:31 [Note] /export/home/pushbuild/pb/mysql-5.1-telco/138/mysql-5.1.18-ndb-6.2.1-pb138/sql/mysqld: ready for connections.
Version: '5.1.18-ndb-6.2.1-pb138-log'  socket: '/tmp/pbtmp-n_mix-110/slave.sock'  port: 11102  MySQL Pushbuild Edition, build 138
070426  2:31:31 [Note] NDB Binlog: Ndb tables initially read only.
070426  2:31:31 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema
070426  2:31:31 [Note] NDB Binlog: logging ./mysql/ndb_schema
070426  2:31:31 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status
070426  2:31:31 [Note] NDB Binlog: logging ./mysql/ndb_apply_status
070426  2:31:31 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log '/tmp/var-n_mix-110/log/slave-relay-bin.000001' position: 4
070426  2:31:31 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:11100',replication started in log 'FIRST' at position 4
070426  2:31:31 [Note] NDB Binlog: ndb tables writable
070426  2:31:32 [Note] NDB Binlog: starting log at epoch 186
070426  2:31:32 [Note] NDB Binlog: RENAME Event: REPL$test/t1
070426  2:31:32 [Note] NDB Binlog: logging ./test/t1
070426  2:31:33 [Note] NDB Binlog: Node: 1, subscribe from node 3, Subscriber bitmask 08000000
070426  2:31:34 [Note] NDB Binlog: RENAME Event: REPL$test/t2
070426  2:31:34 [Note] NDB Binlog: logging ./test/t2
070426  2:31:35 [Note] NDB Binlog: drop table ./test/t1.
070426  2:31:36 [Note] NDB Binlog: drop table ./test/t2.
070426  2:31:39 [Note] NDB Binlog: RENAME Event: REPL$tpcb/account
070426  2:31:39 [Note] NDB Binlog: logging ./tpcb/account
070426  2:31:40 [Note] NDB Binlog: RENAME Event: REPL$tpcb/branch
070426  2:31:40 [Note] NDB Binlog: logging ./tpcb/branch
070426  2:31:41 [Note] NDB Binlog: RENAME Event: REPL$tpcb/teller
070426  2:31:41 [Note] NDB Binlog: logging ./tpcb/teller
070426  2:31:44 [Note] NDB Binlog: RENAME Event: REPL$tpcb/history
070426  2:31:44 [Note] NDB Binlog: logging ./tpcb/history
070426  2:31:46 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=151
threads_connected=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 59821 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Writing a core file  

How to repeat:
Pull 5.1-telco and build on a sol10-sparc system.

run > mysql-test-run.pl --do-test=rpl_ndb_mix_innodb --mysqld=--binlog-format=MIXED
[2 Jul 2007 6:48] Guangbao Ni
Loaded symbols for /usr/platform/SUNW,Sun-Fire-V210/lib/libc_psr.so.1
#0  0xff11fc54 in _lwp_kill () from /usr/lib/libc.so.1
(gdb) bt
#0  0xff11fc54 in _lwp_kill () from /usr/lib/libc.so.1
#1  0x00339730 in write_core (sig=11) at stacktrace.c:229
#2  0x001b69c0 in handle_segfault (sig=11) at mysqld.cc:2239
#3  0xff355bb4 in __sighndlr () from /usr/lib/libthread.so.1
#4  0xff34f80c in call_user_handler () from /usr/lib/libthread.so.1
#5  <signal handler called>
#6  0xff0b455c in strlen () from /usr/lib/libc.so.1
#7  0xff107058 in _doprnt () from /usr/lib/libc.so.1
#8  0xff108ab4 in fprintf () from /usr/lib/libc.so.1
#9  0x002b2190 in ndbcluster_commit (hton=0x0, thd=0xffffffff, all=false)
    at ha_ndbcluster.cc:5152
#10 0x0029bff8 in ha_commit_one_phase(THD*, bool) (thd=0xb969e0, all=false)
    at handler.cc:789
#11 0x0029c5a4 in ha_commit_trans(THD*, bool) (thd=0xb969e0, all=false)
    at handler.cc:755
#12 0x0029c968 in ha_autocommit_or_rollback(THD*, int) (thd=0xb969e0,
    error=0) at handler.cc:899
#13 0x00273594 in Rows_log_event::do_update_pos(st_relay_log_info*) (
---Type <return> to continue, or q <return> to quit---
    this=0xc93fc8, rli=0xb1d118) at log_event.cc:6176
#14 0x0032508c in exec_relay_log_event (thd=0xb969e0, rli=0xb1d118)
    at log_event.h:828
#15 0x00325aa4 in handle_slave_sql (arg=0xb1d118) at slave.cc:2407
[3 Jul 2007 7:03] Guangbao Ni
in slave.err file, there are the following error messages:

070629 15:34:18 [ERROR] Slave: Error in Update_rows event: commit of row events failed, Error_code: 1
070629 15:34:18 [ERROR] Slave: It was not possible to update the positions of the relay log information: the slave may be in an inconsistent state. Stopped in /export/home/ngb/telco_bug28346/mysql-test/var/log/slave-relay-bin.000003 position 194946, Error_code: 1105
070629 15:34:18 [ERROR] Slave (additional info): Got error 4350 'Transaction already aborted' from NDBCLUSTER Error_code: 1296
070629 15:34:18 [Warning] Slave: Got error 4350 'Transaction already aborted' from NDB Error_code: 1296
070629 15:34:18 [Warning] Slave: Got error 4350 'Transaction already aborted' from NDBCLUSTER Error_code: 1296
070629 15:34:18 [Warning] Slave: Got error 4350 'Transaction already aborted' from NDBCLUSTER Error_code: 1296
070629 15:34:18 [Warning] Slave: Got error 4350 during COMMIT Error_code: 1180
070629 15:34:18 [Warning] Slave: Unknown error Error_code: 1105
070629 15:34:18 [Warning] Slave: Unknown error Error_code: 1105
070629 15:34:18 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 194800
NDB: Found 3 NdbTransaction's that have not been released
NDB: Found 1 NdbReceiver that has not been released
NDB: Found 3 NdbTransaction's that have not been released
NDB: Found 1 NdbReceiver that has not been released
070629 15:34:49 [Note] Slave: received end packet from server, apparent master shutdown:
070629 15:34:49 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'master-bin.000001' position 197628
070629 15:34:49 [ERROR] Slave I/O thread: error reconnecting to master 'root@127.0.0.1:9306':                       Error: 'Can't connect to MySQL server on '127.0.0.1' (146)'  errno: 2003  retry-time: 1  retries: 10
070629 15:34:49 [Note] /export/home/ngb/telco_bug28346/sql/mysqld: Normal shutdown
070629 15:34:49 [Note] Event Scheduler: Purging the queue. 0 events
070629 15:34:49 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
070629 15:34:49 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 197628
070629 15:34:49 [Note] Stopping Cluster Binlog
070629 15:34:49 [Warning] Forcing shutdown of 1 plugins
070629 15:34:49 [Note] Plugin 'ndbcluster' will be forced to shutdown
070629 15:34:49 [Note] Stopping Cluster Utility thread
NDB: table share ./mysql/ndb_apply_status with use_count 1 not freed
NDB: table share ./tpcb/teller with use_count 1 not freed
NDB: table share ./tpcb/branch with use_count 1 not freed
NDB: table share ./tpcb/account with use_count 1 not freed
NDB: table share ./tpcb/history with use_count 1 not freed
NDB: table share ./mysql/ndb_schema with use_count 1 not freed
070629 15:34:51 [Note] /export/home/ngb/telco_bug28346/sql/mysqld: Shutdown complete
[9 Jul 2007 1:32] Guangbao Ni
for rpl_ndb_mix_innodb
If master and slave both run on Solaris, and when master table is INNODB and slave table is NDB, in Field_blob::unpack() and Field_blob::pack(), 
these two function will both call get_length() function,
in uint32 Field_blob::get_length() function,
uint32 Field_blob::get_length(const char *pos) 
{ 
  switch (packlength) { 
  case 1:  
    return (uint32) (uchar) pos[0];  
  case 2:  
    { 
      uint16 tmp; 
#ifdef WORDS_BIGENDIAN 
      if (table->s->db_low_byte_first) 
  tmp=sint2korr(pos);     //master is INNODB, will run this
      else     
#endif 
  shortget(tmp,pos);    //slave is NDB, will run this
      return (uint32) tmp; 
    } 
  case 3:  
    return (uint32) uint3korr(pos); 
  case 4:
    {
      uint32 tmp;
#ifdef WORDS_BIGENDIAN
      if (table->s->db_low_byte_first)
  tmp=uint4korr(pos);
      else
#endif
  longget(tmp,pos);
      return (uint32) tmp;
    }
  }
  return 0;         // Impossible
}

and from the result of printout, i can see the master prints the result of get_length() is 36 and the slave print out 603979776. it just means the different length byte order.
[9 Jul 2007 9:16] Guangbao Ni
and I simplify the test case as followings,

--disable_query_log
--source include/have_ndb.inc
--source include/have_innodb.inc
--source include/have_binlog_format_mixed.inc
--source include/master-slave.inc
--enable_query_log
let $off_set = 9;
let $rpl_format = 'MIX';

# Create database/tables and stored procdures
connection master;
#--source include/tpcb.inc
CREATE DATABASE tpcb;

CREATE TABLE tpcb.history (id MEDIUMINT NOT NULL AUTO_INCREMENT,
                           tdate DATETIME, uuidf LONGBLOB,
                           filler CHAR(80),PRIMARY KEY (id));
# Switch tables on slave to use NDB
--sync_slave_with_master
USE tpcb;
ALTER TABLE history ENGINE NDB;

--echo

# Load DB tpcb and run some transactions
connection master;
--disable_query_log
#CALL tpcb.load();
INSERT INTO tpcb.history VALUES(1, NOW(),
         UUID(),'completed trans');
--sync_slave_with_master
[10 Jul 2007 7:17] Guangbao Ni
duplicate with bug#29549