Bug #19885 memory corruption with ndb binlog in conjunction with online alter table
Submitted: 17 May 2006 14:08 Modified: 19 May 2006 23:39
Reporter: Kristian Nielsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Replication Severity:S1 (Critical)
Version:5.1.11 OS:Linux (Linux/x86 (32 bit))
Assigned to: Tomas Ulin

[17 May 2006 14:08] Kristian Nielsen
Description:
I get an occasional crash on master1 in test case ndb_condition_pushdown.

The crash happens only rarely, running the test in a loop I've seen it happen as early as on the 4th and as late as on the 33th run.

This is the backtrace from gdb:

#0  0xb7f789d3 in pthread_kill () from /lib/tls/libpthread.so.0
#1  0x083fbf4a in write_core (sig=11) at stacktrace.c:220
#2  0x0824223d in handle_segfault (sig=11) at mysqld.cc:2151
#3  <signal handler called>
#4  0x082fef1a in Table_map_log_event (this=0xb7a04138, thd=0x8cc5c78, tbl=0x90d9b08, tid=69, 
    is_transactional=true, flags=0) at log_event.cc:5693
#5  0x082ee045 in THD::binlog_write_table_map (this=0x8cc5c78, table=0x90d9b08, is_trans=true)
    at log.cc:2673
#6  0x083f2261 in injector::transaction::use_table (this=0xb7a0436c, sid=102, tbl=
      {m_table = 0x90d9b08, m_is_transactional = true}) at rpl_injector.cc:81
#7  0x0839c872 in ndb_binlog_thread_func (arg=0x0) at ha_ndbcluster_binlog.cc:3442
#8  0xb7f75ced in start_thread () from /lib/tls/libpthread.so.0
#9  0xb7e9fdee in clone () from /lib/tls/libc.so.6

The crash is in this line in log_event.cc:

      m_coltype[i]= m_table->field[i]->type();

i=7, and the vtable of table->field[i] is corrupt, causing the virtual function call to fail.

(gdb) p *(m_table->field[7])
$26 = {_vptr.Field = 0x15228, ptr = 0x90d0000 "¥¥¥¥¥¥¥¥h4z\025I\001", null_ptr = 0x0, 
  table = 0x90d9b08, orig_table = 0x90d9b08, table_name = 0x90d9b6c, 
  field_name = 0x90dac39 "medium", comment = {str = 0x87f9fb4 "", length = 0}, 
  query_id = 73014444032, add_index = 165, key_start = {map = 2}, part_of_key = {map = 0}, 
  part_of_sortkey = {map = 2}, unireg_check = NONE, field_length = 8, field_index = 7, 
  flags = 16393, fieldnr = 8, null_bit = 0 '\0'}

Address 0x15228 is not valid, other vtable entries are in the 0x87a???? range.

So it appears that someone is overwriting the vtable entry...

How to repeat:
(for i in `seq 1 100` ; do echo XXXXXXXXX $i XXXXXXXXXXX; MTR_BUILD_THREAD=4 time perl mysql-test-run.pl --timer --force --tmpdir=/dev/shm/t4 --vardir=/dev/shm/v4 ndb_condition_pushdown || exit 1; done)
[17 May 2006 14:24] Kristian Nielsen
With a --debug run it crashes earlier when trying to dump the table:

#0  0xb7f429d3 in pthread_kill () from /lib/tls/libpthread.so.0
#1  0x083fbf4a in write_core (sig=11) at stacktrace.c:220
#2  0x0824223d in handle_segfault (sig=11) at mysqld.cc:2151
#3  <signal handler called>
#4  0x08392a03 in dbug_print_table (info=0x8816d98 "table", table=0x8d69a98)
    at ha_ndbcluster_binlog.cc:169
#5  0x08399bbe in ndb_binlog_thread_handle_data_event (ndb=0x8d04c10, 
    pOp=0x8cc5050, row=@0xb79ce2cc, trans=@0xb79ce36c)
    at ha_ndbcluster_binlog.cc:2884
#6  0x0839cd64 in ndb_binlog_thread_func (arg=0x0)
    at ha_ndbcluster_binlog.cc:3516
#7  0xb7f3fced in start_thread () from /lib/tls/libpthread.so.0
#8  0xb7e69dee in clone () from /lib/tls/libc.so.6
(gdb) frame 5
#5  0x08399bbe in ndb_binlog_thread_handle_data_event (ndb=0x8d04c10, 
    pOp=0x8cc5050, row=@0xb79ce2cc, trans=@0xb79ce36c)
    at ha_ndbcluster_binlog.cc:2884

(gdb) p i
$6 = 4
(gdb) p *f
$7 = {_vptr.Field = 0x803ff199, ptr = 0x0, 
  null_ptr = 0x1e80301 <Address 0x1e80301 out of bounds>, table = 0x10001, 
  orig_table = 0x100, table_name = 0x100, field_name = 0x0, comment = {
    str = 0x21020200 <Address 0x21020200 out of bounds>, length = 1963003610}, 
  query_id = 1245845979797979175, add_index = false, key_start = {
    map = 17515}, part_of_key = {map = 0}, part_of_sortkey = {map = 0}, 
  unireg_check = NONE, field_length = 7, field_index = 4, flags = 128, 
  fieldnr = 5, null_bit = 8 '\b'}

This time, it is field 4 that is (severely) corrupted.
[18 May 2006 7:24] Kristian Nielsen
This Valgrind report may be relevant, it seems master1 is writing into free()'ed memory:

==15146== Invalid write of size 1
==15146==    at 0x866E9A4: NdbRecAttr::receive_data(unsigned const*, unsigned) (NdbRecAttr.cpp:131)
==15146==    by 0x8688A14: NdbEventOperationImpl::receive_data(NdbRecAttr*, unsigned const*, unsigned) (NdbEventOperationImpl.hpp:522)
==15146==    by 0x868298F: NdbEventOperationImpl::receive_event() (NdbEventOperationImpl.cpp:758)
==15146==    by 0x86839D1: NdbEventBuffer::nextEvent() (NdbEventOperationImpl.cpp:1144)
==15146==    by 0x8631727: Ndb::nextEvent() (Ndb.cpp:1329)
==15146==    by 0x839CE2E: ndb_binlog_thread_func (ha_ndbcluster_binlog.cc:3458)
==15146==    by 0x404BC36: pthread_start_thread (manager.c:310)
==15146==    by 0x41C32B9: clone (clone.S:119)
==15146==  Address 0x472F78A is 42 bytes inside a block of size 996 free'd
==15146==    at 0x401CF37: free (vg_replace_malloc.c:235)
==15146==    by 0x86DEE2E: _myfree (safemalloc.c:314)
==15146==    by 0x86DE187: free_root (my_alloc.c:347)
==15146==    by 0x829D070: closefrm(st_table*, bool) (table.cc:1604)
==15146==    by 0x839322D: ndbcluster_binlog_close_table(THD*, st_ndbcluster_share*) (ha_ndbcluster_binlog.cc:259)
==15146==    by 0x8395C30: ndb_handle_schema_change(THD*, Ndb*, NdbEventOperation*, st_ndbcluster_share*) (ha_ndbcluster_binlog.cc:1539)
==15146==    by 0x8399F57: ndb_binlog_thread_handle_non_data_event(THD*, Ndb*, NdbEventOperation*, Binlog_index_row&) (ha_ndbcluster_binlog.cc:2884)
==15146==    by 0x839D77B: ndb_binlog_thread_func (ha_ndbcluster_binlog.cc:3612)
==15146==    by 0x404BC36: pthread_start_thread (manager.c:310)
==15146==    by 0x41C32B9: clone (clone.S:119)
[18 May 2006 7:35] Kristian Nielsen
The write into free()'ed memory is when nextEvent() copies data into NdbRecAttr::theValue.

Note that the memory is freed in closefrm() calling free_root().

It seems that NdbRecAttr::theValue is allocated on the wrong memroot, being free()'ed while the pointer is still active.
[18 May 2006 8:40] Kristian Nielsen
I tried to comment out the free_root() in closefrm(), and the crash goes away (but a leek is introduced of course).

So the bug here clearly seems to be that NdbRecAttr.theValue is allocated on the table mem_root, and then when the table is closefrm()'ed the NdbRecAttr.theValue pointers are still around.

The fix must be either to kill the bad theValue pointers prior to closefrm(), or to allocate and free them with another mechanism than table->mem_root.
[18 May 2006 9:18] Jonas Oreland
testing again...
[19 May 2006 13:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6634
[19 May 2006 21:31] Tomas Ulin
reviewed by martin
[19 May 2006 21:32] Tomas Ulin
pushed to 5.1.11
[19 May 2006 23:39] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented fix in 5.1.11 changelog.