Bug #24694 network outages can cause slave mysqld to core on reconnection when using NDB
Submitted: 29 Nov 2006 14:06 Modified: 19 Feb 2007 22:18
Reporter: Jonathan Miller
Status: Can't repeat
Category: MySQL Cluster: Replication Severity: S1 (Critical)
Version: 5.1.14 OS: Linux (Linux 32 Bit)
Assigned to: Jonathan Miller CPU Architecture:Any

[29 Nov 2006 14:06] Jonathan Miller
Description:
Hi,

While trying to get Rafal a core for #21448 [Com]: network outages can cause slave mysqld to core on reconnection, I got a core that is cluster-related.

-> Thanks again for the work you put into repeating the crash and 
-> collecting the evidence. It is going to be a great help.
-> 
-> It seems that we are hitting (at least) two issues here. One of them 
-> is some problem with NDB engine. This is what was hit in the most 
-> recent crash with the following stack:
-> 
-> #0  0x002cd402 in __kernel_vsyscall ()
-> (gdb) bt
-> #0  0x002cd402 in __kernel_vsyscall ()
-> #1  0x0046164f in pthread_kill () from /lib/libpthread.so.0
-> #2  0x0835d6b7 in write_core ()
-> #3  0x08214fe6 in handle_segfault ()
-> #4  <signal handler called>
-> #5  0x084af0f2 in NdbTransaction::getNdbOperation ()
-> #6  0x084af359 in NdbTransaction::getNdbOperation ()
-> #7  0x08302562 in ha_ndbcluster::write_row ()
-> #8  0x082ee875 in handler::ha_write_row ()
-> #9  0x082b8d16 in replace_record ()
-> #10 0x082b8d73 in Write_rows_log_event::do_exec_row ()
-> #11 0x082b68fd in Rows_log_event::exec_event ()
-> #12 0x0834cde9 in exec_relay_log_event ()
-> #13 0x0834d34c in handle_slave_sql ()
-> #14 0x0045ebd4 in start_thread () from /lib/libpthread.so.0
-> #15 0x003b64fe in clone () from /lib/libc.so.6
-> 
-> The two crashes you reported originally in the bug report were 
-> however about a different issue, more related to our replication 
-> code.
-> 
-> #0  0x002cd402 in __kernel_vsyscall ()
-> #1  0x0046164f in pthread_kill () from /lib/libpthread.so.0
-> #2  0x0834d78f in write_core ()
-> #3  0x081f8ad6 in handle_segfault ()
-> #4  <signal handler called>
-> #5  0x00357477 in memset () from /lib/libc.so.6
-> #6  0x081d9d30 in Field_string::unpack ()
-> #7  0x08299a63 in unpack_row ()
-> #8  0x0829be4c in Write_rows_log_event::do_prepare_row ()
-> #9  0x08299edd in Rows_log_event::exec_event ()
-> #10 0x0833ff77 in exec_relay_log_event ()
-> #11 0x08341afd in handle_slave_sql ()
-> #12 0x0045ebd4 in start_thread () from /lib/libpthread.so.0
-> #13 0x003b64fe in clone () from /lib/libc.so.6

I will save off the core and the mysqld symbols file.

Also, I noticed in my first couple of attempts that after restoring the connection, I almost always had to stop and restart the slave to get it to connect to the cluster again. This problem area may be where the bulk of the issues in the code can be found.

  

How to repeat:
See 21448

Suggested fix:
Don't crash ;-)