MySQL Bugs: #21448: network outages can cause slave mysqld to core on reconnection

Bug #21448	network outages can cause slave mysqld to core on reconnection
Submitted:	4 Aug 2006 13:25	Modified:	15 Oct 2007 10:27
Reporter:	Jonathan Miller	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S1 (Critical)
Version:	5.1.12	OS:	Linux (Linux)
Assigned to:	Serge Kozlov	CPU Architecture:	Any

Description:
Test: 2 DN 2 Replica Master replicating to 2DN 2 Replica Slave. TPC-B running. Close the network port on the slave MySQLD for about 2 minutes and open backup 

Running this test several times and I get different results each time. During this test when I tried to "select sum(balance) from account" I got a lost connection from mysql server.

mysql> select sum(balance) from account;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...

Then each time I tried to restart the slave, the mysqld would crash.

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: *** NONE ***

*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: ndb09.mysql.com
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 1
            Master_Log_File: ndb09.000003
        Read_Master_Log_Pos: 43240147
             Relay_Log_File: ndb12-relay-bin.000008
              Relay_Log_Pos: 644877
      Relay_Master_Log_File: ndb09.000003
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB: mysql
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 42737229
            Relay_Log_Space: 1148575
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: ndb09.mysql.com
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 1
            Master_Log_File: ndb09.000003
        Read_Master_Log_Pos: 43240147
             Relay_Log_File: ndb12-relay-bin.000008
              Relay_Log_Pos: 644877
      Relay_Master_Log_File: ndb09.000003
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB: mysql
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 42737229
            Relay_Log_Space: 1148575
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: *** NONE ***

*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: ndb09.mysql.com
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 1
            Master_Log_File: ndb09.000003
        Read_Master_Log_Pos: 43240147
             Relay_Log_File: ndb12-relay-bin.000008
              Relay_Log_Pos: 644877
      Relay_Master_Log_File: ndb09.000003
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB: mysql
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 42737229
            Relay_Log_Space: 1148965
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.01 sec)

#0  0x002cd402 in __kernel_vsyscall ()
#1  0x0046164f in pthread_kill () from /lib/libpthread.so.0
#2  0x0834d78f in write_core ()
#3  0x081f8ad6 in handle_segfault ()
#4  <signal handler called>
#5  0x00357477 in memset () from /lib/libc.so.6
#6  0x081d9d30 in Field_string::unpack ()
#7  0x08299a63 in unpack_row ()
#8  0x0829be4c in Write_rows_log_event::do_prepare_row ()
#9  0x08299edd in Rows_log_event::exec_event ()
#10 0x0833ff77 in exec_relay_log_event ()
#11 0x08341afd in handle_slave_sql ()
#12 0x0045ebd4 in start_thread () from /lib/libpthread.so.0
#13 0x003b64fe in clone () from /lib/libc.so.6

0x81f89e6 handle_segfault + 438
0x2cd420 (?)
0x9a84c80 _end + 21070440
0x829be4c _ZN20Write_rows_log_event14do_prepare_rowEP3THDP8st_tablePKc + 42
0x8299edd _ZN14Rows_log_event10exec_eventEP17st_relay_log_info + 1027
0x833ff77 _Z20exec_relay_log_eventP3THDP17st_relay_log_info + 839
0x8341afd handle_slave_sql + 779
0x45ebd4 (?)
0x3b64fe (?)

0x81f89e6
0x2cd420
0x9a84c80
0x829be4c
0x8299edd
0x833ff77
0x8341afd
0x45ebd4
0x3b64fe
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at (nil)  is invalid pointer
thd->thread_id=5

How to repeat:
I have not been able to get back to this situation just yet. I am hitting other issue. The only way I was able to recover and move past this was by following the:
Implementing Failover with MySQL Cluster
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-failover.html

Once I ran this Failover procedure the slave started and finished processing from where it had left off from the master when the network was shutdown.

Hi,

I hit this one again today :-)

This time I did a network outage on the master and then open the port backup. The slave cored with:
#0  0x002cd402 in __kernel_vsyscall ()
#1  0x0046164f in pthread_kill () from /lib/libpthread.so.0
#2  0x0834d78f in write_core ()
#3  0x081f8ad6 in handle_segfault ()
#4  <signal handler called>
#5  0x00357477 in memset () from /lib/libc.so.6
#6  0x081d9d30 in Field_string::unpack ()
#7  0x08299a63 in unpack_row ()
#8  0x0829be4c in Write_rows_log_event::do_prepare_row ()
#9  0x08299edd in Rows_log_event::exec_event ()
#10 0x0833ff77 in exec_relay_log_event ()
#11 0x08341afd in handle_slave_sql ()
#12 0x0045ebd4 in start_thread () from /lib/libpthread.so.0
#13 0x003b64fe in clone () from /lib/libc.so.6

Again, running resetslave.pl fixes what ever is causing the slave to core;

        Slave Epoch =  158929
        Master bin log =  n09.000003
        Master Bin Log position =  155002
        Slave has been updated. You may now start slave.

Hi Rafal,

-> > perspective, to start working on the bug I need to do the following
-> > preparations:
-> >
-> > - find out what TCP-B test is and where to get it.

TPC-B is a database standard performance test. Our current version is a set of perl scripts that are located in the test-extra clone.

-> > - install on some computer any infrastructure needed to run this test
-> (I assume it is not normal mysql-test/ thing)

Correct, This is not a MySQL Thing

-> > - configure needed environment (described as "2 DN 2 Replica Master
-> > replicating to 2DN 2 Replica Slave" in the bug report).

So you would need at least 6 computers for this configuration. 3 for the master cluster and 3 for the slave cluster.

-> > - find out how to "close the network port on slave" and when to do it
-> so that the bug is trigerred.

This is done through the web interface to the managed switch in Uppsala. 

If you could contact Pekka and get log on rights to ndbmaster as ndbdev, I would be happy to allow the use of my six systems to help resolve this. I can lead you through the steps.

Not sure if this helps or not. Please let me know.

In addition, I am adding this to the bug.

Thanks!

Jonathan Miller

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16849

ChangeSet@1.2362, 2006-12-12 19:34:59+01:00, rafal@quant.(none) +4 -0
  BUG#21448 (network outages can cause slave mysqld to core on reconnection):
  
  According to the analysis in the bug report, the failure can be caused by 
  wrong event re-execution logic in the main event execution loop of the slave's 
  SQL thread. This logic is currently contained inside exec_relay_log_event() 
  function which is repeatedly called from the SQL thread.
  
  This patch cleans-up the event execution loop and associated logic. It does it 
  by refactoring code so that exec_relay_log_event() function is concerned only 
  with event execution and correct interpretation of execution errors, while other 
  issues such as event reading, skipping and repetitions are handled inside the 
  loop.
  
  The code was mostly moved from exec_relay_log_event() into the event execution
  loop but it was also restructured and cleaned-up. In particular the logic 
  handling repeated execution of event groups upon transient errors is fixed so
  that the crash should (hopefully) not repeat.

Rafal will conduct some more debugging to be sure.

Hi,

I retested this on the latest rpl clone and I am not getting cores currently. I am getting issues with replication where the slave can not recover from the outage.

070221  1:55:44 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1073741824, event_type: 85
070221  1:55:44 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
070221  1:55:44 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (y
ou can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysql
binlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary
log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
070221  1:55:44 [ERROR] Slave: Unknown error, Error_code: 1105
070221  1:55:44 [Warning] Slave: Unknown error Error_code: 1105
070221  1:55:44 [Warning] Slave: Unknown error Error_code: 1105
070221  1:55:44 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE ST
ART". We stopped at log 'ndb09-bin.000001' position 3952051

Even restarting the MySQLD did not solve the above.

In addition, I did this same exact test for Tomas on Monday on the Telco clone and got a different issue. (See: http://bugs.mysql.com/24694) I closed 24694 and opened #26483 [NEW]: Slave MySQLD does not reconnect  w/ slave cluster after network outage restored.  http://bugs.mysql.com/id=26483

Thanks,
/Jeb

Hi,

Still not crashing and actually acting much better, but I was able to reproduce  the slave's relay log corruption. Assigning back to Rafal per Lars,

070417 16:02:17 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1114112, event_type: 0
070417 16:02:17 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
070417 16:02:17 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
070417 16:02:17 [ERROR] Slave: Unknown error, Error_code: 1105
070417 16:02:17 [Warning] Slave: Unknown error Error_code: 1105
070417 16:02:17 [Warning] Slave: Unknown error Error_code: 1105
070417 16:02:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'n09.000091' position 180679429070417 16:02:26 [Note] Slave SQL thread initialized, starting replication in log 'n09.000091' at position 180679429, relay log './ndb12-relay-bin.033685' position: 6187391
070417 16:02:27 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1114112, event_type: 0
070417 16:02:27 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
070417 16:02:27 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
070417 16:02:27 [ERROR] Slave: Unknown error, Error_code: 1105
070417 16:02:27 [Warning] Slave: Unknown error Error_code: 1105
070417 16:02:27 [Warning] Slave: Unknown error Error_code: 1105
070417 16:02:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'n09.000091' position 180679429

Now Chuck has setup a cluster replication and tried to detect this
failure, but did not manage to find any problems.

For the problems raised by Lars:

  1. Problem of printing of binlog event probably do not exist, 
     so no action needed.
  2. Rafal will push his patch for BUG#25597, this will take care
     of proper printing of group positions in the error log.
  3. Checksum will be implemented later, no action for this bug.
  4. We think this has been fixed by Mats after this bug was created.

We ask Jeb to reverify this again with the latest mysql-5.1-new-rpl
code.  And we also ask for a description of the setup.

Couldn't reproduce. Probably fixed.