Bug #21448 | network outages can cause slave mysqld to core on reconnection | ||
---|---|---|---|
Submitted: | 4 Aug 2006 13:25 | Modified: | 15 Oct 2007 10:27 |
Reporter: | Jonathan Miller | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Row Based Replication ( RBR ) | Severity: | S1 (Critical) |
Version: | 5.1.12 | OS: | Linux (Linux) |
Assigned to: | Serge Kozlov | CPU Architecture: | Any |
[4 Aug 2006 13:25]
Jonathan Miller
[7 Aug 2006 19:38]
Jonathan Miller
Hi, I hit this one again today :-) This time I did a network outage on the master and then open the port backup. The slave cored with: #0 0x002cd402 in __kernel_vsyscall () #1 0x0046164f in pthread_kill () from /lib/libpthread.so.0 #2 0x0834d78f in write_core () #3 0x081f8ad6 in handle_segfault () #4 <signal handler called> #5 0x00357477 in memset () from /lib/libc.so.6 #6 0x081d9d30 in Field_string::unpack () #7 0x08299a63 in unpack_row () #8 0x0829be4c in Write_rows_log_event::do_prepare_row () #9 0x08299edd in Rows_log_event::exec_event () #10 0x0833ff77 in exec_relay_log_event () #11 0x08341afd in handle_slave_sql () #12 0x0045ebd4 in start_thread () from /lib/libpthread.so.0 #13 0x003b64fe in clone () from /lib/libc.so.6 Again, running resetslave.pl fixes what ever is causing the slave to core; Slave Epoch = 158929 Master bin log = n09.000003 Master Bin Log position = 155002 Slave has been updated. You may now start slave.
[16 Nov 2006 13:25]
Jonathan Miller
Hi Rafal, -> > perspective, to start working on the bug I need to do the following -> > preparations: -> > -> > - find out what TCP-B test is and where to get it. TPC-B is a database standard performance test. Our current version is a set of perl scripts that are located in the test-extra clone. -> > - install on some computer any infrastructure needed to run this test -> (I assume it is not normal mysql-test/ thing) Correct, This is not a MySQL Thing -> > - configure needed environment (described as "2 DN 2 Replica Master -> > replicating to 2DN 2 Replica Slave" in the bug report). So you would need at least 6 computers for this configuration. 3 for the master cluster and 3 for the slave cluster. -> > - find out how to "close the network port on slave" and when to do it -> so that the bug is trigerred. This is done through the web interface to the managed switch in Uppsala. If you could contact Pekka and get log on rights to ndbmaster as ndbdev, I would be happy to allow the use of my six systems to help resolve this. I can lead you through the steps. Not sure if this helps or not. Please let me know. In addition, I am adding this to the bug. Thanks! Jonathan Miller
[12 Dec 2006 18:35]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/16849 ChangeSet@1.2362, 2006-12-12 19:34:59+01:00, rafal@quant.(none) +4 -0 BUG#21448 (network outages can cause slave mysqld to core on reconnection): According to the analysis in the bug report, the failure can be caused by wrong event re-execution logic in the main event execution loop of the slave's SQL thread. This logic is currently contained inside exec_relay_log_event() function which is repeatedly called from the SQL thread. This patch cleans-up the event execution loop and associated logic. It does it by refactoring code so that exec_relay_log_event() function is concerned only with event execution and correct interpretation of execution errors, while other issues such as event reading, skipping and repetitions are handled inside the loop. The code was mostly moved from exec_relay_log_event() into the event execution loop but it was also restructured and cleaned-up. In particular the logic handling repeated execution of event groups upon transient errors is fixed so that the crash should (hopefully) not repeat.
[22 Jan 2007 10:09]
Guilhem Bichot
Rafal will conduct some more debugging to be sure.
[21 Feb 2007 0:59]
Jonathan Miller
Hi, I retested this on the latest rpl clone and I am not getting cores currently. I am getting issues with replication where the slave can not recover from the outage. 070221 1:55:44 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1073741824, event_type: 85 070221 1:55:44 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error 070221 1:55:44 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (y ou can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysql binlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0 070221 1:55:44 [ERROR] Slave: Unknown error, Error_code: 1105 070221 1:55:44 [Warning] Slave: Unknown error Error_code: 1105 070221 1:55:44 [Warning] Slave: Unknown error Error_code: 1105 070221 1:55:44 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE ST ART". We stopped at log 'ndb09-bin.000001' position 3952051 Even restarting the MySQLD did not solve the above. In addition, I did this same exact test for Tomas on Monday on the Telco clone and got a different issue. (See: http://bugs.mysql.com/24694) I closed 24694 and opened #26483 [NEW]: Slave MySQLD does not reconnect w/ slave cluster after network outage restored. http://bugs.mysql.com/id=26483 Thanks, /Jeb
[17 Apr 2007 14:14]
Jonathan Miller
Hi, Still not crashing and actually acting much better, but I was able to reproduce the slave's relay log corruption. Assigning back to Rafal per Lars, 070417 16:02:17 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1114112, event_type: 0 070417 16:02:17 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error 070417 16:02:17 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0 070417 16:02:17 [ERROR] Slave: Unknown error, Error_code: 1105 070417 16:02:17 [Warning] Slave: Unknown error Error_code: 1105 070417 16:02:17 [Warning] Slave: Unknown error Error_code: 1105 070417 16:02:17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'n09.000091' position 180679429070417 16:02:26 [Note] Slave SQL thread initialized, starting replication in log 'n09.000091' at position 180679429, relay log './ndb12-relay-bin.033685' position: 6187391 070417 16:02:27 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1114112, event_type: 0 070417 16:02:27 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error 070417 16:02:27 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0 070417 16:02:27 [ERROR] Slave: Unknown error, Error_code: 1105 070417 16:02:27 [Warning] Slave: Unknown error Error_code: 1105 070417 16:02:27 [Warning] Slave: Unknown error Error_code: 1105 070417 16:02:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'n09.000091' position 180679429
[11 Sep 2007 14:53]
Lars Thalmann
Now Chuck has setup a cluster replication and tried to detect this failure, but did not manage to find any problems. For the problems raised by Lars: 1. Problem of printing of binlog event probably do not exist, so no action needed. 2. Rafal will push his patch for BUG#25597, this will take care of proper printing of group positions in the error log. 3. Checksum will be implemented later, no action for this bug. 4. We think this has been fixed by Mats after this bug was created. We ask Jeb to reverify this again with the latest mysql-5.1-new-rpl code. And we also ask for a description of the setup.
[15 Oct 2007 10:27]
Serge Kozlov
Couldn't reproduce. Probably fixed.