Bug #4629 | Crash after SLAVE STOP, if the IO thread is in special state. | ||
---|---|---|---|
Submitted: | 19 Jul 2004 16:39 | Modified: | 24 Aug 2004 3:21 |
Reporter: | Jozef Vesely | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.1.3 | OS: | Windows (win2000) |
Assigned to: | Reggie Burnett | CPU Architecture: | Any |
[19 Jul 2004 16:39]
Jozef Vesely
[20 Jul 2004 11:19]
Guilhem Bichot
Hi, It does not happen on Linux. Must be specific of our Windows MySQL version. Are you using our binaries (which one?) or self-compiled? What does the debugger say about the nature of the crash (thanks for the backtrace you provided): segmentation fault, assertion failure, ... ? Anything in the error log (.err file?) Thank you!
[20 Jul 2004 16:02]
Jozef Vesely
Hi I disovered it with 4.1.1a binary version mysqld.exe prints version: 4.1.1a-alpha-max-debug-log I simplified steps to reproduce a bit: Start with clean install on both master and slave. On master only create the 'replicator' user On slave: CHANGE MASTER TO MASTER_HOST='dar.ttc.sk', MASTER_USER='replicator', MASTER_PASSWORD='******'; START SLAVE; SHOW SLAVE STATUS\G -- Slave_IO_State: Waiting for master to send event -- Now shut down master SHOW SLAVE STATUS\G -- Slave_IO_State: Reconnecting after a failed master event read STOP SLAVE; Slave crashes with: "The instruction at "0xdddddddd" referenced memory at "0xdddddddd". The memory could not be read." After crash VC says: "Unhandled exception in mysqld.exe: 0xC0000005: Access Violation." 0xdddddddd is sometimes different. The CHANGE MASTER TO MASTER_HOST='', MASTER_USER='', MASTER_PASSWORD=''; line which seemed suspicious to me is not needed for crash. If I stop and start the slave again (replication restarts automaticaly) and then stop master and issue STOP SLAVE; slave crashes also. I see nothing interesting in the log: D:\jove\clean_4.1.1a>bin\mysqld.exe --port 3306 --console --log-warnings --server-id=2 --log-bin --master-connect-retry=5 --net_read_timeout=5 --net_write_timeout=5 040720 14:02:43 InnoDB: Started; log sequence number 0 43634 bin\mysqld.exe: ready for connections. Version: '4.1.1a-alpha-max-debug-log' socket: '' port: 3306 0<40720 14:03:35 Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log '.\pob2-relay-bin.000001' position: 4 040720 14:03:35 Slave I/O thread: connected to master 'replicator@dar.ttc.sk:3306', replication started in log 'FIRST' at position 4 040720 14:03:55 Slave: received end packet from server, apparent master shutdown: 040720 14:03:55 Slave I/O thread: Failed reading log event, reconnecting to retry, log 'DAR-bin.000001' position 79 040720 14:03:56 Slave I/O thread: error reconnecting to master 'replicator@dar.ttc.sk:3306': Error: 'Can't connect to MySQL server on 'dar.ttc.sk' (10061)' errno: 2003 retry-time: 5 retries: 86400 040720 14:04:07 Slave I/O thread killed during or after a reconnect done to recover from failed read 040720 14:04:07 Slave I/O thread exiting, read up to log 'DAR-bin.000001', position 79 Later I compiled 4.1.3b from source (MSVC++ 6.0 compiler) (Win32 Debug configuration) I got error while build: D:\jove\clean_4.1.3b\sql\item.cpp(917) : error C2520: conversion from unsigned __int64 to double not implemented, use signed __int64 and "corrected" it: - return (double) TIME_to_ulonglong(&value.time); + return (double) (signed __int64) TIME_to_ulonglong(&value.time); but I don't think it influences crashes I experience. With this compiled 4.1.3b I get the same behaviour.
[20 Jul 2004 16:23]
Guilhem Bichot
Hi, Thanks a lot for the good feedback! Miguel, I am assigning it this to you for now (it happens on Windows, not on Linux); could you please try to repeat the crash by following the steps explained by Jozef in this bug report? Thank you! Guilhem
[20 Jul 2004 17:42]
Jozef Vesely
I found some time to debug... Crash occures in sql\sql_class.cpp in THD::close_active_vio() around line 664 as back trace showed: if (active_vio) { vio_close(active_vio); active_vio = 0; } At first glance I did not notice that vio_close is a macro defined in include\violite.h around line 148: #define vio_close(vio) ((vio)->vioclose)(vio) However "active_vio" points to some garbage (chunks of same value repeated in memory, over multiple fields of that struct) In those few cases when slave does not crash active_vio was NULL.
[21 Jul 2004 1:07]
MySQL Verification Team
Yes I was able to repeat with servers 4.1.4: /sql_class.cc --673-- void THD::close_active_vio() { DBUG_ENTER("close_active_vio"); safe_mutex_assert_owner(&LOCK_delete); #ifndef EMBEDDED_LIBRARY if (active_vio) { vio_close(active_vio); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ crash here active_vio = 0; } #endif DBUG_VOID_RETURN; } CALL STACK mysqld.exe!THD::close_active_vio() Line 673 + 0x19 C++ mysqld.exe!THD::awake(int prepare_to_die=0) Line 427 C++ mysqld.exe!terminate_slave_thread(THD * thd=0x00e50190, _RTL_CRITICAL_SECTION * term_lock=0x00000000, _RTL_CRITICAL_SECTION * cond_lock=0x00e568e8, pthread_cond_t * term_cond=0x00e56910, volatile int * slave_running=0x00e57fc0) Line 568 + 0x1a C++ mysqld.exe!terminate_slave_threads(st_master_info * mi=0x00e55bb0, int thread_mask=3, int skip_lock=1) Line 526 + 0x2a C++ mysqld.exe!stop_slave(THD * thd=0x02ec1040, st_master_info * mi=0x00e55bb0, int net_report=1) Line 809 + 0xf C++ mysqld.exe!mysql_execute_command(THD * thd=0x02ec1040) Line 2439 + 0x11 C++ mysqld.exe!mysql_parse(THD * thd=0x02ec1040, char * inBuf=0x02eb1060, unsigned int length=10) Line 4034 + 0x9 C++ mysqld.exe!dispatch_command(enum_server_command command=COM_QUERY, THD * thd=0x02ec1040, char * packet=0x02ec2389, unsigned int packet_length=11) Line 1457 + 0x1d C++ mysqld.exe!do_command(THD * thd=0x02ec1040) Line 1272 + 0x31 C++ mysqld.exe!handle_one_connection(void * arg=0x02ec1040) Line 1016 + 0x9 C++ mysqld.exe!pthread_start(void * param=0x00e5df30) Line 63 + 0x7 C mysqld.exe!_threadstart(void * ptd=0x00e581a8) Line 173 + 0xd C kernel32.dll!77e6d33b() Thank you for the bug report
[22 Jul 2004 10:09]
Jozef Vesely
I have found the cause of the problem: firstly: -------- in: connect_to_master() file: slave\slave.cpp line: 3950 #ifdef SIGNAL_WITH_VIO_CLOSE thd->set_active_vio(mysql->net.vio); // it does: thd->active_vio = mysql->net.vio; with locking #endif later: ----- in: end_server() file: slave\client.cpp line: 821 vio_delete frees mysql->net.vio and mysql->net.vio is set to 0 thus: ---- thd->active_vio is set to freed pointer, what causes problems when we try to access its members ... note: ---- SIGNAL_WITH_VIO_CLOSE only is defined in include\config_netware.h and include\config_win.h what explains why problem does not occur on linux. Currently I am thinking of the best sollution, probably to call thd->clear_active_vio() after each end_server() or include it somehow in end_server() itself (that would require to add thd parameter to it which would break many things :-( )
[22 Jul 2004 10:42]
Guilhem Bichot
Jozef! I read your comments and I am just impressed; you saved us a couple of hours of work probably. Thank you a real lot; we're going to think about the best code change to have active_vio not garbage. And we'll also check if 4.0 could be affected. I'll post a patch here.
[22 Jul 2004 15:19]
MySQL Verification Team
Yes SIGNAL_WITH_VIO_CLOSE was introduced only on Windows because it not supports signals like Unix, this fixed a connection hanging bug and now the problem is the show slave status with the active vio. Thanks
[23 Jul 2004 18:40]
Guilhem Bichot
Precision to Jozef's fine analysis: when he says: >thus: >---- >thd->active_vio is set to freed pointer, what >causes problems when we try to access its >members ... The access to the members of the freed thd->active_vio happens in THD::awake() which is called by terminate_slave_thread() which is called by STOP SLAVE, which was how the bug was noticed. I'd suggest to call thd->clear_active_vio() *before* each end_server(). If we call after, there is a very short moment where active_vio points to freed memory and STOP SLAVE can occur and crash. If we call it before, then there is a very short moment where active_vio is 0 but slave is still connected; then there is no crash but we may not react to STOP SLAVE, which is less bad than a crash. The perfect solution would be to do the free() and set mysql->net.vio and thd->active_vio both to 0 atomically (using LOCK_delete).
[21 Aug 2004 5:07]
Reggie Burnett
I have submitted a patch for this.
[24 Aug 2004 3:21]
Reggie Burnett
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bugfix, yourself. More information about accessing the source trees is available at http://www.mysql.com/doc/en/Installing_source_tree.html Additional info: This patch was reviewed and approved by Guilhem