Bug #4629 Crash after SLAVE STOP, if the IO thread is in special state.
Submitted: 19 Jul 2004 16:39 Modified: 24 Aug 2004 3:21
Reporter: Jozef Vesely Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:4.1.3 OS:Microsoft Windows (win2000)
Assigned to: Reggie Burnett

[19 Jul 2004 16:39] Jozef Vesely
Description:
When the IO replication thread is in "Waiting to reconnect after a failed master event read" state, and I issue the SLAVE  STOP; command mysqld crashes.

Problem is repeatable. If you need more information, I will be happy to provide it.

Backtrace:
dddddddd()
THD::close_active_vio() line 671 + 25 bytes
THD::awake(int 0) line 427
terminate_slave_thread(THD * 0x01477128, _RTL_CRITICAL_SECTION * 0x00000000, _RTL_CRITICAL_SECTION * 0x0147c550, pthread_cond_t * 0x0147c578, volatile int * 0x0147dc28) line 568 + 26 bytes
terminate_slave_threads(st_master_info * 0x0147b818, int 3, int 1) line 526 + 42 bytes
stop_slave(THD * 0x01479950, st_master_info * 0x0147b818, int 1) line 809 + 15 bytes
mysql_execute_command(THD * 0x01479950) line 2439 + 17 bytes
mysql_parse(THD * 0x01479950, char * 0x014b3428, unsigned int 10) line 4034 + 9 bytes
dispatch_command(enum_server_command COM_QUERY, THD * 0x01479950, char * 0x014ab349, unsigned int 11) line 1457 + 29 bytes
do_command(THD * 0x01479950) line 1272 + 49 bytes
handle_one_connection(void * 0x01479950) line 1016 + 9 bytes
pthread_start(void * 0x00dea830) line 63 + 7 bytes
_threadstart(void * 0x00dec650) line 173 + 13 bytes
KERNEL32! 7c57b382()

How to repeat:
1. configure replication between two machines:

on A (master):
STOP SLAVE;
SHOW SLAVE STATUS;
CHANGE MASTER TO
  MASTER_HOST='',
  MASTER_PORT=0,
  MASTER_USER='',
  MASTER_PASSWORD='';
RESET MASTER;

on B(slave):
RESET MASTER;
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='A',
  MASTER_PORT=3306,
  MASTER_USER='replicator',
  MASTER_PASSWORD='**********';
START SLAVE;

2. kill the mysqld on machine A, check that IO thread on B is in "Waiting to reconnect after a failed master event read" state, then issue STOP SLAVE; on B

3. mysqld on B crashes
[20 Jul 2004 11:19] Guilhem Bichot
Hi,

It does not happen on Linux. Must be specific of our Windows MySQL version.
Are you using our binaries (which one?) or self-compiled?
What does the debugger say about the nature of the crash (thanks for the backtrace you provided): segmentation fault, assertion failure, ... ?
Anything in the error log (.err file?)
Thank you!
[20 Jul 2004 16:02] Jozef Vesely
Hi

I disovered it with 4.1.1a binary version
mysqld.exe prints version: 4.1.1a-alpha-max-debug-log

I simplified steps to reproduce a bit:

Start with clean install on both master and slave.

On master only create the 'replicator' user

On slave:

CHANGE MASTER TO 
  MASTER_HOST='dar.ttc.sk', 
  MASTER_USER='replicator', 
  MASTER_PASSWORD='******';
START SLAVE;
SHOW SLAVE STATUS\G
-- Slave_IO_State: Waiting for master to send event
-- Now shut down master
SHOW SLAVE STATUS\G
-- Slave_IO_State: Reconnecting after a failed master event read
STOP SLAVE;

Slave crashes with:
"The instruction at "0xdddddddd" referenced memory at "0xdddddddd". The memory could not be read."

After crash VC says:
"Unhandled exception in mysqld.exe: 0xC0000005: Access Violation."

0xdddddddd is sometimes different.

The CHANGE MASTER TO MASTER_HOST='', MASTER_USER='', MASTER_PASSWORD=''; line which seemed suspicious to me is 
not needed for crash.

If I stop and start the slave again (replication restarts automaticaly)
and then stop master and issue STOP SLAVE; slave crashes also.

I see nothing interesting in the log:

D:\jove\clean_4.1.1a>bin\mysqld.exe --port 3306 --console --log-warnings --server-id=2 --log-bin --master-connect-retry=5 --net_read_timeout=5 --net_write_timeout=5
040720 14:02:43  InnoDB: Started; log sequence number 0 43634
bin\mysqld.exe: ready for connections.
Version: '4.1.1a-alpha-max-debug-log'  socket: ''  port: 3306
0<40720 14:03:35  Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log '.\pob2-relay-bin.000001' position: 4
040720 14:03:35  Slave I/O thread: connected to master 'replicator@dar.ttc.sk:3306',  replication started in log 'FIRST' at position 4
040720 14:03:55  Slave: received end packet from server, apparent master shutdown:
040720 14:03:55  Slave I/O thread: Failed reading log event, reconnecting to retry, log 'DAR-bin.000001' position 79
040720 14:03:56  Slave I/O thread: error reconnecting to master 'replicator@dar.ttc.sk:3306': Error: 'Can't connect to MySQL server on 'dar.ttc.sk' (10061)'  errno: 2003  retry-time: 5  retries: 86400
040720 14:04:07  Slave I/O thread killed during or after a reconnect done to recover from failed read
040720 14:04:07  Slave I/O thread exiting, read up to log 'DAR-bin.000001', position 79

Later I compiled 4.1.3b from source (MSVC++ 6.0 compiler) (Win32 Debug configuration)

I got error while build:
D:\jove\clean_4.1.3b\sql\item.cpp(917) : error C2520: conversion from unsigned __int64 to double not implemented, use signed __int64

and "corrected" it:
-    return (double) TIME_to_ulonglong(&value.time);
+    return (double) (signed __int64) TIME_to_ulonglong(&value.time);

but I don't think it influences crashes I experience.

With this compiled 4.1.3b I get the same behaviour.
[20 Jul 2004 16:23] Guilhem Bichot
Hi,
Thanks a lot for the good feedback!
Miguel, I am assigning it this to you for now (it happens on Windows, not on Linux); could you please try to repeat the crash by following the steps explained by Jozef in this bug report?
Thank you!
Guilhem
[20 Jul 2004 17:42] Jozef Vesely
I found some time to debug...

Crash occures in sql\sql_class.cpp in THD::close_active_vio() around line 664
as back trace showed:

  if (active_vio)
  {
    vio_close(active_vio);
    active_vio = 0;
  }

At first glance I did not notice that vio_close is a 
macro defined in include\violite.h around line 148:

#define vio_close(vio)				((vio)->vioclose)(vio)

However "active_vio" points to some garbage (chunks of same value repeated in memory, over multiple fields of that struct)

In those few cases when slave does not crash active_vio was NULL.
[21 Jul 2004 1:07] Miguel Solorzano
Yes I was able to repeat with servers 4.1.4:

/sql_class.cc
--673--

void THD::close_active_vio()
{
  DBUG_ENTER("close_active_vio");
  safe_mutex_assert_owner(&LOCK_delete); 
#ifndef EMBEDDED_LIBRARY
  if (active_vio)
  {
    vio_close(active_vio);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
crash here
    active_vio = 0;
  }
#endif
  DBUG_VOID_RETURN;
}

CALL STACK

mysqld.exe!THD::close_active_vio()  Line 673 + 0x19	C++
mysqld.exe!THD::awake(int prepare_to_die=0)  Line 427	C++
mysqld.exe!terminate_slave_thread(THD * thd=0x00e50190, _RTL_CRITICAL_SECTION * term_lock=0x00000000, _RTL_CRITICAL_SECTION * cond_lock=0x00e568e8, pthread_cond_t * term_cond=0x00e56910, volatile int * slave_running=0x00e57fc0)  Line 568 + 0x1a	C++
mysqld.exe!terminate_slave_threads(st_master_info * mi=0x00e55bb0, int thread_mask=3, int skip_lock=1)  Line 526 + 0x2a	C++
mysqld.exe!stop_slave(THD * thd=0x02ec1040, st_master_info * mi=0x00e55bb0, int net_report=1)  Line 809 + 0xf	C++
mysqld.exe!mysql_execute_command(THD * thd=0x02ec1040)  Line 2439 + 0x11	C++
mysqld.exe!mysql_parse(THD * thd=0x02ec1040, char * inBuf=0x02eb1060, unsigned int length=10)  Line 4034 + 0x9	C++
mysqld.exe!dispatch_command(enum_server_command command=COM_QUERY, THD * thd=0x02ec1040, char * packet=0x02ec2389, unsigned int packet_length=11)  Line 1457 + 0x1d	C++
mysqld.exe!do_command(THD * thd=0x02ec1040)  Line 1272 + 0x31	C++
mysqld.exe!handle_one_connection(void * arg=0x02ec1040)  Line 1016 + 0x9	C++
mysqld.exe!pthread_start(void * param=0x00e5df30)  Line 63 + 0x7	C
mysqld.exe!_threadstart(void * ptd=0x00e581a8)  Line 173 + 0xd	C
kernel32.dll!77e6d33b() 	

Thank you for the bug report
[22 Jul 2004 10:09] Jozef Vesely
I have found the cause of the problem:

firstly:
--------
in: connect_to_master() file: slave\slave.cpp line: 3950

#ifdef SIGNAL_WITH_VIO_CLOSE
thd->set_active_vio(mysql->net.vio);
  // it does:  thd->active_vio = mysql->net.vio; with locking
#endif

later:
-----
in: end_server() file: slave\client.cpp line: 821
  vio_delete frees mysql->net.vio
  and mysql->net.vio is set to 0

thus:
----
thd->active_vio is set to freed pointer, what
causes problems when we try to access its
members ...

note:
----
SIGNAL_WITH_VIO_CLOSE only is defined in
include\config_netware.h and include\config_win.h

what explains why problem does not occur on linux.

Currently I am thinking of the best sollution, probably to call
thd->clear_active_vio() after each end_server() or include it somehow
in end_server() itself (that would require to add thd parameter to it which 
would break many things :-( )
[22 Jul 2004 10:42] Guilhem Bichot
Jozef!
I read your comments and I am just impressed; you saved us a couple of hours of work probably.
Thank you a real lot; we're going to think about the best code change to have active_vio not garbage. And we'll also check if 4.0 could be affected.
I'll post a patch here.
[22 Jul 2004 15:19] Miguel Solorzano
Yes SIGNAL_WITH_VIO_CLOSE was introduced only on Windows because it not
supports signals like Unix, this fixed a connection hanging bug and now the
problem is the show slave status with the active vio.

Thanks
[23 Jul 2004 18:40] Guilhem Bichot
Precision to Jozef's fine analysis: when he says:
>thus:
>----
>thd->active_vio is set to freed pointer, what
>causes problems when we try to access its
>members ...

The access to the members of the freed thd->active_vio happens in THD::awake() which is called by terminate_slave_thread() which is called by STOP SLAVE, which was how the bug was noticed.
I'd suggest to call thd->clear_active_vio() *before* each end_server(). If we call after, there is a very short moment where active_vio points to freed memory and STOP SLAVE can occur and crash. If we call it before, then there is a very short moment where active_vio is 0 but slave is still connected; then there is no crash but we may not react to STOP SLAVE, which is less bad than a crash. The perfect solution would be to do the free() and set mysql->net.vio and thd->active_vio both to 0 atomically (using LOCK_delete).
[21 Aug 2004 5:07] Reggie Burnett
I have submitted a patch for this.
[24 Aug 2004 3:21] Reggie Burnett
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

This patch was reviewed and approved by Guilhem