Bug #58546 test rpl_packet timeout failure sporadically on PB
Submitted: 29 Nov 2010 4:36 Modified: 28 Jan 2011 12:21
Reporter: Zhenxing He Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.1 OS:Any
Assigned to: Libing Song CPU Architecture:Any
Tags: sporadic, timeout
Triage: Triaged: D3 (Medium)

[29 Nov 2010 4:36] Zhenxing He
Description:
rpl_packet.test fails sporadically due to timeout on PB with the following backtrace:

Program terminated with signal 6, Aborted.
[New process 19147]
[New process 19168]
[New process 19165]
[New process 19150]
#0  0x00007fd221664c36 in pthread_kill () from /lib/libpthread.so.0
#0  0x00007fd221664c36 in pthread_kill () from /lib/libpthread.so.0
#1  0x00000000006300d7 in handle_segfault (sig=6) at mysqld.cc:2607
#2  <signal handler called>
#3  0x00007fd220af7ce2 in select () from /lib/libc.so.6
#4  0x000000000062dfcd in handle_connections_sockets (
    arg=<value optimized out>) at mysqld.cc:5098
#5  0x00000000006329dc in main (argc=<value optimized out>, 
    argv=<value optimized out>) at mysqld.cc:4582

Thread 4 (process 19150):
#0  0x00007fd221667797 in do_sigwait () from /lib/libpthread.so.0
#1  0x00007fd22166783d in sigwait () from /lib/libpthread.so.0
#2  0x000000000062fb63 in signal_hand (arg=<value optimized out>)
    at mysqld.cc:2801
#3  0x00007fd22165ffc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007fd220afe5ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 3 (process 19165):
#0  0x00007fd221663fad in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1  0x00000000007490a4 in terminate_slave_thread (thd=0x7fd21c000970, 
    term_lock=0x1d45bb8, term_cond=0x1d45c10, slave_running=0x1d46344, 
    skip_lock=146) at slave.cc:535
#2  0x00000000007491ac in terminate_slave_threads (mi=0x1d434a0, 
    thread_mask=2, skip_lock=true) at slave.cc:426
#3  0x000000000075000b in stop_slave (thd=0x1d4eaa0, mi=0x1d434a0, 
    net_report=true) at sql_repl.cc:968
#4  0x000000000063d318 in mysql_execute_command (thd=0x1d4eaa0)
    at sql_parse.cc:2888
#5  0x000000000063fe6b in mysql_parse (thd=0x1d4eaa0, 
    rawbuf=0x1d57d80 "STOP SLAVE", length=10, found_semicolon=0x41d6a070)
    at sql_parse.cc:6068
#6  0x0000000000640f92 in dispatch_command (command=COM_QUERY, thd=0x1d4eaa0, 
    packet=0x1d24581 "STOP SLAVE", packet_length=<value optimized out>)
    at sql_parse.cc:1261
#7  0x0000000000642176 in do_command (thd=0x1d4eaa0) at sql_parse.cc:889
#8  0x0000000000635f26 in handle_one_connection (arg=<value optimized out>)
    at sql_connect.cc:1149
#9  0x00007fd22165ffc7 in start_thread () from /lib/libpthread.so.0
#10 0x00007fd220afe5ad in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 2 (process 19168):
#0  0x00007fd221663d29 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1  0x00000000006bbfe7 in MYSQL_BIN_LOG::wait_for_update (
    this=<value optimized out>, thd=0x7fd21c000970, is_slave=143)
    at log.cc:4916
#2  0x000000000074b482 in handle_slave_sql (arg=<value optimized out>)
    at slave.cc:4279
#3  0x00007fd22165ffc7 in start_thread () from /lib/libpthread.so.0
#4  0x00007fd220afe5ad in clone () from /lib/libc.so.6
#5  0x0000000000000000 in ?? ()

Thread 1 (process 19147):
#0  0x00007fd221664c36 in pthread_kill () from /lib/libpthread.so.0
#1  0x00000000006300d7 in handle_segfault (sig=6) at mysqld.cc:2607
#2  <signal handler called>
#3  0x00007fd220af7ce2 in select () from /lib/libc.so.6
#4  0x000000000062dfcd in handle_connections_sockets (
    arg=<value optimized out>) at mysqld.cc:5098
#5  0x00000000006329dc in main (argc=<value optimized out>, 
    argv=<value optimized out>) at mysqld.cc:4582

How to repeat:
Check:

http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&search=yes&push_id=17805...

And also:
http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_output_id=daeea558b...
[1 Dec 2010 3:11] Zhenxing He
Probably related to BUG#43460
[17 Dec 2010 10:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127165

3525 Li-Bing.Song@sun.com	2010-12-17
      Bug#58546 test rpl_packet timeout failure sporadically on PB
      
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
      
      After this patch, SQL thread is always stopped first. It guarantees
      IO thread stops just after the transaction which cannot be
      rolled back safely and is being replied by SQL thread
      has been replicated completely.
[22 Dec 2010 15:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127515

3534 Li-Bing.Song@sun.com	2010-12-22
      Bug#58546 test rpl_packet timeout failure sporadically on PB
      
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
      
      After this patch, SQL thread is always stopped first. It guarantees
      IO thread stops just after the transaction which cannot be
      rolled back safely and is being replied by SQL thread
      has been replicated completely.
      and added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[24 Dec 2010 3:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127574

3534 Li-Bing.Song@sun.com	2010-12-24
      Bug#58546 test rpl_packet timeout failure sporadically on PB
      
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
      
      After this patch, SQL thread is always stopped first. It guarantees
      IO thread stops just after the transaction which cannot be
      rolled back safely and is being replied by SQL thread
      has been replicated completely.
      and added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[24 Dec 2010 4:26] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127577

3534 Li-Bing.Song@sun.com	2010-12-24
      Bug#58546 test rpl_packet timeout failure sporadically on PB
      
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
      
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
      
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[17 Jan 2011 7:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/128890

3555 anders.song@greatopensource.com	2011-01-17
      Bug#58546 test rpl_packet timeout failure sporadically on PB
            
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
            
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
            
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[18 Jan 2011 12:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129080

3559 Li-Bing.Song@sun.com	2011-01-18
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[18 Jan 2011 12:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129084

3560 anders.song@greatopensource.com	2011-01-18
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[18 Jan 2011 12:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129085

3559 Anders.Song@greatopensource.com	2011-01-18
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[18 Jan 2011 12:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129086

3560 anders.song@greatopensource.com	2011-01-18
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[18 Jan 2011 12:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129089

3560 anders.song@greatopensource.com	2011-01-18
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[19 Jan 2011 1:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129116

3560 Libing Song	2011-01-19
      Bug#58546 test rpl_packet timeout failure sporadically on PB
                  
      rpl_packet got a timeout failure sporadically on PB when stopping
      slave. The real reason of this bug is that STOP SLAVE stopped
      IO thread first and then stopped SQL thread. It was
      possible that IO thread stopped after replicating part of a
      transaction which SQL thread was executing. SQL thread would
      be hung if the transaction could not be rolled back safely.
                  
      After this patch, STOP SLAVE will stop SQL thread first and then stop IO
      thread, which guarantees that IO thread will fetch the reset of the
      events of the transaction that SQL thread is executing, so that SQL
      thread can finish the transaction if it cannot be rolled back safely.
                  
      Added below auxiliary files to make the test code neater.
      restart_slave_sql.inc
      rpl_connection_master.inc
      rpl_connection_slave.inc
      rpl_connection_slave1.inc
[19 Jan 2011 1:54] Bugs System
Pushed into mysql-trunk 5.6.2 (revid:anders.song@greatopensource.com-20110118175419-0b0m3j5iv5gsfqpn) (version source revid:anders.song@greatopensource.com-20110118175419-0b0m3j5iv5gsfqpn) (merge vers: 5.6.2) (pib:24)
[19 Jan 2011 2:00] Bugs System
Pushed into mysql-5.1 5.1.56 (revid:anders.song@greatopensource.com-20110118172349-vlqfdghj773e6m0d) (version source revid:anders.song@greatopensource.com-20110118172349-vlqfdghj773e6m0d) (merge vers: 5.1.56) (pib:24)
[19 Jan 2011 2:00] Bugs System
Pushed into mysql-5.5 5.5.10 (revid:anders.song@greatopensource.com-20110118175758-unu85pyyjklxl6uu) (version source revid:anders.song@greatopensource.com-20110118173810-v1omp2rf8ryjq1ku) (merge vers: 5.5.10) (pib:24)
[24 Jan 2011 3:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129385

3564 anders.song@greatopensource.com	2011-01-24
      Postfix BUG#58546
      Updated Copyright.
[24 Jan 2011 3:59] Bugs System
Pushed into mysql-trunk 5.6.2 (revid:anders.song@greatopensource.com-20110124035822-to3cj6hrdb5sa9zm) (version source revid:anders.song@greatopensource.com-20110124035822-to3cj6hrdb5sa9zm) (merge vers: 5.6.2) (pib:24)
[24 Jan 2011 4:00] Bugs System
Pushed into mysql-5.1 5.1.56 (revid:anders.song@greatopensource.com-20110124034854-2i1k6vlf6gs0qdpy) (version source revid:anders.song@greatopensource.com-20110124034854-2i1k6vlf6gs0qdpy) (merge vers: 5.1.56) (pib:24)
[24 Jan 2011 4:00] Bugs System
Pushed into mysql-5.5 5.5.10 (revid:anders.song@greatopensource.com-20110124035229-y4og7g6v9dtout9x) (version source revid:anders.song@greatopensource.com-20110124035229-y4og7g6v9dtout9x) (merge vers: 5.5.10) (pib:24)
[27 Jan 2011 20:31] Jon Stephens
Documented as follows in the 5.1.56, 5.5.10, and 5.6.2 changelogs:

        Formerly STOP SLAVE stopped the slave I/O thread first and then
        the slave SQL thread; thus, it was possible for the I/O thread
        stop after replicating only part of a transaction which SQL
        thread was executing, in wich case--if the transaction
        could not be rolled back safely--the SQL thread could hang.

        Now STOP SLAVE stops the slave SQL thread first and then the
        I/O thread; this guarantees that the I/O thread can fetch any
        remaining events in the transaction that the SQL thread is
        executing, so that the SQL thread can finish the transaction if
        it cannot be rolled back safely.

Closed.
[28 Jan 2011 12:21] Jon Stephens
Already coumented as noted above; returning to Closed state.
[13 Mar 2012 22:21] Justin Tolmer
The patch submitted does not actually fix the bug. And the documentation update is incorrect. Specifically,

        this guarantees that the I/O thread can fetch any
        remaining events in the transaction that the SQL thread is
        executing

is not accurate. Stopping the I/O thread second makes it more likely that the entire transaction will be fetched but it does not guarantee it. handle_slave_io makes no attempt to ensure that it fetches complete transactions. It will gladly stop at an event in the middle of a transaction if it is killed. Thus, there is still a race and the I/O thread may terminate before fetching the entire transaction, and it sometimes does. That is why the change to rpl.rpl_stop_slave which was done in the patch for this bug is causing the test to intermittently fail, as mentioned in bug 61091.