Bug #25306 Race conditions during replication slave shutdown
Submitted: 28 Dec 2006 8:37 Modified: 31 May 2007 9:17
Reporter: Ingo Strüwing
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:5.0.34 OS:Any
Assigned to: Mats Kindahl Target Version:

[28 Dec 2006 8:37] Ingo Strüwing
Description:
Quote from
https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c
hilla.local-20061228003907&name=elogsum_n_stm&plat=pb-valgrind

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: pthread_cond_broadcast@@GLIBC_2.3.2    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E50BC7: pthread_cond_broadcast@@GLIBC_2.3.2 (in
/lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: '256 bytes in 1 blocks are still reachable in loss record 5 of 7'
    COUNT: 1
    FUNCTION: calloc    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4A1FB82: calloc (vg_replace_malloc.c:279)
             by 0x9BECF5: my_thread_init (my_thr_init.c:210)
             by 0x720C8F: handle_slave_io (slave.cc:3440)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B4A: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Syscall param futex(futex) points to unaddressable byte(s)'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B48: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

This log is generated from
https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c
hilla.local-20061228003907&name=test_n_stm&plat=pb-valgrind

How to repeat:
See pushbuild log.
[8 Jan 2007 11:26] Rafal Somla
The four valgrind stacks reported here can be divided into two groups. Stacks #1, #3 and
#4 are caused by accessing memory free'd inside end_slave() while stack #2 is related to
memory management problems in slave thread. I created a separate BUG#25463 report with
stack #2 (as it would probably need a separate solution). From now on this bug will
concern only stacks #1, #3 and #4.
[8 Jan 2007 11:32] Rafal Somla
Most probably the problem is in doing condition broadcast or mutex lock on condition/mutex
from MASTER_INFO structure which has been already free'd inside end_slave(). This suggests
some race conditions during replication slave shutdown. Unfortunatelly, valgrind doesn't
say where the condition/mutex operations were called inside our code...
[8 Feb 2007 20:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/19582

ChangeSet@1.2392, 2007-02-03 20:14:16+01:00, rafal@quant.(none) +2 -0
  BUG#25306 (Race conditions during replication slave shutdown (valgrind stacks)):
  The possibility of the race is removed by changing sequence of calls
  
    pthread_mutex_unlock(&mi->run_lock);
    pthread_cond_broadcast(&mi->stop_cond);  
  
  into
  
    pthread_cond_broadcast(&mi->stop_cond);
    pthread_mutex_unlock(&mi->run_lock);
  
  at the end of I/O thread (similar change at the end of SQL thread). This ensures 
  that no thread waiting on the condition executes between the broadcast and the 
  unlock and thus can't delete the mi structure which caused the bug.
[12 Feb 2007 16:30] Chuck Bell
The patch looks good. The explanation and patch are well documented.
[26 Feb 2007 16:37] Rafal Somla
Looking at the issue
[28 Feb 2007 11:15] Rafal Somla
The issue with rpl_ddl test is not related to the fix I made. The problem is present in
the before-patch tree. I filled a separate BUG#26687 report.

I revert to "patch approved" as the fix was already approved by rewievers.
[28 Feb 2007 11:52] Rafal Somla
Pushed into mysql-5.0-rpl tree.
[1 Mar 2007 10:53] Rafal Somla
Merged into 5.1-rpl tree.
[8 Mar 2007 9:13] Andrei Elkin
pushed to 5.0.38,5.1.17-beta
[9 Mar 2007 15:26] MC Brown
Internal change, no entry has been recorded in the changelog.
[19 Jun 2007 15:58] Andrei Elkin
Bug #27675 mysqld safe_mutex assert at shutdown can relate to the latest finding.
Terminating thread (after mysqladmin shutdown) does not lock mi->run_lock mutex 
(differently from a connection that STOP slave). That must be the reason of Bug#27675.