Bug #25306 Race conditions during replication slave shutdown
Submitted: 28 Dec 2006 7:37 Modified: 31 May 2007 7:17
Reporter: Ingo Strüwing Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.34 OS:Any
Assigned to: Mats Kindahl CPU Architecture:Any

[28 Dec 2006 7:37] Ingo Strüwing
Description:
Quote from https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c...

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: pthread_cond_broadcast@@GLIBC_2.3.2    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E50BC7: pthread_cond_broadcast@@GLIBC_2.3.2 (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: '256 bytes in 1 blocks are still reachable in loss record 5 of 7'
    COUNT: 1
    FUNCTION: calloc    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4A1FB82: calloc (vg_replace_malloc.c:279)
             by 0x9BECF5: my_thread_init (my_thr_init.c:210)
             by 0x720C8F: handle_slave_io (slave.cc:3440)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B4A: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Syscall param futex(futex) points to unaddressable byte(s)'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B48: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

This log is generated from https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c...

How to repeat:
See pushbuild log.
[8 Jan 2007 10:26] Rafal Somla
The four valgrind stacks reported here can be divided into two groups. Stacks #1, #3 and #4 are caused by accessing memory free'd inside end_slave() while stack #2 is related to memory management problems in slave thread. I created a separate BUG#25463 report with stack #2 (as it would probably need a separate solution). From now on this bug will concern only stacks #1, #3 and #4.
[8 Jan 2007 10:32] Rafal Somla
Most probably the problem is in doing condition broadcast or mutex lock on condition/mutex from MASTER_INFO structure which has been already free'd inside end_slave(). This suggests some race conditions during replication slave shutdown. Unfortunatelly, valgrind doesn't say where the condition/mutex operations were called inside our code...
[8 Feb 2007 19:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/19582

ChangeSet@1.2392, 2007-02-03 20:14:16+01:00, rafal@quant.(none) +2 -0
  BUG#25306 (Race conditions during replication slave shutdown (valgrind stacks)):
  The possibility of the race is removed by changing sequence of calls
  
    pthread_mutex_unlock(&mi->run_lock);
    pthread_cond_broadcast(&mi->stop_cond);  
  
  into
  
    pthread_cond_broadcast(&mi->stop_cond);
    pthread_mutex_unlock(&mi->run_lock);
  
  at the end of I/O thread (similar change at the end of SQL thread). This ensures 
  that no thread waiting on the condition executes between the broadcast and the 
  unlock and thus can't delete the mi structure which caused the bug.
[12 Feb 2007 15:30] Chuck Bell
The patch looks good. The explanation and patch are well documented.
[26 Feb 2007 15:37] Rafal Somla
Looking at the issue
[28 Feb 2007 10:15] Rafal Somla
The issue with rpl_ddl test is not related to the fix I made. The problem is present in the before-patch tree. I filled a separate BUG#26687 report.

I revert to "patch approved" as the fix was already approved by rewievers.
[28 Feb 2007 10:52] Rafal Somla
Pushed into mysql-5.0-rpl tree.
[1 Mar 2007 9:53] Rafal Somla
Merged into 5.1-rpl tree.
[8 Mar 2007 8:13] Andrei Elkin
pushed to 5.0.38,5.1.17-beta
[9 Mar 2007 14:26] MC Brown
Internal change, no entry has been recorded in the changelog.
[19 Jun 2007 13:58] Andrei Elkin
Bug #27675 mysqld safe_mutex assert at shutdown can relate to the latest finding.
Terminating thread (after mysqladmin shutdown) does not lock mi->run_lock mutex 
(differently from a connection that STOP slave). That must be the reason of Bug#27675.