MySQL Bugs: #25306: Race conditions during replication slave shutdown

Bug #25306	Race conditions during replication slave shutdown
Submitted:	28 Dec 2006 7:37	Modified:	31 May 2007 7:17
Reporter:	Ingo Strüwing	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.34	OS:	Any
Assigned to:	Mats Kindahl	CPU Architecture:	Any

Description:
Quote from https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c...

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: pthread_cond_broadcast@@GLIBC_2.3.2    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E50BC7: pthread_cond_broadcast@@GLIBC_2.3.2 (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: '256 bytes in 1 blocks are still reachable in loss record 5 of 7'
    COUNT: 1
    FUNCTION: calloc    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4A1FB82: calloc (vg_replace_malloc.c:279)
             by 0x9BECF5: my_thread_init (my_thr_init.c:210)
             by 0x720C8F: handle_slave_io (slave.cc:3440)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Invalid read of size 4'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B4A: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

VALGRIND: 'Syscall param futex(futex) points to unaddressable byte(s)'
    COUNT: 1
    FUNCTION: __lll_mutex_lock_wait    FILES:    slave.err
    TESTS:    rpl_rewrite_db
    STACK: at 0x4E52B48: __lll_mutex_lock_wait (in /lib64/libpthread-2.4.so)
           Address 0x55CAD00 is 3,832 bytes inside a block of size 11,832 free'd
             at 0x4A2046E: free (vg_replace_malloc.c:233)
             by 0x71F420: end_slave() (slave.cc:1100)
             by 0x5CF172: kill_server(void*) (mysqld.cc:795)
             by 0x5CF68F: kill_server_thread (mysqld.cc:1022)
             by 0x4E4D192: start_thread (in /lib64/libpthread-2.4.so)
             by 0x53C045C: clone (in /lib64/libc-2.4.so)

This log is generated from https://intranet.mysql.com/~knielsen/pb/getlog.pl?dir=mysql-5.0-engines&entry=istruewing@c...

How to repeat:
See pushbuild log.

The four valgrind stacks reported here can be divided into two groups. Stacks #1, #3 and #4 are caused by accessing memory free'd inside end_slave() while stack #2 is related to memory management problems in slave thread. I created a separate BUG#25463 report with stack #2 (as it would probably need a separate solution). From now on this bug will concern only stacks #1, #3 and #4.

Most probably the problem is in doing condition broadcast or mutex lock on condition/mutex from MASTER_INFO structure which has been already free'd inside end_slave(). This suggests some race conditions during replication slave shutdown. Unfortunatelly, valgrind doesn't say where the condition/mutex operations were called inside our code...

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/19582

ChangeSet@1.2392, 2007-02-03 20:14:16+01:00, rafal@quant.(none) +2 -0
  BUG#25306 (Race conditions during replication slave shutdown (valgrind stacks)):
  The possibility of the race is removed by changing sequence of calls
  
    pthread_mutex_unlock(&mi->run_lock);
    pthread_cond_broadcast(&mi->stop_cond);  
  
  into
  
    pthread_cond_broadcast(&mi->stop_cond);
    pthread_mutex_unlock(&mi->run_lock);
  
  at the end of I/O thread (similar change at the end of SQL thread). This ensures 
  that no thread waiting on the condition executes between the broadcast and the 
  unlock and thus can't delete the mi structure which caused the bug.

The patch looks good. The explanation and patch are well documented.

Looking at the issue

The issue with rpl_ddl test is not related to the fix I made. The problem is present in the before-patch tree. I filled a separate BUG#26687 report.

I revert to "patch approved" as the fix was already approved by rewievers.

Pushed into mysql-5.0-rpl tree.

Merged into 5.1-rpl tree.

pushed to 5.0.38,5.1.17-beta

Internal change, no entry has been recorded in the changelog.

Bug #27675 mysqld safe_mutex assert at shutdown can relate to the latest finding.
Terminating thread (after mysqladmin shutdown) does not lock mi->run_lock mutex 
(differently from a connection that STOP slave). That must be the reason of Bug#27675.