Bug #50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
Submitted: 7 Jan 2010 20:40 Modified: 7 Mar 2010 1:18
Reporter: Elena Stepanova Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.5.1-m2 OS:Linux
Assigned to: Libing Song CPU Architecture:Any
Triage: Triaged: D1 (Critical)

[7 Jan 2010 20:40] Elena Stepanova
Description:
Server with semi-sync replication crashes while running concurrent DML flow.

100107 23:25:49 [Note] Start binlog_dump to slave_server(2), pos(, 4)
100107 23:25:49 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
mysqld: semisync_master.cc:798: int ReplSemiSyncMaster::commitTrx(const char*, my_off_t): Assertion `!active_tranxs_->is_tranx_end_pos(trx_wa
it_binlog_name, trx_wait_binlog_pos)' failed.
100107 23:25:52 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=6
max_threads=151
threads_connected=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 60561 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x138e6b80
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x461f9100 thread_stack 0x40000
bin/mysqld(my_print_stacktrace+0x2e)[0x8bcc2e]
bin/mysqld(handle_segfault+0x322)[0x5cc032]
/lib64/libpthread.so.0[0x33a8c0de70]
/lib64/libc.so.6(gsignal+0x35)[0x33a8030155]
/lib64/libc.so.6(abort+0x110)[0x33a8031bf0]
/lib64/libc.so.6(__assert_fail+0xf6)[0x33a80295d6]
lib/plugin/semisync_master.so[0x2aaaab977168]
bin/mysqld(_ZN14Trans_delegate12after_commitEP3THDb+0xfc)[0x75a95c]
bin/mysqld(_Z15ha_commit_transP3THDb+0x489)[0x6c84e9]
bin/mysqld(_Z25ha_autocommit_or_rollbackP3THDi+0x28)[0x6c8628]
bin/mysqld(_Z19close_thread_tablesP3THD+0x298)[0x615508]
bin/mysqld(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x116)[0x736216]
bin/mysqld(_ZN13sp_instr_stmt7executeEP3THDPj+0x134)[0x736f74]
bin/mysqld(_ZN7sp_head7executeEP3THD+0x62b)[0x73452b]
bin/mysqld(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x58c)[0x7359dc]
bin/mysqld(_Z21mysql_execute_commandP3THD+0x2edf)[0x5df69f]
bin/mysqld(_Z11mysql_parseP3THDPKcjPS2_+0x376)[0x5e2636]
bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x100e)[0x5e364e]
bin/mysqld(_Z10do_commandP3THD+0xe3)[0x5e3e73]
bin/mysqld(handle_one_connection+0x23e)[0x5d616e]
/lib64/libpthread.so.0[0x33a8c062f7]
/lib64/libc.so.6(clone+0x6d)[0x33a80d1b6d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x2aaab00064f0 is an invalid pointer
thd->thread_id=29
thd->killed=NOT_KILLED

Setting to Linux because could not reproduce on Solaris so far. 

How to repeat:
- unpack the attached archive in <mysql-basedir>/mysql-test. It should create
'stress_test_basedir' folder;
- run as  
perl ./stress_test_basedir/run.pl

Notes:
run.pl is a wrapper for MTR to start server, and mysql-stress-test to run a 5-thread stress test. 
For mysql-stress-test you need to have perl with threads on the top of your path (or please modify run.pl to call mysql-stress-test.pl with the right perl).

The screen output should say 'Waiting for server(s) to exit...' for a few seconds (sleep time for the server to start), and then start rapidly producing lines like
test_loop[0:0 0:377]: TID 2 test: 'stress2'  Errors: No Errors. Test Passed OK
test_loop[0:0 0:378]: TID 1 test: 'stress1'  Errors: No Errors. Test Passed OK

When/if the problem is hit, the output says something like
test_loop[0:0 0:7]: TID 2 test: 'stress1'  Errors: Severity S1: 15 (thread aborting)
and the test exits

If the server does not start fast enough, the test might abort immediately without server crash; in this case please increase sleep time in run.pl.
[7 Jan 2010 21:24] Elena Stepanova
All threads bt

Attachment: all_threads_bug50157.out (application/octet-stream, text), 15.90 KiB.

[7 Jan 2010 21:24] Elena Stepanova
Stress test

Attachment: stress_test_basedir.tar.gz (application/gzip, text), 21.21 KiB.

[7 Jan 2010 21:24] Elena Stepanova
See also bug#50160 and bug#50163. The attached stress test is the same for all three bugs, and it randomly ends with one of these crashes.
All three problems might be the same, but since the produced stack trace is noticeably different in each case, I'm logging each one separately.
[20 Jan 2010 8:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/97490

3127 Li-Bing.Song@sun.com	2010-01-20
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is created when each time some log events are written into binlog file
      and is synchronized. TranxNodes' memories are allocted from mem_root of the current
      thread, and will be freed immediately after current statement ending.
      Sometimes, a statement ends and its TranxNode is freed before it is cleared
      from TranxNode list. So the Pointer of the TranxNode in TranxNode list becomes a 
      wild pointer.
      
      After this patch, One statement always waits until its log events has been replicated
      completely if it has been binlogged and synchronized. For the TranxNode will be
      cleared from the TranxNode list after the log event has been replicated.
     @ sql/rpl_handler.cc
        params are not initialized.
[20 Jan 2010 9:30] Zhenxing He
Bug#50160, Bug#50163 is marked as dup of this bug.
[22 Jan 2010 14:59] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/97891

3128 Li-Bing.Song@sun.com	2010-01-22
      Bug #50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is created when each time some log events are written into binlog file
      and is synchronized. TranxNodes' memories are allocted from mem_root of the current
      thread, and will be freed immediately after current statement ending.
      from TranxNode list. So the Pointer of the TranxNode in TranxNode list becomes a 
      wild pointer.
            
      After this patch, TranxNodes are not allocated from mem_root.
[25 Jan 2010 9:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98017

3127 Li-Bing.Song@sun.com	2010-01-25
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is allocated and insertted into the active list each time 
      when some log events are written into binlog file and is flushed. 
      The memory for TranxNode is allocted with thd_alloc and will be freed 
      after at the end of the statement. The after_commit/after_rollback callback
      was supposed to be call before the end of each statement and remove the node from
      the active list. However this assumption is not correct in all cases(e.g. CREATE
      TEMPORARY ... SELECT), and can cause the memory allocated for TranxNode be freed
      before it was removed from the active list. So The TranxNode pointer in the active
      list would become a wild pointer and cause the crash.
      
      After this patch, The memory for TranxNode will be allocated by my_malloc.
     @ sql/rpl_handler.cc
        params are not initialized.
[27 Jan 2010 7:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98279

3127 Li-Bing.Song@sun.com	2010-01-27
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is allocated and insertted into the active list each time 
      when some log events are written into binlog file and is flushed. 
      The memory for TranxNode is allocted with thd_alloc and will be freed 
      after at the end of the statement. The after_commit/after_rollback callback
      was supposed to be call before the end of each statement and remove the node from
      the active list. However this assumption is not correct in all cases(e.g. CREATE
      TEMPORARY ... SELECT), and can cause the memory allocated for TranxNode be freed
      before it was removed from the active list. So The TranxNode pointer in the active
      list would become a wild pointer and cause the crash.
      
      After this patch, The memory for TranxNode will be allocated by my_malloc.
     @ sql/rpl_handler.cc
        params are not initialized.
[29 Jan 2010 6:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98535

3127 Li-Bing.Song@sun.com	2010-01-29
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is allocated and insertted into the active list each time 
      a log events is written and flushed into the binlog file. 
      The memory for TranxNode is allocated with thd_alloc and will be freed 
      at the end of the statement. The after_commit/after_rollback callback
      was supposed to be called before the end of each statement and remove the node from
      the active list. However this assumption is not correct in all cases(e.g. call 
      'CREATE TEMPORARY TABLE myisam_t SELECT * FROM innodb_t' in a transaction
       and delete all temporary tables automatically when a session closed), 
      and can cause the memory allocated for TranxNode be freed
      before it was removed from the active list. So The TranxNode pointer in the active
      list would become a wild pointer and cause the crash.
      
      After this patch, We had a class called a TranxNodeAllocate which manages the memory
      for allocating and freeing TranxNode. It uses my_malloc to allocate memory.
     @ sql/rpl_handler.cc
        params are not initialized.
[30 Jan 2010 15:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98715

3127 Li-Bing.Song@sun.com	2010-01-30
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is allocated and inserted into the active list each time 
      a log event is written and flushed into the binlog file. 
      The memory for TranxNode is allocated with thd_alloc and will be freed 
      at the end of the statement. The after_commit/after_rollback callback
      was supposed to be called before the end of each statement and remove the node from
      the active list. However this assumption is not correct in all cases(e.g. call 
      'CREATE TEMPORARY TABLE myisam_t SELECT * FROM innodb_t' in a transaction
       and delete all temporary tables automatically when a session closed), 
      and can cause the memory allocated for TranxNode be freed
      before it was removed from the active list. So The TranxNode pointer in the active
      list would become a wild pointer and cause the crash.
      
      After this patch, We have a class called a TranxNodeAllocate which manages the memory
      for allocating and freeing TranxNode. It uses my_malloc to allocate memory.
     @ sql/rpl_handler.cc
        params are not initialized.
[30 Jan 2010 18:29] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98725

3127 Li-Bing.Song@sun.com	2010-01-31
      BUG#50157 Assertion !active_tranxs_->is_tranx_end_pos(..) in ReplSemiSyncMaster::commitTrx
      
      The root cause of the crash is that a TranxNode is freed before it is used.
      A TranxNode is allocated and inserted into the active list each time 
      a log event is written and flushed into the binlog file. 
      The memory for TranxNode is allocated with thd_alloc and will be freed 
      at the end of the statement. The after_commit/after_rollback callback
      was supposed to be called before the end of each statement and remove the node from
      the active list. However this assumption is not correct in all cases(e.g. call 
      'CREATE TEMPORARY TABLE myisam_t SELECT * FROM innodb_t' in a transaction
       and delete all temporary tables automatically when a session closed), 
      and can cause the memory allocated for TranxNode be freed
      before it was removed from the active list. So The TranxNode pointer in the active
      list would become a wild pointer and cause the crash.
      
      After this patch, We have a class called a TranxNodeAllocate which manages the memory
      for allocating and freeing TranxNode. It uses my_malloc to allocate memory.
     @ sql/rpl_handler.cc
        params are not initialized.
[30 Jan 2010 19:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98733

2932 Li-Bing.Song@sun.com	2010-01-31 [merge]
      Auto Merge fix for bug#50157
[31 Jan 2010 13:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98778

2979 Li-Bing.Song@sun.com	2010-01-31 [merge]
      Manual Merge for bug#50157
[31 Jan 2010 13:53] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98779

3859 Li-Bing.Song@sun.com	2010-01-31 [merge]
      Manual merge for bug#50157
[1 Feb 2010 7:29] Libing Song
Pushed to mysql-5.1-rep-semisync and merged to mysql-trunk-bugfixing, mysql-next-mr-bugfixing and mysql-6.0-codebase-bugfixing
[4 Feb 2010 14:23] Zhenxing He
Bug#50060 is marked as dup of this bug.
[12 Feb 2010 17:40] Bugs System
Pushed into 5.5.2-m2 (revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (version source revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (merge vers: 5.5.2-m2) (pib:16)
[13 Feb 2010 8:37] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100213083436-9pesg4h55w1mekxc) (version source revid:luis.soares@sun.com-20100211135109-t63avry9fqpgyh78) (merge vers: 6.0.14-alpha) (pib:16)
[13 Feb 2010 8:39] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100213083327-cee4ao3jpg33eggv) (version source revid:luis.soares@sun.com-20100211135018-1f9dbghg0itszigo) (pib:16)
[18 Feb 2010 18:28] Paul Dubois
Noted in 5.5.2, 6.0.14 changelogs.

With semisynchronous replication, memory allocated for handling
transactions could be freed while still in use, resulting in a server
crash.  

Setting report to Need Merge pending push of Celosia to release tree.
[6 Mar 2010 11:06] Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:vvaintroub@mysql.com-20100213160132-nx1vlocxuta76txh) (merge vers: 5.5.99-m3) (pib:16)
[7 Mar 2010 1:18] Paul Dubois
Already fixed in earlier 5.5.x.