Bug #45940 issues around rli->last_event_start_time
Submitted: 3 Jul 2009 14:58 Modified: 27 Jul 2009 12:53
Reporter: Andrei Elkin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.1+ OS:Any
Assigned to: Andrei Elkin CPU Architecture:Any

[3 Jul 2009 14:58] Andrei Elkin
Description:
rli->last_event_start_time is designed for deferring affects of STOP SLAVE
and let the being executed group of events to finish up first.
More specifically the former design was concerned with the PrimaryKey-free case.
The PK-able case was not attempted to cover relying on the Idempotent slave_exec_mode: the stopped in the middle the PK-group could be replayed. 

there are several issues in implementation of the feature:

1. the scope of the timer is not the group but the statement
   It's set in Rows_log_event::do_apply_event()
     rli->last_event_start_time = time(0)
   and reset in Relay_log_info::cleanup_context()

   Besides not any statement but one marked with STMT_END_F.
   That makes possible the instant stop. However, consequences are well-known:
   bug#319, bug#38205.

2. the timer does not apply to SBR leaving out of consideration
   the mixed transaction (i.e involving a non-trans table)

3. the pk-case instant break of the group is not complimented with
   a warning suggesting to restart in the Idempotent mode

4. the timer is incoherent with fixes for bug#38205 that 
   made the mixed engines group unbreakable but with KILL.
   Particularly, the timer give-out can't get control
   due to

    if (rli->abort_slave && rli->is_in_group() &&
        thd->transaction.all.modified_non_trans_table)
      DBUG_RETURN(0);

   which can lead to STOP SLAVE to hang.

How to repeat:
1-3:
look at code

4:
1.Simulate a part-recorded in the relay log mixed transaction on rbr-group of events.
2.Stop IO thread to not let completion.
3.Issue STOP SLAVE

it will hang forever.

Suggested fix:
1-3: extend the life time of the timer to be reset only at the end of the group
     Improve the warning to refer to possible Idempotent mode at restart would
     be necessary.

4: incorporate logics of the timer and deferring to accept killed status by
   the sql thread inside sql_slave_killed()
[3 Jul 2009 15:03] Andrei Elkin
For prioritizing please consider 
PB2 witnesses p.4 as
 http://pb2.norway.sun.com/web.py?template=mysql_show_test_failuretest_failure_id=2235535

Notice, it's a hanging not the crash but backtracing indeed.
[6 Jul 2009 13:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78022

2812 Andrei Elkin	2009-07-06
      Bug #45940  	issues around rli->last_event_start_time
      
      there are several issues in implementation of the feature:
      1. the scope of the timer is not the group but the statement
         It's set in Rows_log_event::do_apply_event()
           rli->last_event_start_time = time(0)
         and reset in Relay_log_info::cleanup_context()
      
         Besides not any statement but one marked with STMT_END_F.
         That makes possible the instant stop. However, consequences are well-known:
         bug#319, bug#38205.
      2. the timer does not apply to SBR leaving out of consideration
         the mixed transaction (i.e involving a non-trans table)
      3. the pk-case instant break of the group is not complimented with
         a warning suggesting to restart in the Idempotent mode
      4. the timer is incoherent with fixes for bug#38205 that 
         made the mixed engines group unbreakable but with KILL.
         Particularly, the timer's giving-out can't get control
         due to
      
          if (rli->abort_slave && rli->is_in_group() &&
              thd->transaction.all.modified_non_trans_table)
            DBUG_RETURN(0);
      
         which can lead to STOP SLAVE hanging.
      
      Fixed 1-4 with incorporating the timer's logics with transaction.all.modified_non_trans_table
      flag and simplifying the timer's routine to be set and reset inside sql_slave_killed().
      Couple of failure simulation are added and a former row-based format test extened to apply
      to any format.
[8 Jul 2009 12:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78203

2813 Andrei Elkin	2009-07-08
      Bug #45940  	issues around rli->last_event_start_time
      
      post-reviewing patch with small improvements.
[23 Jul 2009 10:24] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version source revid:aelkin@mysql.com-20090708180952-73h6cpmg3r5wtxwa) (merge vers: 5.4.4-alpha) (pib:11)
[27 Jul 2009 12:53] Jon Stephens
Documented bugfix in the 5.4.4 changelog as follows:

        When using row-based replication, the incomplete logging of a
        group of events involving both transaction and non-transactional
        tables could cause STOP SLAVE to hang.
[11 Mar 2010 18:02] Paul DuBois
Noted in 5.5.3 changelog.
[26 May 2010 11:35] Sveta Smirnova
Bug #53985 was marked as duplicate of this one.