| Bug #45940 | issues around rli->last_event_start_time | ||
|---|---|---|---|
| Submitted: | 3 Jul 2009 14:58 | Modified: | 27 Jul 2009 12:53 | 
| Reporter: | Andrei Elkin | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) | 
| Version: | 5.1+ | OS: | Any | 
| Assigned to: | Andrei Elkin | CPU Architecture: | Any | 
   [3 Jul 2009 15:03]
   Andrei Elkin        
  For prioritizing please consider PB2 witnesses p.4 as http://pb2.norway.sun.com/web.py?template=mysql_show_test_failuretest_failure_id=2235535 Notice, it's a hanging not the crash but backtracing indeed.
   [6 Jul 2009 13:47]
   Bugs System        
  A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/78022 2812 Andrei Elkin 2009-07-06 Bug #45940 issues around rli->last_event_start_time there are several issues in implementation of the feature: 1. the scope of the timer is not the group but the statement It's set in Rows_log_event::do_apply_event() rli->last_event_start_time = time(0) and reset in Relay_log_info::cleanup_context() Besides not any statement but one marked with STMT_END_F. That makes possible the instant stop. However, consequences are well-known: bug#319, bug#38205. 2. the timer does not apply to SBR leaving out of consideration the mixed transaction (i.e involving a non-trans table) 3. the pk-case instant break of the group is not complimented with a warning suggesting to restart in the Idempotent mode 4. the timer is incoherent with fixes for bug#38205 that made the mixed engines group unbreakable but with KILL. Particularly, the timer's giving-out can't get control due to if (rli->abort_slave && rli->is_in_group() && thd->transaction.all.modified_non_trans_table) DBUG_RETURN(0); which can lead to STOP SLAVE hanging. Fixed 1-4 with incorporating the timer's logics with transaction.all.modified_non_trans_table flag and simplifying the timer's routine to be set and reset inside sql_slave_killed(). Couple of failure simulation are added and a former row-based format test extened to apply to any format.
   [8 Jul 2009 12:49]
   Bugs System        
  A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/78203 2813 Andrei Elkin 2009-07-08 Bug #45940 issues around rli->last_event_start_time post-reviewing patch with small improvements.
   [23 Jul 2009 10:24]
   Bugs System        
  Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version source revid:aelkin@mysql.com-20090708180952-73h6cpmg3r5wtxwa) (merge vers: 5.4.4-alpha) (pib:11)
   [27 Jul 2009 12:53]
   Jon Stephens        
  Documented bugfix in the 5.4.4 changelog as follows:
        When using row-based replication, the incomplete logging of a
        group of events involving both transaction and non-transactional
        tables could cause STOP SLAVE to hang.
 
   [11 Mar 2010 18:02]
   Paul DuBois        
  Noted in 5.5.3 changelog.
   [26 May 2010 11:35]
   Sveta Smirnova        
  Bug #53985 was marked as duplicate of this one.

Description: rli->last_event_start_time is designed for deferring affects of STOP SLAVE and let the being executed group of events to finish up first. More specifically the former design was concerned with the PrimaryKey-free case. The PK-able case was not attempted to cover relying on the Idempotent slave_exec_mode: the stopped in the middle the PK-group could be replayed. there are several issues in implementation of the feature: 1. the scope of the timer is not the group but the statement It's set in Rows_log_event::do_apply_event() rli->last_event_start_time = time(0) and reset in Relay_log_info::cleanup_context() Besides not any statement but one marked with STMT_END_F. That makes possible the instant stop. However, consequences are well-known: bug#319, bug#38205. 2. the timer does not apply to SBR leaving out of consideration the mixed transaction (i.e involving a non-trans table) 3. the pk-case instant break of the group is not complimented with a warning suggesting to restart in the Idempotent mode 4. the timer is incoherent with fixes for bug#38205 that made the mixed engines group unbreakable but with KILL. Particularly, the timer give-out can't get control due to if (rli->abort_slave && rli->is_in_group() && thd->transaction.all.modified_non_trans_table) DBUG_RETURN(0); which can lead to STOP SLAVE to hang. How to repeat: 1-3: look at code 4: 1.Simulate a part-recorded in the relay log mixed transaction on rbr-group of events. 2.Stop IO thread to not let completion. 3.Issue STOP SLAVE it will hang forever. Suggested fix: 1-3: extend the life time of the timer to be reset only at the end of the group Improve the warning to refer to possible Idempotent mode at restart would be necessary. 4: incorporate logics of the timer and deferring to accept killed status by the sql thread inside sql_slave_killed()