MySQL Bugs: #24860: Incorrect SLAVE_TRANSACTION_RETRIES code can result in slave stuck

Bug #24860	Incorrect SLAVE_TRANSACTION_RETRIES code can result in slave stuck
Submitted:	6 Dec 2006 18:17	Modified:	28 Nov 2007 19:03
Reporter:	Rafal Somla
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1.12	OS:	Any
Assigned to:	Mats Kindahl	CPU Architecture:	Any

Description:
Because the handling of the SLAVE_TRANSACTION_RETRIES variable is implemented incorrectly, a replication slave can enter an infinite loop, repeating a statement that triggers a transient error.

A replication slave re-executes a replication event if it triggers an error considered to be transient (currently defined by the has_temporary_error() function in slave.cc). The number of retries is supposed to be limited by the value of the global SLAVE_TRANSACTION_RETRIES variable.

This is implemented in the main event execution loop of the slave's SQL thread (functions handle_slave_sql() and exec_relay_log_event() in slave.cc). When the loop detects that an error returned by ev->exec_event() is transient, it rewinds the relay log to the position of the event (this is complicated by event groups, as explained below). As a result, the next iteration of the event execution loop reads the same event and executes it again.

When an event is repeated, the variable rli->trans_retries is incremented and checked against the value of SLAVE_TRANSACTION_RETRIES, so that repetition stops after that many tries. rli->trans_retries is reset to zero upon successful execution of an event.

The problem is caused by the existence of event groups in the replication log. When a single SQL statement is replicated (in row-based replication), several events can be generated, and together they form a single event group in the log. For example, when an INSERT statement is replicated, a Table_map event is generated first, followed by one or more Write_rows events. When a transient error is encountered during event execution, the relay log is rewound to the beginning of the whole event group containing that event, so the whole group is repeated. Now consider the following event group:

1. Table_map event   // executes ok
2. Write_rows event  // triggers transient error, repeat from 1.

Here rli->trans_retries is incremented when event 2 is executed and the transient error is detected, but it is reset to zero again once event 1 executes successfully on the next pass. As a result, this event group will be repeated ad infinitum unless the error condition disappears.
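
To make the loop concrete, here is a minimal standalone sketch of the counter logic described above. It is a simulation and not the slave.cc code: SimEvent, simulate_per_event_reset() and max_restarts are hypothetical names introduced only for this illustration, and max_restarts exists only so that the simulation itself terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one event in a group (not a MySQL type):
// does executing it hit a transient error every time?
struct SimEvent {
    bool transient_error;
};

// Replays `group` with the counter reset after every successful event.
// Returns the number of rewinds performed before the retry limit stopped
// the slave (or before the group completed), or -1 if `max_restarts`
// rewinds happened without the limit ever being reached.
long simulate_per_event_reset(const std::vector<SimEvent>& group,
                              unsigned long slave_trans_retries,
                              long max_restarts) {
    assert(max_restarts >= 0);
    unsigned long trans_retries = 0;  // models rli->trans_retries
    long restarts = 0;
    std::size_t pos = 0;
    while (pos < group.size()) {
        if (group[pos].transient_error) {
            if (trans_retries >= slave_trans_retries) return restarts;  // gave up
            if (restarts >= max_restarts) return -1;  // still looping
            ++trans_retries;
            ++restarts;
            pos = 0;            // rewind to the start of the group
        } else {
            trans_retries = 0;  // the problematic reset after each success
            ++pos;
        }
    }
    return restarts;  // the whole group was applied
}
```

With a group of {ok, failing} events the function reports -1 for any limit greater than zero, because the counter never gets past 1. With a single failing event it stops after exactly slave_trans_retries rewinds, which is the behaviour the variable is meant to guarantee.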

Note: this is not very critical because transient errors, by definition, should disappear after some time.

How to repeat:
Not easy, because it is difficult to provoke a transient error. I managed to get the faulty behaviour by modifying the exec_relay_log_event() code so that all event execution errors are considered transient:

    if (slave_trans_retries)
    {
-     if (exec_res && has_temporary_error(thd) )
+     if (exec_res /* && has_temporary_error(thd) */ )
      {
        const char *errmsg;

Still, it was not easy to get an error only in the second event of an event group. I managed to do it by creating a foreign key constraint on a table. The constraint is satisfied on the master, while on the slave I modify the referenced table so that the constraint fails. This way the Table_map event executes correctly, but there is an error during the Write_rows event. The test case looks as follows:

-- source include/master-slave.inc

-- connection master

CREATE TABLE t1 (a int primary key) engine=InnoDB;
CREATE TABLE t2 (a int, b int, foreign key (b) references t1(a)) engine=InnoDB;

SET AUTOCOMMIT=1;

INSERT INTO t1 VALUES (1);

-- sync_slave_with_master

STOP SLAVE SQL_THREAD;
DELETE FROM t1 WHERE a=1;

-- connection master

INSERT INTO t2 VALUES (2,1);

-- save_master_pos

-- connection slave

START SLAVE SQL_THREAD;

-- sync_with_master

Suggested fix:
Reset rli->trans_retries only when a new event group starts.
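
A minimal sketch of what this could look like, written as a standalone simulation and not as a patch to slave.cc. It reads "a new event group starts" as "the previous group was applied completely". FlakyEvent, Outcome and apply_group_with_group_reset() are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event model: the event fails with a transient error for its
// first `failures_left` executions and succeeds afterwards.
struct FlakyEvent {
    unsigned failures_left;
};

enum class Outcome { GroupApplied, RetriesExhausted };

// Applies one event group. `trans_retries` models rli->trans_retries and is
// reset only after the whole group has been applied.
Outcome apply_group_with_group_reset(std::vector<FlakyEvent> group,
                                     unsigned long slave_trans_retries,
                                     unsigned long& trans_retries) {
    assert(trans_retries <= slave_trans_retries);
    std::size_t pos = 0;
    while (pos < group.size()) {
        if (group[pos].failures_left > 0) {
            --group[pos].failures_left;
            if (trans_retries >= slave_trans_retries)
                return Outcome::RetriesExhausted;  // slave would stop here
            ++trans_retries;
            pos = 0;  // rewind to the start of the group
        } else {
            ++pos;    // success inside the group: counter is left alone
        }
    }
    trans_retries = 0;  // group complete: the next group starts fresh
    return Outcome::GroupApplied;
}
```

In this model a Write_rows event that fails three times is retried and then applied, while one that keeps failing stops the slave after slave_trans_retries rewinds instead of looping forever.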

Rafal,
"Due to incorrect implementation of SLAVE_TRANSACTION_RETRIES"
Hey, it was implemented (guess by who) before row-based replication, so no wonder it may break with it: row-based replication introduced a new type of group (table maps + rows).
You're right about groups; in fact, the existing code below is an attempt to solve this problem when the group is an SBR transaction:
we reset the counter only when we are not in a transaction.
So if we have
BEGIN;
INSERT; # fails with transient error
we will go back to BEGIN, execute BEGIN ok, but not reset the counter.
Only when we reach the COMMIT will we reset the counter. Which is correct.
      else if (!((thd->options & OPTION_BEGIN) && opt_using_transactions))
      {
        /*
          Only reset the retry counter if the event succeeded or
          failed with a non-transient error.  On a successful event,
          the execution will proceed as usual; in the case of a
          non-transient error, the slave will stop with an error.
         */
        rli->trans_retries= 0; // restart from fresh
      }

I *think* that if we change the condition in the "else" in my previous post to:
"else if (rli->group_relay_log_pos == rli->event_relay_log_pos)",
it would work. The counter would then be reset only if the event we just processed successfully ended a group.
In a transaction, no event except COMMIT ends a group.
In autocommit mode, Intvar, Rand and Table_map events don't end a group.
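
To illustrate the idea behind that condition, here is a small standalone sketch. The field names are borrowed from the condition above, but RelayPositions, advance() and may_reset_retry_counter() are hypothetical, and the bookkeeping (the event position moves after every event, the group position catches up only when a group ends) is an assumption of this sketch and not a copy of the server code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two relay log positions kept in rli.
struct RelayPositions {
    std::uint64_t group_relay_log_pos = 0;  // end of the last completed group
    std::uint64_t event_relay_log_pos = 0;  // end of the last executed event
};

// Assumed bookkeeping after an event has been executed successfully.
void advance(RelayPositions& p, std::uint64_t event_len, bool ends_group) {
    assert(event_len > 0);
    p.event_relay_log_pos += event_len;
    if (ends_group) p.group_relay_log_pos = p.event_relay_log_pos;
}

// The proposed test: only an event that ended a group may reset the counter.
bool may_reset_retry_counter(const RelayPositions& p) {
    return p.group_relay_log_pos == p.event_relay_log_pos;
}
```

Under these assumptions, a Table_map event leaves the two positions different, so the retry counter survives, and the Write_rows event that closes the group makes them equal again.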

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/35989

ChangeSet@1.2579, 2007-10-20 20:16:12+02:00, mats@kindahl-laptop.dnsalias.net +4 -0
  BUG#24860 (Incorrect SLAVE_TRANSACTION_RETRIES code can result in slave stuck):
  
  If a temporary error occurred inside a group on an event that was not the first
  event of the group, the slave could get stuck because the retry counter was reset
  whenever an event was executed successfully.
  
  This patch only resets the retry counter when an entire group has been successfully
  executed, or failed with a non-transient error.

Pushed into 5.1.23-rc

Pushed into 6.0.4-alpha

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.23 and 6.0.4 changelogs as:

        If a temporary error occurred inside an event group on an event
        that was not the first event of the group, the slave could get
        caught in an endless loop because the retry counter was reset
        whenever an event was executed successfully.