MySQL Bugs: #319: if while a non-transactional slave is replicating a transaction, possiblproblem

Bug #319	if while a non-transactional slave is replicating a transaction, possiblproblem
Submitted:	23 Apr 2003 14:32	Modified:	6 May 2009 14:24
Reporter:	Guilhem Bichot	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	4.0, 4.1, 5.1	OS:	Any (all)
Assigned to:	Andrei Elkin	CPU Architecture:	Any

Description:
Sorry for the synopsis, but the max length is too short for replication bugs ;)

If we have a transaction on the master:

begin;
#position 1000 in relay log
delete from innodbtable1;
#position 1020 in relay log
insert into innodbtable1 values(10);
commit;

Until commit is executed by the slave, relay-log.info will contain 1000.
If someone does STOP SLAVE between the 2 queries : later replication
will resume at position 1000 because this is what is written in relay-log.info.
- If the slave runs with InnoDB tables, rollback will occur, and replication
should later resume at position 1000. relay-log.info is good.
- If the slave runs with --skip-innodb, innodbtable1 is probably MyISAM on the slave, and no rollback, and replication should resume at 1020, not 1000 like it does. BUG.

Monty's proposed solution :
- when STOP SLAVE is received, if on the slave you have seen a BEGIN and you have updated a non transactional table, wait until the COMMIT is executed before stopping.
Use if (test_all_bits(thd->options,OPTION_BEGIN | OPTION_STATUS_NO_TRANS_UPDATE)) for this.
- so before starting executing a transaction on the slave, be sure that you
have the COMMIT in the relay log (otherwise you could wait forever).

Guilhem's comment : what if the slave stops because of an error (error in reading the relay log, or in applying the queries), not because of STOP SLAVE ?.
Then in the transaction some statements will be rolled back, others won't
(those updating a MyISAM table), and replication is badly broken. But it is
probably not a super idea to update a MyISAM table inside a transaction.

How to repeat:

Suggested fix:
Will work on it later. Not very urgent.

This is now documented in the manual.  We will fix this in a future MySQL release.

Time has come to verify that RBR works.

Not sure if I really understood all right. Tested RBR on 5.1 bzr tree. Fact is stop/start position is the same after stop/start slave.

Situation is exactly how Guilhem described it also by using RBR and 5.1.

Verified like described:

create table innodbtable1(i integer)engine=innodb;

begin;
delete from innodbtable1;

Now stop slave:

090202 13:47:06 [Note] Slave I/O thread killed while reading event
090202 13:47:06 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000001', position 408

insert into innodbtable1 values(10);
commit;

Now start slave again:
090202 13:47:44 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000001' at position 408, relay log './ernie-relay-bin.000002' position: 553
090202 13:47:44 [Note] Slave I/O thread: connected to master 'miracee@127.0.0.1:5151',replication started in log 'mysql-bin.000001' at position 408

So it started at same position as it stoped ...

REFINED PROBLEM DESCRIPTION
===========================

ASSUMPTIONS

- Master has transactional tables (e.g. InnoDB)
- Slave has non-transactional tables (e.g. MyISAM)

PROBLEMS

1. When slave is stopped *by a user*, the transactional changes
   are not rolled back on the slave.  This causes the changes to
   be re-applied on the slave at later startup causing errors.

2. When slave is stopped *due to an error*, the transactional
   changes are not rolled back on the slave.  This causes the
   changes to be re-applied on the slave at later startup causing
   errors.

ANALYSIS
========

1. Can be solved by a delayed stop as Monty suggested.

2. Can not be solved in the general case without redesigning
   binlog to be interleaved and having more engine-influenced
   control over the transactions.

Because we have two problems here I opened a separat bug report for the problem "When slave is stopped *by a user*"

New bug report is bug #43217

This bug report is only for problem 2) when slave stop because of error.

Problem 1 is outsourced into bug #43217

The bug contains request to stop processing a mixed transaction in statement base
replication in the middle. If there is a non-transaction table the
whole group will complete and only afterward STOP slave will become
effective.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/70478

2836 Andrei Elkin	2009-03-26
      Bug#38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUP
      Bug#319  if while a non-transactional slave is replicating a transaction possible problem 
      
      It is impossible to roll back a mixed engines transaction when one of the engine is
      non-transaction. In replication that fact is crucial because the slave can not safely
      re-apply a transction that was interrupted with STOP SLAVE.
      
      Fixed with making STOP SLAVE not be effective immediately in the case the current
      group of replication events has modified a non-transaction table. In order for slave to leave
      either the group needs finishing or the user issues KILL QUERY|CONNECTION slave_thread_id.

bug#38205 is being fixed with the same patch.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71784

2837 Andrei Elkin	2009-04-09
      Bug #38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUPP_KEY
      Bug#319  if while a non-transactional slave is replicating a transaction possible problem
      
      only testing related: addressing reviewers' comments.

pushed to 5.1+ - bugteam

Pushed into 5.1.35 (revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (version source revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (merge vers: 5.1.35) (pib:6)

Documented bugfix in the 5.1.35 changelog as follows:

        The transactional behavior of STOP SLAVE has changed. Formerly, 
        it took effect immediately; now, it waits until the current 
        replication event group (if any) has finished executing, or until 
        the user issues a KILL QUERY or KILL CONNECTION statement.

        This was done in order to solve the problem encountered when 
        replication was stopped while a non-transactional slave was 
        replicating a transaction on the master. (It was impossible to 
        roll back a mixed-engines transaction when one of the engines was 
        non-transactional, which meant that the slave could not safely 
        re-apply any transaction that had been interrupted by STOP 
        SLAVE.)

Set status = NDI: waiting on merge to 6.0 tree for version info.

Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version source revid:aelkin@mysql.com-20090421094803-hho7a34gdsby7382) (merge vers: 6.0.11-alpha) (pib:6)

Bugfix also noted in 6.0.12 changelog; closed.

Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090615074202-0r5r2jmi83tww6sf) (version source revid:jonas@mysql.com-20090615070837-9pccutgc7repvb4d) (merge vers: 5.1.35-ndb-6.3.26) (pib:6)

Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090615074335-9hcltksp5cu5fucn) (version source revid:jonas@mysql.com-20090615072714-rmfkvrbbipd9r32c) (merge vers: 5.1.35-ndb-7.0.7) (pib:6)

Pushed into 5.1.35-ndb-6.2.19 (revid:jonas@mysql.com-20090615061520-sq7ds4yw299ggugm) (version source revid:jonas@mysql.com-20090615054654-ebgpz7elwu1xj36j) (merge vers: 5.1.35-ndb-6.2.19) (pib:6)

Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version source revid:aelkin@mysql.com-20090708180952-73h6cpmg3r5wtxwa) (merge vers: 5.4.4-alpha) (pib:11)

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Ignore previous comment about 5.4.2.

The 5.4 fix now has been pushed into 5.4.2.