MySQL Bugs: #38205: Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUPP

Bug #38205	Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUPP_KEY
Submitted:	17 Jul 2008 16:44	Modified:	6 May 2009 14:47
Reporter:	Eric Jensen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1.25	OS:	Linux (x86_64)
Assigned to:	Andrei Elkin	CPU Architecture:	Any
Tags:	RBR, replication

Description:
I can't stop and start the replication at all without breaking the
replication. If I do "stop slave" and then "start slave" again it
immediately has an error like this, seemingly for random tables that
have been written to (not one in particular):

Could not execute Write_rows event on table
XXX; Duplicate entry
'YYY' for key 'ZZZ',
Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's
master log binlog.000003, end_log_pos 704849

In my app, I'm doing lots of batched, large "insert values ..." statements into MyISAM tables that take the bulk of the time, so it's tough to be able to stop replication and not hit the middle of one of them. Is it the case that with a non-transactional engine, it is impossible to guarantee that stopping and starting RBR won't break replication then? The previous conversations below seemed to suggest this...

There are a few RBR inconsistency bugs floating around that may explain this, although none of them suggest this happens on every stop/start slave, and none touch on this documentation issue: http://bugs.mysql.com/search.php?search_for=&status=Active&severity=&limit=90&order_by=pri...

I realize this is not a well-encapsulated bug report. I originally inquired about it on a few places:

http://lists.mysql.com/replication/1335
http://forums.mysql.com/read.php?26,216227,216227
http://bugs.mysql.com/bug.php?id=36826

but received no response. It is preventing me from upgrading to 5.1. Shouldn't the lists and forums be merged?

How to repeat:
While doing large batched inserts on the master using either row or mixed log format, stop and start the slave. I also had this problem when trying 5.1.24

The last bullet on http://dev.mysql.com/doc/refman/5.1/en/replication-sbr-rbr.html also seems to also suggest a problem with RBR on non-transactional tables to me, but I find it hard to understand...I think some elaboration is in order.

Thank you for the report.

Please provide output of SHOW CREATE TABLE problem_table and configuration files for both master and slave. Also please indicate if slave is read-only.

CREATE TABLE `problem_table` (
  `a_id` int(10) unsigned NOT NULL,
  `b_id` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`a_id`,`b_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

slave does not have read-only set, but it is not being written to by anything other than the replication

Thank you for the feedback.

I can not repeat the problem with test data. Could you please also send us binary log binlog.000003?

I don't have the actual binlog still sitting around.  But it wasn't one particular incident.  Anytime I stopped and started the slave it would break with this error.

Many thanks for writing a bug report. Did you only stop/start the slave or did you restart the server as well?

I both stopped/started the slave and restarted the server.  Either of them would break it.

Thank you for the feedback.

Problem is I could not repeat described behavior with test data. So example of SHOW CREATE TABLE, query against it and binary log error happens with would be appreciated. Easiest way to get such binary log:

1. On master: FLUSH LOGS
2. On master: issue some query which could break replication
3. On slave: STOP SLAVE; START SLAVE;
4. Check if replication is broken, then send us new small binary log.

But if you can get sequence of queries on master which lead to such behavior it can be enough.

Many thanks for writing a bug report.

This is a known issue for replication of MyISAM tables.

http://dev.mysql.com/doc/refman/5.0/en/replication-features-slaveshutdown.html

According to our documentation, you can give this workaround a try:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-issues.html

SET GLOBAL SLAVE_EXEC_MODE = 'IDEMPOTENT';

Usually, this should work with MyISAM too.

Please let us know, if the workaround helps to shrink your problems with this duplicate key errors.

Thanks for the tip.  I have not tested it yet, but the documentation for SLAVE_EXEC_MODE = 'IDEMPOTENT' says "this mode should be employed in multi-master replication, circular replication, and some other special replication scenarios".  I am not doing anything special, just replicating some inserts (no updates or deletes) into a two-column primary key.  Is just using MyISAM with row-based replication at all an edge case that requires IDEMPOTENT?

I will set your bug report as feature request.

The feature request is to stop this "annoying" duplicate key errors on the slave for MyISAM after restarting the slave.

At the very least this is a documentation bug.  If the default replication (MIXED) breaks with the default storage engine (MyISAM), most people will view this as a bug.  It seems like if you need IDEMPOTENT to be able to stop and start replication at all in the default set up, then IDEMPOTENT should also be the default.  Either that, or the documentation should say in big red letters that you must turn on IDEMPOTENT for replication to work with MyISAM tables.

I am able to hit this repeatedly -- unless --slave-exec-mode=Idempotent is used, issuing START/STOP slave with MyISAM causes row-based replication to break with a duplicate/key or row not found error. This happens even on single-threaded workloads with extremely simple queries and does not require crashing the slave.

This should work in the default mode, so this is a bug.  More
specifically, the slave should stop in the correct position so that
log is not re-executed.

The idempotent mode removes some error conditions so that a log can be
re-executed.  Idempotent = you can apply the log twice with the same
result as applying it once.  It is only recommended in setups where
the logs need to be re-executed in case of failures.  Normally it
should not be needed.

What is the status of this in the 5.1.30 GA?  Can I use RBR with MyISAM without enabling the dangerous replication mode?

Can we get more detail on this? On the bright side, I think the problem is that events are executed twice rather than skipped. Will this always lead to an error? Will this ever lead to a case where the slave data and master data don't match?

I would think that if it was a table without a unique key, the repeated execution would lead to inconsistency, although I haven't tested that case.  It seems like a fairly major bug to me that replication re-executes statements due to a normal start/stop slave on the default table format (myisam).

The behavior of STOP SLAVE in 5.0 was that the slave stops immediately after execution of the current statement, even if inside a transaction. For transactional tables, this is not a problem since in that case, the transaction will be rolled back, but for non-transactional changes, this is disastrous. This behavior have been inherited by 5.1. Note that the same problem will occur when using statement-based replication in both 5.0 and 5.1.

As a resolution to this bug, we will change the semantics of STOP SLAVE so that the currently executing group (that is, transaction or single statement outside a transaction) will finish execution and STOP SLAVE will not return until the SQL thread has stopped executing the transaction (it will, of course, not wait for the SQL thread when a STOP SLAVE IO_THREAD is issued).

This will preserve the current slave_exec_mode default as STRICT in order to catch "real" duplicate key errors, while still retaining backward compatibility with any scripts that expect the slave to have stopped after issuing a STOP SLAVE.

Thanks Mats. I can endorse RBR now, but we don't really use 5.1 yet.

The current is not a Row-Based-Replication issue.
As the referred page
http://dev.mysql.com/doc/refman/5.0/en/replication-features-slaveshutdown.html
claims (although not explains) MyISAM is not shutdown-proof.
SLAVE STOP essentially kills the current executing event, should it be Query or Row one. In the case of MyISAM table and a *query* like UPDATE, DELETE the being killed query can be interrupted in the middle, as on the master so on the slave where an event can be in either the Query or the Row format (INSERT is perculiar in this regard, its multi-rows work is kill-proof).
With the event of either type on Innodb table there is no issue because interruption with STOP SLAVE leads to rollback of the transaction the event is part of.

Idempotent can not be a default because it would release from checking of
existence on the slave of the before image of the row-event.

I dare to say the current is not really a bug, not for RBR at least.
Although we realize STOP SLAVE can be made more soft and intelligent.

After speaking to Support we came to consensus: the current is a bug in too
course implementation of STOP SLAVE to allow interruption of a an event (incl. a plain query!) in the middle.

Correcting my early comments on that STOP SLAVE can interrupt a query event. 
This is not possible. Indeed, STOP SLAVE does not work through kill-query-interface but rather flags the state of the SQL thread a way to stop it only at the end of the being currently handled event.
The can be an internal event for associated with the query event, which is always the last in such group, but there is no harm to execute the internal and stop at the query event itself.
So the current artifact strongly affects the row-based replication.

Still, there is a possibility for this artifact to happen
the Statement format when a transaction on the master modifies both
transaction and non-transaction tables.
Such a mixed group of queries if interrupted on slave will restart from the beginning of the transaction, but the non-transaction tables might be left modified by the first attempt of execution of the transaction.

See also bug #43217 which now appears very like a duplicate of this one. Bug #43217 is a fork of part of bug #319. Should comment on bug #43217 and close as duplicate if this bug fix will fix that problem.

Bug #43686 is set as duplicate of this bug here.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/70478

2836 Andrei Elkin	2009-03-26
      Bug#38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUP
      Bug#319  if while a non-transactional slave is replicating a transaction possible problem 
      
      It is impossible to roll back a mixed engines transaction when one of the engine is
      non-transaction. In replication that fact is crucial because the slave can not safely
      re-apply a transction that was interrupted with STOP SLAVE.
      
      Fixed with making STOP SLAVE not be effective immediately in the case the current
      group of replication events has modified a non-transaction table. In order for slave to leave
      either the group needs finishing or the user issues KILL QUERY|CONNECTION slave_thread_id.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71784

2837 Andrei Elkin	2009-04-09
      Bug #38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUPP_KEY
      Bug#319  if while a non-transactional slave is replicating a transaction possible problem
      
      only testing related: addressing reviewers' comments.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72545

2861 Andrei Elkin	2009-04-21 [merge]
      merge bug#38205 fixes to 5.1-bt

Pushed to 5.1+ -bugteam.

Pushed into 5.1.35 (revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (version source revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (merge vers: 5.1.35) (pib:6)

Documentation: See Bug #319.

Set status NDI: waiting for 6.0 merge.

Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version source revid:aelkin@mysql.com-20090421094803-hho7a34gdsby7382) (merge vers: 6.0.11-alpha) (pib:6)

Also documented in the 6.0.12 changelog (see Bug #319). Closed.

Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090615074202-0r5r2jmi83tww6sf) (version source revid:jonas@mysql.com-20090615070837-9pccutgc7repvb4d) (merge vers: 5.1.35-ndb-6.3.26) (pib:6)

Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090615074335-9hcltksp5cu5fucn) (version source revid:jonas@mysql.com-20090615072714-rmfkvrbbipd9r32c) (merge vers: 5.1.35-ndb-7.0.7) (pib:6)

Pushed into 5.1.35-ndb-6.2.19 (revid:jonas@mysql.com-20090615061520-sq7ds4yw299ggugm) (version source revid:jonas@mysql.com-20090615054654-ebgpz7elwu1xj36j) (merge vers: 5.1.35-ndb-6.2.19) (pib:6)

Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version source revid:aelkin@mysql.com-20090708180952-73h6cpmg3r5wtxwa) (merge vers: 5.4.4-alpha) (pib:11)

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Ignore previous comment about 5.4.2.

The 5.4 fix now has been pushed into 5.4.2.