Bug #38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUPP_KEY
Submitted: 17 Jul 2008 18:44 Modified: 6 May 16:47
Reporter: Eric Jensen
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:5.1.25 OS:Linux (x86_64)
Assigned to: Andrei Elkin Target Version:5.1+
Tags: RBR, replication
Triage: Triaged: D2 (Serious)

[17 Jul 2008 18:44] Eric Jensen
Description:
I can't stop and start the replication at all without breaking the
replication. If I do "stop slave" and then "start slave" again it
immediately has an error like this, seemingly for random tables that
have been written to (not one in particular):

Could not execute Write_rows event on table
XXX; Duplicate entry
'YYY' for key 'ZZZ',
Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's
master log binlog.000003, end_log_pos 704849 

In my app, I'm doing lots of batched, large "insert values ..." statements into MyISAM
tables that take the bulk of the time, so it's tough to be able to stop replication and
not hit the middle of one of them.  Is it the case that with a non-transactional engine,
it is impossible to guarantee that stopping and starting RBR won't break replication
then?  The previous conversations below seemed to suggest this...

There are a few RBR inconsistency bugs floating around that may explain this, although
none of them suggest this happens on every stop/start slave, and none touch on this
documentation issue: 
http://bugs.mysql.com/search.php?search_for=&status=Active&severity=&limit=90&order_by=pri...

I realize this is not a well-encapsulated bug report.  I originally inquired about it on
a few places:

http://lists.mysql.com/replication/1335 
http://forums.mysql.com/read.php?26,216227,216227
http://bugs.mysql.com/bug.php?id=36826

but received no response.  It is preventing me from upgrading to 5.1.  Shouldn't the
lists and forums be merged?

How to repeat:
While doing large batched inserts on the master using either row or mixed log format,
stop and start the slave.  I also had this problem when trying 5.1.24
[17 Jul 2008 18:59] Eric Jensen
The last bullet on http://dev.mysql.com/doc/refman/5.1/en/replication-sbr-rbr.html also
seems to also suggest a problem with RBR on non-transactional tables to me, but I find it
hard to understand...I think some elaboration is in order.
[21 Jul 2008 13:42] Sveta Smirnova
Thank you for the report.

Please provide output of SHOW CREATE TABLE problem_table and configuration files for both
master and slave. Also please indicate if slave is read-only.
[22 Jul 2008 0:28] Eric Jensen
CREATE TABLE `problem_table` (
  `a_id` int(10) unsigned NOT NULL,
  `b_id` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`a_id`,`b_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
[22 Jul 2008 0:38] Eric Jensen
slave does not have read-only set, but it is not being written to by anything other than
the replication
[22 Jul 2008 19:45] Sveta Smirnova
Thank you for the feedback.

I can not repeat the problem with test data. Could you please also send us binary log
binlog.000003?
[23 Jul 2008 1:25] Eric Jensen
I don't have the actual binlog still sitting around.  But it wasn't one particular
incident.  Anytime I stopped and started the slave it would break with this error.
[23 Jul 2008 12:28] Susanne Ebrecht
Many thanks for writing a bug report. Did you only stop/start the slave or did you restart
the server as well?
[23 Jul 2008 18:56] Eric Jensen
I both stopped/started the slave and restarted the server.  Either of them would break it.
[24 Jul 2008 21:31] Sveta Smirnova
Thank you for the feedback.

Problem is I could not repeat described behavior with test data. So example of SHOW
CREATE TABLE, query against it and binary log error happens with would be appreciated.
Easiest way to get such binary log:

1. On master: FLUSH LOGS
2. On master: issue some query which could break replication
3. On slave: STOP SLAVE; START SLAVE;
4. Check if replication is broken, then send us new small binary log.

But if you can get sequence of queries on master which lead to such behavior it can be
enough.
[4 Aug 2008 15:07] Susanne Ebrecht
Many thanks for writing a bug report.

This is a known issue for replication of MyISAM tables.

http://dev.mysql.com/doc/refman/5.0/en/replication-features-slaveshutdown.html

According to our documentation, you can give this workaround a try:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-issues.html

SET GLOBAL SLAVE_EXEC_MODE = 'IDEMPOTENT';

Usually, this should work with MyISAM too.

Please let us know, if the workaround helps to shrink your problems with this duplicate
key errors.
[4 Aug 2008 18:33] Eric Jensen
Thanks for the tip.  I have not tested it yet, but the documentation for SLAVE_EXEC_MODE =
'IDEMPOTENT' says "this mode should be employed in multi-master replication, circular
replication, and some other special replication scenarios".  I am not doing anything
special, just replicating some inserts (no updates or deletes) into a two-column primary
key.  Is just using MyISAM with row-based replication at all an edge case that requires
IDEMPOTENT?
[5 Aug 2008 12:03] Susanne Ebrecht
I will set your bug report as feature request.

The feature request is to stop this "annoying" duplicate key errors on the slave for
MyISAM after restarting the slave.
[5 Aug 2008 16:04] Eric Jensen
At the very least this is a documentation bug.  If the default replication (MIXED) breaks
with the default storage engine (MyISAM), most people will view this as a bug.  It seems
like if you need IDEMPOTENT to be able to stop and start replication at all in the
default set up, then IDEMPOTENT should also be the default.  Either that, or the
documentation should say in big red letters that you must turn on IDEMPOTENT for
replication to work with MyISAM tables.
[24 Sep 2008 21:41] Philip Stoev
I am able to hit this repeatedly -- unless --slave-exec-mode=Idempotent is used, issuing
START/STOP slave with MyISAM causes row-based replication to break with a duplicate/key
or row not found error. This happens even on single-threaded workloads with extremely
simple queries and does not require crashing the slave.
[28 Sep 2008 17:56] Lars Thalmann
This should work in the default mode, so this is a bug.  More
specifically, the slave should stop in the correct position so that
log is not re-executed.

The idempotent mode removes some error conditions so that a log can be
re-executed.  Idempotent = you can apply the log twice with the same
result as applying it once.  It is only recommended in setups where
the logs need to be re-executed in case of failures.  Normally it
should not be needed.
[28 Nov 2008 7:46] Eric Jensen
What is the status of this in the 5.1.30 GA?  Can I use RBR with MyISAM without enabling
the dangerous replication mode?
[29 Nov 2008 16:02] Mark Callaghan
Can we get more detail on this? On the bright side, I think the problem is that events are
executed twice rather than skipped. Will this always lead to an error? Will this ever lead
to a case where the slave data and master data don't match?
[29 Nov 2008 17:53] Eric Jensen
I would think that if it was a table without a unique key, the repeated execution would
lead to inconsistency, although I haven't tested that case.  It seems like a fairly major
bug to me that replication re-executes statements due to a normal start/stop slave on the
default table format (myisam).
[9 Dec 2008 22:56] Mats Kindahl
The behavior of STOP SLAVE in 5.0 was that the slave stops immediately after execution of
the current statement, even if inside a transaction. For transactional tables, this is
not a problem since in that case, the transaction will be rolled back, but for
non-transactional changes, this is disastrous. This behavior have been inherited by 5.1.
Note that the same problem will occur when using statement-based replication in both 5.0
and 5.1.

As a resolution to this bug, we will change the semantics of STOP SLAVE so that the
currently executing group (that is, transaction or single statement outside a
transaction) will finish execution and STOP SLAVE will not return until the SQL thread
has stopped executing the transaction (it will, of course, not wait for the SQL thread
when a STOP SLAVE IO_THREAD is issued).

This will preserve the current slave_exec_mode default as STRICT in order to catch "real"
duplicate key errors, while still retaining backward compatibility with any scripts that
expect the slave to have stopped after issuing a STOP SLAVE.
[10 Dec 2008 1:49] Mark Callaghan
Thanks Mats. I can endorse RBR now, but we don't really use 5.1 yet.
[10 Dec 2008 19:21] Andrei Elkin
The current is not a Row-Based-Replication issue.
As the referred page
http://dev.mysql.com/doc/refman/5.0/en/replication-features-slaveshutdown.html
claims (although not explains) MyISAM is not shutdown-proof.
SLAVE STOP essentially kills the current executing event, should it be Query or Row one.
In the case of MyISAM table and a *query* like UPDATE, DELETE the being killed query can
be interrupted in the middle, as on the master so on the slave where an event can be in
either the Query or the Row format (INSERT is perculiar in this regard, its multi-rows
work is kill-proof).
With the event of either type on Innodb table there is no issue because interruption with
STOP SLAVE leads to rollback of the transaction the event is part of.

Idempotent can not be a default because it would release from checking of
existence on the slave of the before image of the row-event.

I dare to say the current is not really a bug, not for RBR at least.
Although we realize STOP SLAVE can be made more soft and intelligent.
[10 Dec 2008 20:21] Andrei Elkin
After speaking to Support we came to consensus: the current is a bug in too
course implementation of STOP SLAVE to allow interruption of a an event (incl. a plain
query!) in the middle.
[11 Dec 2008 0:14] Andrei Elkin
Correcting my early comments on that STOP SLAVE can interrupt a query event. 
This is not possible. Indeed, STOP SLAVE does not work through kill-query-interface but
rather flags the state of the SQL thread a way to stop it only at the end of the being
currently handled event.
The can be an internal event for associated with the query event, which is always the
last in such group, but there is no harm to execute the internal and stop at the query
event itself.
So the current artifact strongly affects the row-based replication.

Still, there is a possibility for this artifact to happen
the Statement format when a transaction on the master modifies both
transaction and non-transaction tables.
Such a mixed group of queries if interrupted on slave will restart from the beginning of
the transaction, but the non-transaction tables might be left modified by the first
attempt of execution of the transaction.
[27 Feb 0:11] James Day
See also bug #43217 which now appears very like a duplicate of this one. Bug #43217 is a
fork of part of bug #319. Should comment on bug #43217 and close as duplicate if this bug
fix will fix that problem.
[23 Mar 15:45] Susanne Ebrecht
Bug #43686 is set as duplicate of this bug here.
[26 Mar 9:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/70478

2836 Andrei Elkin	2009-03-26
      Bug#38205 Row-based Replication (RBR) causes inconsistencies: HA_ERR_FOUND_DUP
      Bug#319  if while a non-transactional slave is replicating a transaction possible
problem 
      
      It is impossible to roll back a mixed engines transaction when one of the engine is
      non-transaction. In replication that fact is crucial because the slave can not
safely
      re-apply a transction that was interrupted with STOP SLAVE.
      
      Fixed with making STOP SLAVE not be effective immediately in the case the current
      group of replication events has modified a non-transaction table. In order for
slave to leave
      either the group needs finishing or the user issues KILL QUERY|CONNECTION
slave_thread_id.
[9 Apr 15:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/71784

2837 Andrei Elkin	2009-04-09
      Bug #38205 Row-based Replication (RBR) causes inconsistencies:
HA_ERR_FOUND_DUPP_KEY
      Bug#319  if while a non-transactional slave is replicating a transaction possible
problem
      
      only testing related: addressing reviewers' comments.
[21 Apr 10:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72545

2861 Andrei Elkin	2009-04-21 [merge]
      merge bug#38205 fixes to 5.1-bt
[21 Apr 11:54] Andrei Elkin
Pushed to 5.1+ -bugteam.
[5 May 21:43] Bugs System
Pushed into 5.1.35 (revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (version
source revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (merge vers: 5.1.35)
(pib:6)
[6 May 11:58] Jon Stephens
Documentation: See Bug #319.

Set status NDI: waiting for 6.0 merge.
[6 May 16:07] Bugs System
Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version
source revid:aelkin@mysql.com-20090421094803-hho7a34gdsby7382) (merge vers: 6.0.11-alpha)
(pib:6)
[6 May 16:47] Jon Stephens
Also documented in the 6.0.12 changelog (see Bug #319). Closed.
[15 Jun 10:29] Bugs System
Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090615074202-0r5r2jmi83tww6sf)
(version source revid:jonas@mysql.com-20090615070837-9pccutgc7repvb4d) (merge vers:
5.1.35-ndb-6.3.26) (pib:6)
[15 Jun 11:09] Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090615074335-9hcltksp5cu5fucn)
(version source revid:jonas@mysql.com-20090615072714-rmfkvrbbipd9r32c) (merge vers:
5.1.35-ndb-7.0.7) (pib:6)
[15 Jun 11:50] Bugs System
Pushed into 5.1.35-ndb-6.2.19 (revid:jonas@mysql.com-20090615061520-sq7ds4yw299ggugm)
(version source revid:jonas@mysql.com-20090615054654-ebgpz7elwu1xj36j) (merge vers:
5.1.35-ndb-6.2.19) (pib:6)
[23 Jul 12:24] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version
source revid:aelkin@mysql.com-20090708180952-73h6cpmg3r5wtxwa) (merge vers: 5.4.4-alpha)
(pib:11)
[12 Aug 23:56] Paul DuBois
Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.
[12 Aug 23:58] Paul DuBois
Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.
[15 Aug 0:48] Paul DuBois
Ignore previous comment about 5.4.2.
[7 Oct 3:32] Paul DuBois
The 5.4 fix now has been pushed into 5.4.2.