MySQL Bugs: #41263: Replication breaks after 'INSERT ... ON DUPLICATE UPDATE ...'

Bug #41263	Replication breaks after 'INSERT ... ON DUPLICATE UPDATE ...'
Submitted:	5 Dec 2008 15:18	Modified:	23 Mar 2009 15:23
Reporter:	Stanisław Pitucha	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1.30	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	insert on duplicate update, replication, stop

Description:
Issue exists in 5.0.45 too. Might be related to bug#17684. If you don't think removing auto_increment column is a valid workaround, severity might as well be S1.

If load is high enough, replication stops after an 'INSERT ... ON DUPLICATE UPDATE ...' with error:

081204 22:49:27 [ERROR] Slave SQL: Error 'Duplicate entry 'XXXX' for key 'ab_idx'' on query. Default database: 'openser'. Query: 'XXXXXXXXXXXXXXXXXXXXXXXX', Error_code: 1062
081204 22:49:27 [Warning] Slave: Duplicate entry 'XXXX' for key 'registration_idx' Error_code: 1062
081204 22:49:27 [ERROR] Error running query, slave SQL thread aborted. Fix the p
roblem, and restart the slave SQL thread with "SLAVE START". We stopped at log '
mysql-bin.000001' position 68178

`show slave status` still displays 'Waiting for master to send event' - just with the Last_error changed.
Error stops occuring after removing the auto_increment column in that table.

Original schema where it was first observed:
CREATE TABLE `location` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL DEFAULT '',
  `domain` varchar(64) DEFAULT NULL,
  `contact` varchar(255) NOT NULL DEFAULT '',
  `received` varchar(128) DEFAULT NULL,
  `path` varchar(128) DEFAULT NULL,
  `expires` datetime NOT NULL DEFAULT '2020-05-28 21:32:15',
  `q` float(10,2) NOT NULL DEFAULT '1.00',
  `callid` varchar(255) NOT NULL DEFAULT 'Default-Call-ID',
  `cseq` int(11) NOT NULL DEFAULT '13',
  `last_modified` datetime NOT NULL DEFAULT '1900-01-01 00:00:01',
  `flags` int(11) NOT NULL DEFAULT '0',
  `cflags` int(11) NOT NULL DEFAULT '0',
  `user_agent` varchar(255) NOT NULL DEFAULT '',
  `socket` varchar(64) DEFAULT NULL,
  `methods` int(11) DEFAULT NULL,
  UNIQUE KEY `registration_idx` (`username`,`contact`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1

How to repeat:
Create a MyISAM table t with:
- auto_increment primary key `id`
- unique index `ab_idx` on two columns `a` and `b` (varchar(255))
- varchar(255) `c`

Replicate the table in master-master scheme.

Generate lots of queries (issue seems to be load-dependant - >10 queries/sec cause it) like:
INSERT INTO t (a, b, c) VALUES ('some', 'long', 'strings') ON DUPLICATE UPDATE c='strings';
With `a` and `b` values that repeat every 10 or so query.

One of the hosts stops replication as described above after couple of seconds.

Thank you for the report.

Please provide configuration files for both servers. Particularly I want to check if you have set auto_increment_offset and/or auto_increment_increment values.

Yes. It's a master-master replication, so one server has:

server-id = 1
replicate-same-server-id = 0
auto-increment-increment = 2
auto-increment-offset = 1

and the other:

server-id = 2
replicate-same-server-id = 0
auto-increment-increment = 2
auto-increment-offset = 2

Just to make the answer complete:

--- >8 --- config:

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql

server-id = 1
replicate-same-server-id = 0
auto-increment-increment = 2
auto-increment-offset = 1

master-host = XXX
master-user = slave-vnm2
master-password = YYY
master-connect-retry = 60

log-bin = /var/lib/mysql/mysql-bin.log
log-slave-updates

relay-log = /var/lib/mysql/slave-relay.log
relay-log-index = /var/lib/mysql/slave-relay-log.index

expire_logs_days = 7
max_binlog_size = 500M

set-variable=long_query_time=3
log-slow-queries = /var/lib/mysql/slow-queries.log

slave_exec_mode = IDEMPOTENT
binlog_format = STATEMENT

# tuning
skip-innodb

key_buffer_size = 64M
thread_cache_size = 4
table_cache = 80
query_cache_size = 0

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

Thank you for the feedback.

I can not repeat described behavior with generic test. Do you have triggers on table location?

No triggers there. There are only 3 queries running on that table:
INSERT INTO location ... ON DUPLICATE UPDATE ...; (avg. every 2 minutes per user, updating all fields, apart from id)
SELECT ... FROM location WHERE username = ... AND domain = ...; (rare)
DELETE ALL FROM location WHERE expires < NOW(); (once every 20 secs)

On a test system with 10 users, I couldn't get the replication to crash for a couple of days. After putting all other users on the system (around 900), it crashes before binlog reaches 2MB.

The frequency of INSERTs is varied though - for each user it's between 10 seconds and 1 hour - average should be ~2 min. I'm not sure what kind of information would be helpful here. I could send you a config to a daemon which causes that behaviour, but I don't want to post the binlog itself, unless there's a way to remove some data (phone numbers + passwords).

Sorry, I am totally confused here. You wrote it happens on version 5.0.45 but when I look into the header there version 5.1.30 is given.

Which version did you have this issue?

Do you have this issue with actual 5.1 (5.1.30) version too?

Sorry for the confusion.
It happens on:
- 5.0.45 (centos package)
- some 5.1.XX version with XX around 20, but I can't remember the exact number there (rpm on mysql site)
- 5.1.30 (rpm from the mysql site)

The attached error message is from 5.1.30, but the previous versions had the same reason for failing (not sure if the same error_code).

This could a well known effect of statement based replication. Please switch to row based replication and let us know if you still get this behaviour. Row based replication is possible with MySQL 5.1. Of course we recommend to use actual GA version here (5.1.30).

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I tried switching to RBR and it worked fine for the last 2 weeks.

But I'm not sure it's "a well known effect of statement based replication.". I looked for it before and browsed the documentation again just now and found only the "For complex statements, the statement must be evaluated and executed on the slave...." point on http://dev.mysql.com/doc/refman/5.1/en/replication-sbr-rbr.html But it's not included in the "Statements which are unsafe for SBR." point.

Maybe something like "INSERT ... ON DUPLICATE ..." should be mentioned there as a specific example of a query unsafe for SBR?

Notice bug http://bugs.mysql.com/bug.php?id=28781 which looks very similar.

This is a duplicate of bug #28781.