MySQL Bugs: #61133: Master -> Slave : duplicate key error after running fine for days

Bug #61133	Master -> Slave : duplicate key error after running fine for days
Submitted:	11 May 2011 13:29	Modified:	20 Dec 2012 20:55
Reporter:	Thomas Förster	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.5.10	OS:	Linux (SuSE Linux Enterprise 11.1)
Assigned to:		CPU Architecture:	Any
Tags:	1062, break, duplicate key, mass insert, replication

Description:
From my forum post: http://forums.mysql.com/read.php?26,416411,416411#msg-416411

6 days a go i set up a master -> slave replication using self-compiled mysql 5.5.10 and linux 2.6 x64 on both ends. The slave was created by copying all files (while the server was down and synced everything on disc) via scp to the slave server.

Config (slave and master):

innodb_file_format=Barracuda
innodb_file_per_table = 1
log_bin
max_binlog_size = 128M
binlog_format = mixed
expire_logs_days = 30

Database size is about 150gb, all tables are innodb.

All ran fine for 6 days until today, when the replication broke with a "duplicate key" error on the slave: 

Slave SQL: Error 'Duplicate entry '10139467' for key 'PRIMARY'' on query. Default database: 'xxx'. Query: 'insert into ABC (uid, iid, t_insert)
(select dat.uid, '14635' as iid, now() as t_insert
from XYZ dat left join ABC ia on ia.iid = 14635 and ia.uid = dat.uid
where dat.complete = 2
and dat.uid != 0
and ia.uid is null
and dat.sid = 118263)', Error_code: 1062

So there occured a duplicate key error where it should not because nothing is writing on the slave (i'm 100% sure about this), binlog_format is "mixed" and the master doesn't have any duplicate key problems.
Something came to my attention when looking at the data in table "ABC" on the master:

+----------+-------+--------+---------------------+---------------------+-----
| id | iid | uid | t_last | t_insert | ....
+----------+-------+--------+---------------------+---------------------+-----
.... some more rows with iid=14635 ....
| 10139464 | 14635 | 287394 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
| 10139465 | 14635 | 187010 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
| 10139466 | 14635 | 106345 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
| 10139467 | 14593 | 789885 | 2011-04-14 09:47:11 | 2011-04-14 09:47:11 | <--- look here
| 10139468 | 14635 | 96475 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
| 10139469 | 14635 | 113499 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
| 10139470 | 14635 | 106104 | 2011-04-14 09:47:10 | 2011-04-14 09:47:10 |
.... some more rows with iid=14635 ....
+----------+-------+--------+---------------------+---------------------+-----

Row with id 10139467 does have a "t_last" / "t_insert" timestamp that is one second later (or in the future) then the rest.

Mysql was executing the sql statement that failed on the slave at 09:47:10. While executing, one second later another process was writing an additional record to the table at 09:47:11. (Our application does not use any locks or transactions here)
So it seems that mysql "mixed" both inserts together. No problem here because innodb should care for that.
But this combination is failing on the slave with a duplicate key error. When looking at the data on the slave server, every "id" from the failing query was missing in the table but id
"10139467" was already there.
The query from 09:47:11 seems to be executed by the slave before the query from 09:47:10. I thought "binlog_format=mixed" should fix those problems but in my case it doesn't.

After deleting row "10139467" from my slave the replication continued but id 10139467 was given to the initial failing query so that the whole table is not in sync anymore. The replication itself is still working well now. 

How to repeat:
Now it's nearly one month later and this happended two times again now. I wasn't able to reproduce this by hand.

Suggested fix:
Maybe setting binlog_format=row can fix this?

We've been having a similar issue and found this way of it happening.

We use statement based replication only.
This is with mysql 5.1.49, I checked the changelog til 5.1.58 and didn't see anything that tells me it is fixed.

According to the binlog it goes like this:

#110514  7:54:49 server id 42  end_log_pos 6683606      Intvar
SET INSERT_ID=993022101/*!*/;

# at 6683606
       
#110514  7:54:49 server id 42  end_log_pos 6684073      Query   thread_id=130928719     exec_time=0     error_code=0
  
SET TIMESTAMP=1305352489/*!*/;

INSERT INTO table () VALUES ()
... skip unrelated stuff ...
#110514  7:54:48 server id 42  end_log_pos 6771858      Intvar                                     
SET INSERT_ID=993021441/*!*/;

# at 6771858

SET TIMESTAMP=1305352488/*!*/;

INSERT INTO table() VALUES (), (), () (about 950 rows added)

... skip unrelated stuff ...

#110514  7:54:49 server id 42  end_log_pos 6914506      Query   thread_id=130928719     exec_time=0
     error_code=0
SET INSERT_ID=993031861/*!*/;
SET TIMESTAMP=1305352489/*!*/;

INSERT INTO table () VALUES ()  

Autoincrement increases by 10 each time.

The first query seems to use an autoincrement value that would have been allocated by the mass insert query. 

Our slaves also report the duplicate id.

Thank you for the report.

Please try with current version 5.5.14 and if problem still exists send us output of SHOW CREATE TABLE ABC and XYZ

I've upgrade my master and slave to 5.5.14 and copied over all data from my master to the slave (via lvm snapshot).

The mysql server on the slave ran fine after starting it until i configured it to act as a slave. After 5 minutes of doing what a slave needs to do mysql crashed. It keeps crashing immediately every time i restart mysql while recovering:

110719 11:56:07 InnoDB: The InnoDB memory heap is disabled
110719 11:56:07 InnoDB: Mutexes and rw_locks use GCC atomic builtins
110719 11:56:07 InnoDB: Compressed tables use zlib 1.2.3
110719 11:56:07 InnoDB: Using Linux native AIO
110719 11:56:07 InnoDB: Initializing buffer pool, size = 256.0M
110719 11:56:07 InnoDB: Completed initialization of buffer pool
110719 11:56:07 InnoDB: highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 720247826921
110719 11:56:07  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 720249568654
110719 11:56:19  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 110719 11:56:19 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=67108864
read_buffer_size=8388608
max_used_connections=0
max_threads=20
thread_count=0
connection_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 557283 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x30000
/usr/local/bin/mysqld(my_print_stacktrace+0x33)[0x787b03]
/usr/local/bin/mysqld(handle_segfault+0x37c)[0x510bac]
/lib64/libpthread.so.0(+0xf5d0)[0x7f97105175d0]
/usr/local/bin/mysqld[0x8c32f8]
/usr/local/bin/mysqld[0x8c4b32]
/usr/local/bin/mysqld[0x8b0ef1]
37 /usr/local/bin/mysqld[0x8b279d]
/usr/local/bin/mysqld[0x857c89]
/usr/local/bin/mysqld[0x882f59]
/usr/local/bin/mysqld[0x819c88]
/lib64/libpthread.so.0(+0x75f0)[0x7f971050f5f0]
/lib64/libc.so.6(clone+0x6d)[0x7f970f30f87d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
110719 11:56:19 mysqld_safe mysqld from pid file /srv/mysqldata/server.pid ended

So something is terribly wrong here... maybe the above can help to track down these strange problems.

Are you tried with our binaries instead of self compiled?. Thanks.

No I did not and I'm sorry but I'm currently unable to do so because my slave server was removed.

Thank you for the feedback.

But we still need more information to repeat it on our side: at lease output of SHOW CREATE TABLE ABC and XYZ and compile options you used to build MySQL server. Also if you can test our binaries in your environment this would be better. Please update the report when you are able to provide this information.

Compile options:

CFLAGS="-march=native -O2 -fomit-frame-pointer -pipe" CXXFLAGS="-march=native -O2 -fomit-frame-pointer -pipe" LDFLAGS="-Wl,--as-needed -Wl,-O1" \
cmake \
-DCMAKE_INSTALL_PREFIX=/usr \
-DDEFAULT_CHARSET=utf8 \
-DDEFAULT_COLLATION=utf8_general_ci \
-DMYSQL_DATADIR=/srv/mysqldata \
-DWITH_COMMENT='MySQL 5.5.10' \
-DWITH_SSL=yes \
-DWITH_PIC=yes \
-DMYSQL_UNIX_ADDR=/tmp/mysql.sock \
-DHAVE_AIO_H=/usr \
-DWITH_ARCHIVE_STORAGE_ENGINE=1 \
-DWITH_BLACKHOLE_STORAGE_ENGINE=1 \
-DWITH_DEBUG=OFF \
-DSYSCONFDIR=/etc/mysql \
-DENABLE_DOWNLOADS=1 \
.

Tables:

CREATE TABLE "ABC" (
  "id" int(10) unsigned NOT NULL AUTO_INCREMENT,
  "iid" int(10) unsigned NOT NULL DEFAULT '0',
  "uid" int(10) unsigned NOT NULL DEFAULT '0',
  "t_last" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  "t_insert" timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  "ppty" smallint(6) DEFAULT NULL ,
  "rv" datetime DEFAULT NULL ,
  "av" mediumint(9) DEFAULT NULL ,
  "acv" float(11,4) DEFAULT NULL ,
  PRIMARY KEY ("id"),
  KEY "iid" ("iid"),
  KEY "uid" ("uid"),
  KEY "idx_incentive_account_t_insert" ("t_insert"),
  KEY "idx_sdate" ("ppty"),
  KEY "idx_iid_uid" ("iid","uid"),
  KEY "idx_reverse" ("rv"),
  KEY "idx_alternative_value" ("av"),
  CONSTRAINT "fk_iid" FOREIGN KEY ("iid") REFERENCES "ii" ("iid") ON UPDATE CASCADE,
  CONSTRAINT "fk_inc_account_uid" FOREIGN KEY ("uid") REFERENCES "pw" ("uid") ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

CREATE TABLE "ii" (
  "iid" int(10) unsigned NOT NULL AUTO_INCREMENT,
  "pid" int(10) unsigned DEFAULT NULL,
  "value" mediumint(9) NOT NULL DEFAULT '0',
  "label_intern" varchar(128) NOT NULL,
  "label_extern" varchar(128) NOT NULL,
  "field_code" tinyint(3) unsigned NOT NULL DEFAULT '0',
  "coid" int(10) unsigned NOT NULL DEFAULT '0',
  "sid" int(10) unsigned NOT NULL DEFAULT '0',
  "incentive_type" tinyint(3) unsigned NOT NULL DEFAULT '1',
  "t_last" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  "t_insert" timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  "pcid" int(10) unsigned NOT NULL DEFAULT '1',
  "usage" tinyint(3) unsigned NOT NULL DEFAULT '0' ,
  "cashval" float(11,4) NOT NULL DEFAULT '0.0000',
  "md5_iid" varchar(16) DEFAULT NULL,
  PRIMARY KEY ("iid"),
  UNIQUE KEY "uk_coid_md5" ("coid","md5_iid"),
  KEY "pid" ("pid"),
  KEY "coid" ("coid"),
  KEY "sid" ("sid"),
  KEY "idx_pid_coid" ("pid","coid"),
  KEY "idx_pid_sid_coid" ("pid","sid","coid"),
  KEY "idx_field_code" ("field_code"),
  KEY "idx_pcid" ("pcid"),
  KEY "cashval" ("cashval")
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC

CREATE TABLE "XYZ" (
  "nr" int(10) unsigned NOT NULL AUTO_INCREMENT,
  "uid" int(10) unsigned NOT NULL,
  "email" varchar(255) NOT NULL,
  "lid" varchar(32) NOT NULL,
  "complete" tinyint(4) NOT NULL DEFAULT '0',
  "mailstati" tinyint(4) NOT NULL DEFAULT '0',
  "lsid" tinyint(4) NOT NULL DEFAULT '0',
  "sid" int(11) NOT NULL DEFAULT '0',
  "mid" varchar(255) DEFAULT NULL,
  "extid" int(11) NOT NULL DEFAULT '0',
  "naccess" int(11) NOT NULL DEFAULT '0',
  "feedback_box" varchar(255) DEFAULT NULL,
  "pro_box" varchar(255) DEFAULT NULL,
  "contra_box" varchar(255) DEFAULT NULL,
  "valid" tinyint(4) DEFAULT '1',
  "schedule" tinyint(4) NOT NULL DEFAULT '0',
  "ip_addr" varchar(32) DEFAULT NULL,
  "pimp" varchar(32) DEFAULT NULL,
  "aps_01" tinyint(4) DEFAULT NULL,
  "aps_02" tinyint(4) DEFAULT NULL,
  "aps_03" tinyint(4) DEFAULT NULL,
  "aps_04" tinyint(4) DEFAULT NULL,
  "aps_05" tinyint(4) DEFAULT NULL,
  "t_last" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  "t_insert" timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  "t_start" datetime DEFAULT NULL,
  "t_stop" datetime DEFAULT NULL,
  PRIMARY KEY ("nr"),
  UNIQUE KEY "lid" ("lid"),
  KEY "uid" ("uid"),
  KEY "email" ("email"),
  KEY "complete" ("complete"),
  KEY "mailstati" ("mailstati"),
  KEY "sid" ("sid"),
  KEY "lsid" ("lsid"),
  KEY "extid" ("extid")
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Unfortunately I can't test the replication again because my company decided to remove the slave server so I'm just working with the master which is running fine since months. Hope the above information helps a bit.

Thank you for the feedback.

In version 5.5.10 this failing query was written to binary log in mixed format as a statement. This could lead to collisions. Now (I tested with current version 5.5.28) such query is written as a row. This makes chances for collision smaller.

Actually I can  not say if this query was reason for the collision having only information you provided in this report. Issue could cause queries which you performed earlier.

But since you removed replication setup and I see improvement of handling such queries in newer versions I close this as "Can't repeat"