Bug #68992 MySQL
Submitted: 18 Apr 2013 6:09 Modified: 11 Sep 2013 14:44
Reporter: Salman Akram Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.6.10 OS:Windows (Server 2012)
Assigned to: CPU Architecture:Any
Tags: crash, replication

[18 Apr 2013 6:09] Salman Akram
Description:
We have a Master-Master setup and using multiple slave_parallel_workers.

Server crashes whenever there is Erorr 1032 (Can't find record) or 1062 (Duplicate entry) and the replication is resumed (start slave) after resolving the issue (deleting the duplicate or skipping the log entry). This occurs almost 90% of the time and there is no entry in error log. 

Why these errors come is another issue as there is no update on passive master and it is set as read_only. 

What happened is that server crashed due to a semaphore wait and then it resumed replication from an already exectued log entry. Below is the error that came (IO was slow due to other operations on server):

InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2013-04-13 17:21:04 7024  InnoDB: Assertion failure in thread 28708 in file srv0srv.cc line 1738
InnoDB: We intentionally generate a memory trap.

Is there a way to increase semaphore wait timeout?

Then I restarted server and attempted to start replication but below is the error I got:

First  [ERROR] --relay-log-recovery cannot be executed when the slave was stopped with an error or killed in MTS mode; consider using RESET SLAVE or restart the server with --relay-log-recovery = 0 followed by START SLAVE UNTIL SQL_AFTER_MTS_GAPS

relay_log_recovery was then set to 0 and replication was resumed but then Error 1032 and 1062 started coming. 

Note: sync_binlog, sync_relay_log, sync_relay_log_info have default values. 

How to repeat:
1- Setup Master-Master topology
2- Set slave_parallel_workers=20 
3- Somehow get the query stuck for more than 600 secs on system lock (I assume) to get semaphore wait error.
4- Server will crash.
5- Restart server and resume replication.
6- Replication will start from a previous log entry which is already executed and you will get Error 1062/1032.
7- Remove the duplicate entries manually or skip log entry by "set global sql_slave_skip_counter = 1".
8- Now do start slave.
9- Server will crash (90% of the time).
[20 Apr 2013 17:35] MySQL Verification Team
Can you compress and upload the mysql error log from a crashed server?
It sounds similar to http://bugs.mysql.com/bug.php?id=68506
[22 Apr 2013 6:24] Salman Akram
There isn't really much useful info in error logs as I mentioned in my initial comment. 

Here is what I did to get the below error log:

1- Started server after crash.
2- Deleted all the contents in error log.
3- set global_sql_slave_skip_counter = 1.
4- Start slave.
5- Server crashed

Below is the error log (I changed sensitive info with 'aaa') and the server crashed after last line but there isn't any log:

------------------------------
2013-04-22 00:39:53 13948 [Note] Slave I/O thread: connected to master 'aaa@10.1.10.40:3306',replication started in log 'aaa-bin.000391' at position 862634596
2013-04-22 00:39:53 13948 [Note] Slave SQL thread initialized, starting replication in log 'aaa-bin.000391' at position 623584389, relay log 'C:\aaa\MySQL\Relay Logs\aaa-relay-bin.001318' position: 623584554
2013-04-22 00:39:53 13948 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='C:\aaa\MySQL\Relay Logs\aaa-relay-bin.001318', relay_log_pos='623584554', master_log_name='aaa-bin.000391', master_log_pos='623584389' and new position at relay_log_file='C:\aaa\MySQL\Relay Logs\aaa-relay-bin.001318', relay_log_pos='623584979', master_log_name='aaa-bin.000391', master_log_pos='623584814' 
2013-04-22 00:39:53 13948 [ERROR] Slave SQL: Error 'Duplicate entry '837822' for key 'PRIMARY'' on query. Default database: 'aaa'. Query: 'INSERT INTO `userusagetime` (`UsageID`, `UserID`, `SessionID`, `StartDateTime`, `EndDateTime`, `AppID`, `OldSessionID`) VALUES (837822, 11719, 208939, '2013-04-15 23:25:59', '2013-04-15 23:26:18', 1, NULL)', Error_code: 1062
2013-04-22 00:39:53 13948 [Warning] Slave: Duplicate entry '837822' for key 'PRIMARY' Error_code: 1062
2013-04-22 00:39:53 13948 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'aaa-bin.000391' position 623584814
2013-04-22 00:40:06 13948 [Note] Slave SQL thread initialized, starting replication in log 'aaa-bin.000391' at position 623584814, relay log 'C:\aaa\MySQL\Relay Logs\aaa-relay-bin.001318' position: 623584979
[11 Aug 2013 14:44] MySQL Verification Team
please try 5.6.13 in case this is a duplicate of
http://bugs.mysql.com/bug.php?id=68506 which is fixed.
[12 Sep 2013 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".