Bug #51942 Deadlock in replicated statement always fails on retry
Submitted: 11 Mar 2010 7:38 Modified: 2 Jun 2010 10:04
Reporter: Jesper Wisborg Krogh Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.0.77, 5.1.44 OS:Linux (RHEL 5.4)
Assigned to: CPU Architecture:Any
Tags: Contribution, relay log, replication

[11 Mar 2010 7:38] Jesper Wisborg Krogh
Description:
On our master-master system, we have experienced an issue where the slave occasionally deadlock and all subsequent retries will fail. When this happens, it always fails when applying a user variable event that precedes the actual transaction.

An extract of an actual failure is:

# at 95929638
#100303 15:57:09 server id 1  end_log_pos 125556422 	Xid = 150148149
COMMIT/*!*/;
# at 95929665
#100303 15:57:09 server id 1  end_log_pos 125556550 	User_var
SET @`current_date`:=_binary 0x323031302D30332D30332030343A35373A3039 COLLATE `binary`/*!*/;
# at 95929729
#100303 15:57:09 server id 1  end_log_pos 125556603 	User_var
SET @`current_node`:=2/*!*/;
# at 95929782
#100303 15:57:09 server id 1  end_log_pos 125556650 	User_var
SET @`seqnum`:=0/*!*/;
# at 95929829
#100303 15:57:09 server id 1  end_log_pos 125557055 	Query	thread_id=2265390	exec_time=0	error_code=0
SET TIMESTAMP=1267592229/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
BEGIN
/*!*/;

In the above example the deadlock happened in the transaction starting a position 95929905, however the SQL thread started the next attempt at position 95929665 and it then failed setting @`current_date`. Failing to set a user variable doesn't change thd->net.last_errno, so the retry algorithm will keep retrying until slave_trans_retries attempts has been made.

A "START SLAVE SQL_THREAD" will make the slave start again and at this time all works.

There are three issues with the above example:

1: Why does the slave fail to apply the user variable when it is inside the retry loop?
2: Why is it jumping back further than to the start of the transaction? I know it should be harmless to re-apply the user variables, but the principle seems wrong.
3: It can't be seen from the relay log extract above, but the user variable events between the COMMIT and the BEGIN events actually originated on the slave itself. When a user variable event is replicated, the slave writes the event into its binlog with the server id of the slave instead of the one of the originating server (see patch suggestion below).

How to repeat:
Unfortunately I do now have a simple case of how to replicate the issue with the deadlock retry not being able to resolve. If we put the server under heavy load and on purpose ensure updates to the same tables/records happen at both server, then we are usually able to replicate it.

The wrong server id for the user var events can be replicated by enabling logging of slave updates and issue a query that uses a user variable on the master.

Suggested fix:
Attached is a proposed fix for the server id for the user variable events. Making this change will also make it behave the same way as intvar and rand events. The attached fix is for 5.0.77, but it should be simple enough to port to other versions.

It appears that this stops the deadlock retries from breaking, however I am not really convinced that it takes care of the underlying issue(s) - more that it hides it.
[11 Mar 2010 7:39] Jesper Wisborg Krogh
Patch for MySQL 5.0.77 to have the server id for user var events correspond to the server id of the originating server

Attachment: user_var.patch (text/x-patch), 4.61 KiB.

[21 Apr 2010 8:29] Sveta Smirnova
Thank you for the report.

Please send us output of SHOW SLAVE STATUS\G in time when failure occurs.
[18 May 2010 7:06] Jesper Wisborg Krogh
We have already patched all our systems to use a build of MySQL that includes the patch attached above. Given that I unfortunately do not know how to replicate the issue on cue, I cannot provide an output of SHOW SLAVE STATUS\G, however from memory the Last_Errno and Last_Error is just a normal deadlock error, and then Slave_SQL_Running = No.
[2 Jun 2010 10:04] Sveta Smirnova
Thank you for the feedback.

I was not able to repeat described behavior using test data, so closing report as "Can't repeat". Feel free to reopen it if you have more information.