MySQL Bugs: #89143: Commit order deadlock + retry logic is not considering trx error cases

Bug #89143	Commit order deadlock + retry logic is not considering trx error cases
Submitted:	8 Jan 2018 22:10	Modified:	26 Apr 2018 10:39
Reporter:	Jean-François Gagné
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.20	OS:	Any
Assigned to:	Venkatesh Duggirala	CPU Architecture:	Any

Description:
Hi,

In Bug#89141, I describe a situation generating an error in the Group Replication applier. In my tests and as described in that other bug, the applier gets 4 duplicate key errors and auto-magically restarts/retries after the failure.

This restart is caused by the default value of the slave_transaction_retries global variable (10). I set this variable to 0 and the applier does not automatically retry.

In the manual ([1]), we can read that the slave_transaction_retries global variable controls retries in case of an InnoDB deadlock or an InnoDB timeout. However, I do not think that Bug#89141 qualifies as a deadlock or as a timeout condition.

[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_slave_transa...

My understanding of the slave_transaction_retries global variable is that it is useful to avoid deadlock in parallel replication while preserving commit order (see Bug#74177 for an example). This is not what is happening in Bug#89141 where we get a duplicate key violation.

Also in Bug#86078 (standard replication in MySQL 8.0), the applier did not automatically retry (I recently re-tested with MySQL 8.0.3 and with the default value of ten for slave_transaction_retries, and replication does not resume after the duplicate key violation). So it looks like we have a difference in behavior between standard and group replication (no retry for standard and retries for group, both with slave_transaction_retries set to 10).

Many thanks for looking into that,

JFG

How to repeat:
See Bug#89141.

In Group Replication in MySQL 5.7.20 and MySQL 8.0.3, the applier restarts/retries after the error.

However, in MySQL 8.0.3 with standard replication with Write Set, the applier does not restart/retry (Bug#86078).

Suggested fix:
I would recommend not retrying for a duplicate key violation in Group Replication (like for standard replication).

If you really want to keep the retry behavior in Group Replication, please add another parameter so someone can disable the retry for duplicate key violation and enable retries for deadlock.

Hello Jean,

Thank you for the report and feedback.

Thanks,
Umesh

Taken from Bug#89141

Attachment: 89141_5.7.20.results (application/octet-stream, text), 46.86 KiB.

Post by Developer:
================== 
Hello Jean, 

Thank you for using MySQL and thank you for raising the issue.

Please note that I am able to reproduce the issue even on mysql-8.0.3 (standard replication); the issue is not limited to Group Replication. There is no difference in the slave_transaction_retries logic between standard replication and Group Replication. (The problem is in the commit order deadlock detection + retry logic, which exists in 5.7 as well.)

Problem: If two workers are executing two transactions, they can end up in a deadlock because of the preserve commit order logic. Once the server detects a deadlock between two workers, it rolls back the later transaction (later in the commit order), so that the first transaction (first in the order) gets the lock and continues its operation; the server then retries the second transaction. But after the server detects the deadlock, the second transaction may have failed due to some other error (temporary error). This case is not considered when we try to "retry the transaction".

I will be uploading a test script for your reference. If you run it against 5.7, you will see the "Duplicate entry" error twice in the error log.
The test script is a small modification of the first test case in 
mysql-test/suite/rpl/t/rpl_mts_slave_preserve_commit_order_deadlock.test.

I will be changing the category, version and the title of the bug appropriately, please let me know if you have any questions on the same. 

Regards,
Venkatesh.

Test script to reproduce it on version 5.7

Attachment: bug89413_mysql-5.7.test (application/octet-stream, text), 2.12 KiB.

Posted by developer:
 
I guess this is a duplicate of Bug#27090385.

Posted by developer:
 
Changelog entry added for MySQL 8.0.12 and MySQL 5.7.23:
Automatic retrying of transactions on a replication slave, as specified by the slave_transaction_retries system variable, was taking place even if the transaction had a non-temporary error that would repeat on retrying or that indicated wider issues. Now, transactions are only automatically retried if there is either no error, or an error that is only temporary.

Also noted in slave_transaction_retries documentation.
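
The rule in the changelog entry can be read as a small predicate. The sketch below is only an illustration of that rule, not the actual server logic. The `should_retry` function and the `TEMPORARY_ERRORS` set are hypothetical, and the set is an assumption limited to the two InnoDB conditions the manual mentions for slave_transaction_retries (deadlock and lock wait timeout); the server's real classification may cover more errors.

```python
from typing import Optional

# MySQL server error codes used for illustration.
ER_DUP_ENTRY = 1062          # duplicate key violation: repeats on retry
ER_LOCK_WAIT_TIMEOUT = 1205  # InnoDB lock wait timeout
ER_LOCK_DEADLOCK = 1213      # InnoDB deadlock

# Assumption: only these two conditions count as temporary in this sketch.
TEMPORARY_ERRORS = {ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK}


def should_retry(error_code: Optional[int], retries_done: int, max_retries: int) -> bool:
    """Retry only if retries remain and the error is absent or temporary."""
    if retries_done >= max_retries:
        return False  # max_retries plays the role of slave_transaction_retries
    return error_code is None or error_code in TEMPORARY_ERRORS


print(should_retry(ER_LOCK_DEADLOCK, 0, 10))
print(should_retry(ER_DUP_ENTRY, 0, 10))
```

With a limit of 0 the predicate never allows a retry, which matches the reporter's observation that setting slave_transaction_retries to 0 stops the automatic retry.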

Posted by developer:
 
Updated releases on changelog entry to 5.7.24 and 8.0.13.