| Bug #72313 | stop sql_thread, start sql_thread causes a trx to log with a different GTID | ||
|---|---|---|---|
| Submitted: | 11 Apr 2014 4:43 | Modified: | 27 Aug 2014 13:45 |
| Reporter: | Santosh Praneeth Banda | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
| Version: | 5.6.16 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | replication | ||
[27 Aug 2014 13:45]
David Moss
The following was added to the 5.6.21 and 5.7.5 changelog: When an event group was spanned across multiple relay log files, a slave could incorrectly identify GTID-header group boundaries. This meant that when a transaction was retried, or if the SQL thread was stopped in the middle of a transaction after some rotates, the Gtid_log_event was being silently skipped on the slave, and the transaction was logged with the slave's GTID. This problem also impacted on using START SLAVE UNTIL MASTER_LOG_POS = log_pos; with GTIDs enabled. If log_pos was in the middle of a transaction, the Gtid_log_event was not correctly detected as the beginning of the transaction and replication stopped before this event. With this fix, threads correctly detect that they are part of a group, and this is used to check if a Gtid_log_event is part of a transaction.
[25 Sep 2014 11:52]
Laurynas Biveinis
revno: 6083
committer: Joao Gramacho <joao.gramacho@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2014-07-18 03:49:01 +0200
message:
BUG#18652178 STOP SQL_THREAD, START SQL_THREAD CAUSES A TRX TO LOG
WITH A DIFFERENT GTID
BUG#18306199 START SLAVE UNTIL MASTER_LOG_POS=MIDDLE-OF-GTID-EVENT
STOPS BEFORE TRANSACTION
Problem:
=======
Slave loses track of GTID-header group boundaries when the group spans
across multiple relay log files.
This means that when the transaction is retried, or if you stop the SQL
thread in the middle of a transaction after some Rotates (considering
that the transaction/group was spanned into multiple relay log files),
the Gtid_log_event will be silently skipped on slave, and the
transaction will be logged with a slave's GTID.
Also, when using "START SLAVE UNTIL MASTER_LOG_POS = x;", if "x" is in
the middle of a transaction, the server is supposed to complete
the transaction. This works fine when GTIDs are disabled. However,
when GTIDs are enabled, if "x" is in the middle of the Gtid_log_event,
it will stop before the transaction, not considering the
Gtid_log_event as the beginning of the transaction.
Analysis:
========
When Rotate events are applied at the slave, the SQL thread verifies
if it is not inside of a transaction to update SQL thread position.
For STS, the SQL thread uses the Relay_log_info::is_in_group()
function to determine if it is inside of a transaction.
It was found a problem in Relay_log_info::is_in_group(). It wasn't
considering a Gtid_log_event as the beginning of a group/transaction.
Because of this problem, the SQL thread was updating its position
when applying a Rotate_log_event that immediately follows a
Gtid_log_event (i.e., in the middle of a transaction but outside
BEGIN...COMMIT).
This should not happen, as SQL thread position should not be updated
in the middle of a transaction so it could retry (or re-apply) the
transaction from the beginning in the case of failures because of
InnoDB deadlock or because the transaction's execution time exceeded
InnoDB's innodb_lock_wait_timeout (or in the case of a request to
stop the SQL thread).
For the problem with "START SLAVE UNTIL MASTER...", the SQL thread
uses the Relay_log_info::is_until_satisfied() function to verify if
the until condition is satisfied. In this function, when the until
condition is UNTIL_MASTER_POS, the SQL thread will use the current
transaction position if in the middle of a transaction or the current
event position if not in the middle of a transaction. This
verification wasn't considering an Gtid_log_event as the beginning of
a transaction.
Fix:
===
Made Relay_log_info::is_in_group() to check if the thread has a GTID
set to the current transaction. If so, it will return true, stating
that it is already in a group.
Made Relay_log_info::is_until_satisfied() to use the is_in_group()
function to verify if the SQL thread is in the middle of a transaction.
With this change, is_until_satisfied() will consider the
Gtid_log_event as part of the transaction.
[30 Sep 2014 12:27]
David Moss
Posted by developer: Reopened by merge, no new changelog entries needed. Set back to Closed.

Description: repeated stop sql_thread/start sql_thread on a slave with GTID enabled may cause it to log a transaction ignoring the corresponding gtid for that transaction received from master. Transaction from master SET @@SESSION.GTID_NEXT='master_uid:100' BEGIN; INSERT INTO t1 values(1); COMMIT; The same transaction will get logged with a different GTID on slave SET @@SESSION.GTID_NEXT='slave_uid:1' BEGIN; INSERT INTO t1 values(1); COMMIT; Events leading to this situation 1. Either a normal relay log ROTATE_EVENT or a ignored relay log ROTATE EVENT is logged right after GTID_LOG_EVENT 2. sql_thread is stopped after it executed a ROTATE_EVENT right after executing the GTID_LOG_EVENT. 3. sql_thread is started again. But since the group_relay_log_name and group_relay_log_pos are updated after seeing a ROTATE_EVENT, sql_thread starts executing at the start of BEGIN query event instead of the GTID_LOG_EVENT. The buggy code where group_relay_log coordinates are updated is int Rotate_log_event::do_update_pos(Relay_log_info *rli) { ... if (!rli->is_in_group()) rli->inc_group_relay_log_pos(pos) else rli->inc_event_relay_log_pos(); } This breaks the fundamental constraint that same transactions will have same GTID across replication clusters. How to repeat: repeatedly execute stop sql_thead. and slave sql_thread. Difficult to reproduce and need code instrumentation.. Or follow these steps to consistently reproduce 1. stop slave sql_thread; 2. set @@global.max_relay_log_size to a value such that relay log rotate happens with GTID_LOG_EVENT ending on relay_log_1 and BEGIN event ending on relay_log_2 3. start slave sql_thread until relay_log_file='relay_log_2', relay_log_pos=4 4. start slave sql_thread; after sql_thread stops 5. verify the gtid_executed set of slave. Suggested fix: rli->is_in_group() function must be made valid with GTID.