MySQL Bugs: #115613: FLUSH PRIVILEGES on binlogless replica causes gtid

Bug #115613	FLUSH PRIVILEGES on binlogless replica causes gtid_executed gap
Submitted:	16 Jul 2024 19:33	Modified:	17 Jul 2024 7:38
Reporter:	Venkatesh Duggirala	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	8.0.18,8.0.37, 8.0.41	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	replication

Description:
"FLUSH PRIVILEGES" command executed on primary can cause gaps on gtid_executed table on binlogless replicas.

If such a replica is restarted, it thinks that those transactions were missed it asks primary to send them again. 

case 1) If primary has those transactions (not purged yet), it will send them again and replica will apply those transactions again which can cause errors (like duplicate error if it is a DML on a table)

case 2) If primary already purged those transactions, replication will break with an error "Cannot replicate because the source purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new replica from backup".

In both the cases, it is an issue to replication.

 

How to repeat:
I was able to reproduce the issue from 8.0.18 to all the way to 8.0.37 version (did not check 8.0.38 yet).

1) Start primary and replica (start replica with binlog disabled, relay log recovery enabled)
2) Execute "FLUSH PRIVILEGES" multiple times + parallel executed some DDL/DML queries.
3) Notice there is a gap in replica's gtid_executed table (gtid_executed global variable does not contain any gap, but gtid_executed table contains gaps). 
4) Also notice there will be warnings in the mysqld.2.err 
"The transaction owned GTID is already in the gtid_executed table, which is caused by an explicit modifying from user client.
5) After the transactions on primary are fully replicated (verify with gtid_executed global variable if needed), rotate the binary log and purge old binary logs on primary. 
6)Kill and restart replica server
7)start replication 
8) Notice that replication broke with error "Cannot replicate because the source purged required binary logs"

To make the repeat steps easy, following are the required MTR test script files :

Step 0 :  To make the MTR test fail all the time, please add a sleep in gtid_end_transaction like below 

In sql/binlog.cc file: 

```
int MYSQL_BIN_LOG::gtid_end_transaction(THD *thd) {
  DBUG_TRACE;
   
  DBUG_PRINT("info", ("query=%s", thd->query().str));
   
  if (thd->owned_gtid.sidno > 0) {
    assert(thd->variables.gtid_next.type == ASSIGNED_GTID);
    my_sleep(10000000);  // ====> This sleep here helps reproduce it easily 
```

 
Step 1) Place the following opt file in mysql-test/suite/rpl_gtid/t/gtid_gap_bug-slave.opt 

--relay_log_recovery=ON
--disable-log-bin
--disable-log-replica-updates

Step 2) Place the following test file in mysql-test/suite/rpl_gtid/t/gtid_gap_bug.test

===============
--source include/have_debug.inc
--source include/have_binlog_format_row.inc

--let $rpl_skip_start_slave = 1
--let $rpl_gtid_utils = 1
--source include/master-slave.inc

# Starts slave with 8 worker threads
--let $mts_replica_parallel_workers= 8
--let $mts_spco_save_gtid_executed = 1
--let $mts_spco_check_gtid_executed_before_rollback = 0

# Set Up
--source include/rpl_connection_slave.inc

CALL mtr.add_suppression("You need to use --log-bin to make --binlog-format work");
CALL mtr.add_suppression("The transaction owned GTID is already");
CALL mtr.add_suppression("Recovery from master pos");
--let $save_replica_parallel_workers=`select @@global.replica_parallel_workers`
--let $save_replica_parallel_type= `select @@global.replica_parallel_type`
--let $save_replica_preserve_commit_order= `select @@global.replica_preserve_commit_order`
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
--eval SET GLOBAL replica_parallel_workers= $mts_replica_parallel_workers
SET GLOBAL replica_preserve_commit_order= ON;
--source include/start_slave.inc

--source include/rpl_connection_master1.inc

SET @save_binlog_transaction_dependency_tracking= @@global.binlog_transaction_dependency_tracking;
SET @@global.binlog_transaction_dependency_tracking = COMMIT_ORDER;

# Actual Test , do some DMLs along with flush privileges
--let $i= 0
while ($i < 10)
{
--eval CREATE TABLE t$i(i int);
--inc $i
}
--let $i= 0
while ($i < 10)
{
flush privileges;
--eval insert into t$i values (1);
--eval update t$i set i=$i
--eval DROP TABLE t$i
flush privileges;
--inc $i
}

--source include/rpl_connection_master1.inc
--let $SERVER_UUID      = `SELECT @@global.server_uuid`
--source include/sync_slave_sql_with_master.inc

# After replica synced, purge binary logs from the primary.
--source include/rpl_connection_master1.inc
flush binary logs;
--let $binlog_file= query_get_value(SHOW MASTER STATUS, File, 1)
--replace_result $binlog_file BINLOG_FILE
--eval PURGE BINARY LOGS TO '$binlog_file'

# Restart the server to see there is a gap (but primary already purged, so start slave will fail due to bug)

--source include/rpl_connection_slave.inc 
# Kill the slave server
--source include/kill_mysqld.inc
# Restart the slave server
--let $rpl_server_number= 2
--source include/rpl_start_server.inc
# This will fail
--source include/start_slave.inc

--echo # Cleanup
--echo
--source include/rpl_connection_master1.inc
SET GLOBAL binlog_transaction_dependency_tracking=@save_binlog_transaction_dependency_tracking;

--source include/rpl_connection_slave.inc
--source include/stop_slave.inc
--eval SET GLOBAL replica_parallel_type=$save_replica_parallel_type;
--disable_warnings
--eval SET GLOBAL replica_parallel_workers=$save_replica_parallel_workers;
--enable_warnings
--eval SET GLOBAL replica_preserve_commit_order=$save_replica_preserve_commit_order;
--source include/start_slave.inc

--let $rpl_skip_sync = 1
--source include/rpl_end.inc
===================

Suggested fix:

Root cause: 
FLUSH PRIVILEGES is adding same gtid to flush gtid list twice which is causing
  
"The transaction owned GTID is already in the gtid_executed table, which is caused by an explicit modifying from user client" (which is getting ignored but flush gtid list can contain many other gtids in the set if there are parallel applier threads)

Eg: 
gtid_executed table contains 1 1 
and if we are trying to insert 1 5 => this gets error and the entire 1 to 5 is ignored and will not be inserted into gtid_executed table, hence causing the gap.

We need to make sure:
1) There should not be any gaps 
2) Since replica already executed those transactions, it should not request for those transactions again.

Suggested fix:
1) Avoid FLUSH privileges to add the same gtid to flush gtid list twice.
2) Also test if there are any other commands (other than FLUSH privileges) which can cause such issues.

Dear Venkat,

Thank you for the report and feedback.

regards,
Umesh

Using the test case above, this is still reproduceable up to 8.0.41.