Description:
We found that a newly created cluster can, with low probability, end up with a GTID gap on the slave/DR instance,
e.g. 9cd48d9c-6e29-11ec-92d8-98039ba567ea:1-402:404-5103, where GTID 403 is missing.
The case can be reproduced with the steps below.
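To check a GTID set for this kind of gap without reading it by eye, a small helper along the following lines may be useful. This is only a sketch: the function names parse_gtid_set and find_gaps are hypothetical, and it reports only holes between intervals of the same UUID (it does not treat a set that starts above 1 as a gap).

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-402:404-5103[,uuid2:...]' into {uuid: [(start, end), ...]}."""
    result = {}
    # SHOW ... STATUS may wrap long sets across lines, so drop newlines first.
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = []
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number such as '135' is the interval 135-135.
            ranges.append((int(start), int(end or start)))
        result[uuid] = sorted(ranges)
    return result


def find_gaps(text):
    """Return {uuid: [(first_missing, last_missing), ...]} for UUIDs that have holes."""
    gaps = {}
    for uuid, ranges in parse_gtid_set(text).items():
        missing = []
        # Compare each interval with the one that follows it.
        for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
            if next_start > prev_end + 1:
                missing.append((prev_end + 1, next_start - 1))
        if missing:
            gaps[uuid] = missing
    return gaps
```

For the set shown above, find_gaps reports the single missing transaction 403 for UUID 9cd48d9c-6e29-11ec-92d8-98039ba567ea.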
How to repeat:
step 1.
Deploy a Python script containing the following:
heartbeat_sql = "replace into repl_heartbeat values (@@hostname, now(3));"
while True:
    cursor.execute(heartbeat_sql)
Then start the script. (Tip: do not add anything like "time.sleep(1)" inside the loop; the SQL must be executed as fast as possible.)
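A slightly fuller sketch of the heartbeat loop is given here for anyone who wants to run it as is. It assumes a DB-API 2.0 cursor on an autocommit connection (the connector and connection settings are not part of the original report), and the max_iterations parameter is a hypothetical safety limit added for illustration.

```python
import itertools

HEARTBEAT_SQL = "replace into repl_heartbeat values (@@hostname, now(3))"


def run_heartbeat(cursor, max_iterations=None):
    """Execute the heartbeat statement back to back, with no sleep in between.

    cursor: any DB-API 2.0 cursor; an autocommit connection is assumed so that
            every REPLACE commits (and gets a GTID) immediately.
    max_iterations: hypothetical limit; None means loop until Ctrl-C.
    Returns the number of statements executed.
    """
    executed = 0
    iterations = itertools.count() if max_iterations is None else range(max_iterations)
    try:
        for _ in iterations:
            cursor.execute(HEARTBEAT_SQL)
            executed += 1
    except KeyboardInterrupt:
        # Stopping the script by hand corresponds to "stop the script" in step 2.
        pass
    return executed
```

Calling run_heartbeat(cursor) with no limit gives the tight loop described in this step.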
step 2.
Log in to the master instance of the cluster and execute "reset master", then stop the script from step 1 and run "show master status".
If the reproduction succeeds, the master's GTID set will look like "744c6c9a-766d-11eb-9807-fa163e6a649a:1-31:135",
but GTID 135 cannot be found in the MySQL binlog. It looks as if a ghost GTID was created.
step 3.
Log in to the slave instance and finish creating the cluster: stop slave; reset master; change master to xxx; start slave; show slave status;
The output shows Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-31, without GTID 135, because the slave cannot find GTID 135 in the master's binlog.
step 4.
Write anything on the master so that its GTID keeps increasing until the GTID set becomes "744c6c9a-766d-11eb-9807-fa163e6a649a:1-140". At this point the master's GTID gap has disappeared.
On the slave/DR, however, "show slave status" shows Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-140, so the GTID gap now appears there.
Question: when the heartbeat write and "reset master" are executed at the same time, the two seem to collide and a ghost GTID is created, which later causes a GTID gap on the slave/DR. We think this is a bug in the "reset master" command.
Suggested fix:
We plan to issue "flush tables with read lock" before "reset master", to make sure that no insert/replace (such as the heartbeat) is executed at the same time. We have not tested this yet; it may work.
However, the description of "reset master" indicates that "reset master" already holds a global lock:
This variable can be read and modified in four places:
- During server startup, holding global_sid_lock.wrlock;
- By a client thread holding global_sid_lock.wrlock (doing a RESET MASTER);
- By a client thread calling MYSQL_BIN_LOG::write_gtid function (often the
group commit FLUSH stage leader).
So we are confused about why this can still happen.
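For reference, the workaround proposed above could be scripted roughly as follows. This is an untested sketch in the same spirit as the proposal: it assumes a DB-API 2.0 cursor on the master, the function name reset_master_under_read_lock is hypothetical, and whether the server accepts "reset master" while the session holds the global read lock, and whether that actually closes the race, is exactly what still needs to be verified.

```python
def reset_master_under_read_lock(cursor):
    """Run RESET MASTER while this session holds the global read lock.

    cursor: any DB-API 2.0 cursor connected to the master (assumption).
    The lock is released even if RESET MASTER raises an error.
    """
    # Block writers such as the heartbeat REPLACE before resetting.
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("RESET MASTER")
    finally:
        # Always release the global read lock so writers can continue.
        cursor.execute("UNLOCK TABLES")
```

The try/finally is there so that a failed "reset master" does not leave the master read-locked.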