MySQL Bugs: #102653: Feature enhancement for the GTID_NEXT variable to support '<uuid>:AUTOMATIC'

Bug #102653	Feature enhancement for the GTID_NEXT variable to support '<uuid>:AUTOMATIC'
Submitted:	18 Feb 2021 19:28	Modified:	22 Feb 2021 14:33
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Delay, Injection, Monitoring, replication

Description:
https://dev.mysql.com/doc/refman/8.0/en/replication-options-gtids.html#sysvar_gtid_nexthtt... is the reference to GTID_NEXT.

Normally this value is not touched by users and each new transaction if GTID is enabled generates a new GTID based on the server or replication group's UUID.

People who manage clusters of MySQL servers often want to measure replication delay and be able to determine how far behind the last "insert" replication is. This is usually done by injecting a transaction/event often with a timestamp periodically and then reading on a replica the latest value received from which "replication delay" can be measure from the upstream master.

There are various techniques for doing this including injecting externally or even using the event_scheduler to inject an event on the master.

All this works fine until the master needs to change and there's the possibility that a split brain may occur. How do we detect this:  by checking GTID_EXECUTED on all servers. If there's a difference it could be due to real data missing from the  master but you can't be sure as data is injected via these heartbeat events and they can't easily be excluded.

One possible way to simplify this would be to inject such events using a "custom UUID" known not to be in the GTID_EXECUTED set, and which is different to any master's UUID.

It is possible to inject specific GTIDS by setting GTID_NEXT but if you want to uses a specific value you must provide both the UUID and also the ID value.  To know the next <id> value right now you must do a SELECT to calculate the next one.  That adds extra complexity.

How to repeat:
See above.

Suggested fix:
My proposed solution would be to form an additional format for GTID_NEXT which would be '<UUID>:AUTOMATIC' where <UUID> is a UUID value, chosen by the user.

Once this value is set any subsequent values would generate GTID values based on the next version after the latest <UUID>:<RANGE> value or if no GTID exists for <UUID> then to provide the <UUID>:1.

Other than that all logic should be identical.

External heartbeat/event generators would then be able to basically write code which does something like:

set time_zone = '+00:00';
SET GTID_NEXT = '00000000-1111-2222-3333-444444444444:AUTOMATIC';
INSERT INTO heartbeat_table(last_update, probe_hostname) VALUES (NOW(6), 'probe_hostname') ON DUPLICATE KEY UPDATE ....;
-- delay 1 second
INSERT INTO heartbeat_table(last_update, probe_hostname) VALUES (NOW(6), 'probe_hostname') ON DUPLICATE KEY UPDATE ....;
-- delay 1 second
INSERT ....

Where the table would probably have unique key on probe_hostname and an index on last_update to ensure that it's easy to find the time since the last update even if multiple probes are being used.

Code used by the client can be site specific but when checking or comparing GTID_EXECUTED on different servers you simply exclude any values of '00000000-1111-2222-3333-444444444444:...' and do not need to worry about "injected events" requiring you to double check events in the binlog.

A clarification. I wrote:

"... to inject such events using a "custom UUID" known not to be in the GTID_EXECUTED set ..."

Here I mean it would be known not to be in the GTID set of normal, regular transactions in the database.  MySQL doesn't know this but the user does so can ignore these GTID values as being "irrelevant" when comparing GTID_EXECUTED between servers.

Hello Simon,

Thank you for the feature request!

regards,
Umesh

Posted by developer:

Thank you for the feature request!

It is certainly important to measure lag, and of course measuring lag should not interfere with the GTID state of servers. However, it is not clear to me why the existing lag measuring mechanisms cannot be used. Also, I did not understand what the observed defect is when using heartbeat transactions without a dedicated UUID. So, I would appreciate help understanding the following points:

- When talking about "cluster", do you specifically mean one GR group (as in InnoDB), or more generally any replication topology?

- Note that the server stores a timestamp with every transaction committed by a client: the 'original_commit_timestamp'. This gets replicated to all downstream replicas. Replicas display the original_commit_timestamp for the last transaction committed by each worker in performance_schema.replication_applier_status_by_worker (LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP). It also displays the timestamp when the same transaction was committed on the replica. So the lag can be computed as the difference between these timestamps. So it is not clear to me why the DBA needs to commit specific heartbeat transactions.

- Even in the presence of user-generated heartbeat transactions, I did not understand what problem the GTIDs of those transactions causes that needs to be solved by having a different UUID. I would expect that the heartbeat mechanism commits transactions only on the source server, not on replicas, and that it follows the same logic to determine which server is the source server, as the usual workload follows. So if there are errant heartbeat transactions, there could equally well be errant user transactions, unless I misunderstood the setup. Is the problem that the heartbeat transactions add to the number of errant transactions, making the manual conflict resolution more complicated, or did I misunderstand the setup, or is there some other problem?

> - Note that the server stores a timestamp with every transaction committed by a client: the 'original_commit_timestamp'. This gets replicated to all downstream replicas. Replicas display the original_commit_timestamp for the last transaction committed by each worker in performance_schema.replication_applier_status_by_worker (LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP). It also displays the timestamp when the same transaction was committed on the replica. So the lag can be computed as the difference between these timestamps. So it is not clear to me why the DBA needs to commit specific heartbeat transactions.

A. original_commit_timestamp only indicates the last committed transaction.  If the transaction that is currently being applied (assume single worker) takes 10 seconds, using the original_commit_timestamp will show 10 seconds of lag. 
If there is no transaction at all, you can also have 10 second of lag because the heartbeat wasn't sent yet and no transaction arrived.

So there's no way to tell if the system is lagging or just idle (think multi-tiered setups)

B. In multi tiered replication setups, if there is replication IO lag between primary and intermediate replica, you cannot tell how much real lag there is on a third level replica with the replication original_commit_timestamp. Even if the integrated heartbeat functionality is set to a low number.

In the case of user generated heartbeats, a transaction is committed for example every second. By. looking at the heartbeat table you can see the exact lag, while a larger transaction is applying or when the system is idle (it won't be idle)

> - Even in the presence of user-generated heartbeat transactions, I did not understand what problem the GTIDs of those transactions causes that needs to be solved by having a different UUID. I would expect that the heartbeat mechanism commits transactions only on the source server, not on replicas,

Not true. In Multi-tiered replication setups, you can run heartbeats on the primary and every intermediate replica. That way you can tell the replication lag to the primary and any intermediate replica in the chain.

> and that it follows the same logic to determine which server is the source server, as the usual workload follows. So if there are errant heartbeat transactions, there could equally well be errant user transactions, unless I misunderstood the setup. Is the problem that the heartbeat transactions add to the number of errant transactions, making the manual conflict resolution more complicated, or did I misunderstand the setup, or is there some other problem?

- It makes the conflict resolution nearly impossible by looking at GTID, unless you store all binary logs and parse all binary logs and figure out if the errant transaction is a heartbeat transaction or not. Very impractical. 
- The user ends up having to just fix the errant transactions (by creating empty transactions), which goes against the whole purpose of using the gtid set to see if the transactions applied are the same.

In the case of a multi-tiered replication setup, with heartbeats on primary and every middle tier, promoting a middle-tier replica to become the main primary will fail because of the errant transactions created. 
When the heartbeats would be created as a separate UUID, the user has the ability to just ignore and skip them as they are considered 'administrative gtid events', without losing the benefit of using gtid sets to validate that each server has applied the same set of transactions

I like this feature, and this is not a unique problem, it is the same problem as we are having with the view_change_log_event in multi Group Replication setups.

> ... the existing lag measuring mechanisms cannot be used

You have only 2 measurements the master and intermediate master and in many topologies we manage across multiple datacentres the replication depth may be deeper than that, especially if we split a cluster into 2 due to database size or i/o load reasons. I've see things as deep as 6 servers. While not common often it happens.

"cluster" is perhaps a bad name but everyone uses their own and "replication chain" is a bit weird as there's no "chain" anywhere, apart from one server linked to another.

The specific setup I'm thinking about now is asynchronous replication from a master to a downstream leaf node. However I think the solution should work on GR clusters or even a mix of GR servers with asynchronous slaves which is a setup I also use and intend to rollout further.

> - Note that the server stores a timestamp with every transaction
committed by a client: the 'original_commit_timestamp'.

Yes, but what happens on a server that's not pushing out continuous changes? While I agree that's not a common occurrence on systems I manage it could happen and if it does I don't want to treat the "cluster" differently. Ideally the server itself should generate (if configured) and push out such periodic heartbeats even if user transactions are not taking place.

If you have a better implementation I'm certainly in favour of that. Handling a static "cluster" is simple but I may replace any server in my cluster quite frequently so the master may change. Automation can swap boxes around in the topology but we do have to be careful (if using the current GTID injection approach) to avoid injecting on a box that is no longer the master, or injecting on both boxes (old and new) etc... In real life things are a bit more complex.

Replication delay of > 1 second in some clusters can be troublesome. We do need to be able to measure it, and given this can be induced on the master and intermediate master or even the downstream slave applying relay logs it's good for us to be able to identify the location and the delay.

I'm sure most users don't really care how it's handled, but what we want is a reliable and easy to use mechanism to achieve this (accurately).

> I did
not understand what problem the GTIDs of those transactions causes that
needs to be solved by having a different UUID.

Then you have not come face-to-face with split-brain MySQL clusters. Again real life is more complicated than one imagines. I've seen several split brain scenarios and extra GTIDs on a box that's downstream or missing from a box that's upstream. The big problem here is short of analysing the GTIDs which you can not do quickly on the server it's hard to identify if these are "real, important transactions" or just "system (the user's system) transactions which are NOT important."

When you have such an issue after an automated failover when a server fails you need to decide and act quickly as you want your business to continue and so you need the server to be running. You may want to switch to a different server, but the main thing here is speedy diagnosis and action.

I was speaking to a colleague about this and we have even discussed if such a facility were to become available to separate heartbeat information using a specific UUID, grants using a different one, OSCs to use yet another one so we can more easily identify where the difference is and which "sub-system" (seeing the MySQL management as a whole) is affected.

Having a higher level of control over this would potentially give us more insight than is now possible.

So ideally we should not be using "user transactions" to handle this. Failovers, maintenance and other such actions means that our GTID executed sets are huge compared to what I'd expect you to have seen in smaller environments. I see 30 UUIDs in a typical setup and these keep increasing as we swap over masters more frequently now. I'm still waiting for a way to clean this up, and https://bugs.mysql.com/bug.php?id=84994 has been open for a while now. One cluster has nearly 350 UUIDs. Just seeing the GTID set makes it hard to quickly eye-ball/visualise differences due to the size.

Clearly a cluster should have consistent GTIDs, and most of the time that's the case. When it's not I want to be able to identify "normal user transactions" from anything that is triggered by those that manage the clusters.

If you have a better mechanism that would be good.

Replication_applier_status_by_worker sounds interesting but provide a sys.view so we don't have to figure out the exact invocation, we can share it and it's easy to remember: e.g. select * from sys.replication_delay.

I've seen delays on intermediate masters several times due to missing primary keys replicating as table scans per row update, so it's good to see: (i) there is a delay and (ii) where it is coming from: it's not always a problem on the master. Hence ideally I'd like to see the path from the master and the delays at each step. Cross-dc replication is slower than replication within a datacentre and while in theory the difference is small, sometimes these things can add up. Right now we can't identify very clearly what they are.

I shared my perspective, Kenny's share something similar.

Personally I'd like to see a way where the writable members are able to send a periodic heartbeat, say every 100ms to 1s, but only if they haven't received such a heartbeat themselves. This allows for multi-master setups, and if a box is read-only it wouldn't be expected to write its own heartbeat as it's just a replica.

If a heartbeat is received by replication GR or async then it should be passed downstream unless the server's own UUID is in the path. This avoids loops but does require you to have a path to avoid them. Sending such an event downstream would allow the intermediate master to add on it's own UUID and timestamp which helps track where things get locked up.

These heartbeats should not count as normal GTID events, as they are cluster management events not USER transactions.