Bug #92836 Allow cleanup of gtid_executed, gtid_purged on a master, live
Submitted: 18 Oct 2018 7:59 Modified: 11 Jul 2022 10:33
Reporter: Shlomi Noach (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S4 (Feature request)
Version:5.7, 8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: GTID, replication

[18 Oct 2018 7:59] Shlomi Noach
Description:
gtid_executed + gtid_purged is an ever growing set to the point of bloating. While gtid_executed can be cleared on a replica via RESET MASTER, we pay the price by bloating gtid_purged. On a running master, there is no way to clean up the two. 

To emphasize what "bloat" means, this is an actual gtid_executed in a semi-production replication cluster (production environment, does not serve production traffic):

0025f4dd-7e8c-11e8-b6c9-008cfa5440e4:1-19242,
02dc9383-81be-11e8-a260-008cfa542442:1-5430702,
02ffa8d5-bbde-11e8-8e44-008cfa5440e4:1-58216,
04185125-c7b2-11e8-bc02-008cfa5440e4:1-63916,
04772ac2-93d5-11e8-aa4d-008cfa5440e4:1-54287,
...
dcda46d1-af55-11e8-aaa4-008cfa544064:1-69818,
dd720ef9-bf06-11e8-a88c-008cfa544064:1-6412662,
dde695fa-802f-11e8-be08-008cfa544064:1-63516,
de0f5d84-c482-11e8-8644-008cfa544064:1-71472,
de315cd9-d1e2-11e8-b10a-008cfa542442:1-2188,
de358dcf-ba4f-11e8-a4e8-008cfa544064:1-70939,
de983268-b598-11e8-a97f-008cfa544064:1-72787,
ded794e2-85ab-11e8-9b4c-008cfa5440e4:1-5310105,
e0a3502b-8f09-11e8-acb1-008cfa5440e4:1-643447,
e0d4ec6a-b0e1-11e8-aa5a-008cfa544064:1-71876,
e17bfcb9-b65d-11e8-b0b1-008cfa542442:1-71357,
e1a5fb6c-80f4-11e8-a561-008cfa5440e4:1-5510403,
e1c19e2e-bb14-11e8-97e9-008cfa5440e4:1-65282,
e1ec256a-8737-11e8-920e-008cfa544064:1-9417,
e1f7afac-7e8e-11e8-82fd-008cfa544064:1-64277,
e28f7a9f-c2f8-11e8-9c7c-008cfa5440e4:1-68107,
e312c2f6-c3bd-11e8-a556-008cfa542442:1-75526,
e35ef5a2-a773-11e8-bb5e-008cfa544064:1-58106,
e3bbc392-be41-11e8-b4ce-008cfa542442:1-6581231,
e3d06800-b40c-11e8-9fcf-008cfa542442:1-75904,
e403135f-88cc-11e8-957d-008cfa544064:1-39000,
e5e8e412-b4d3-11e8-96f6-008cfa544064:1-6442128,
e6674fd1-b01c-11e8-87b5-008cfa544064:1-6343336,
e66f6261-8a60-11e8-9232-008cfa544064:1-66867,
e67f443e-d11d-11e8-8d42-008cfa5440e4:1-65974,
e6a1d302-c549-11e8-bbc8-008cfa544064:1-68866,
e6c42d8f-8345-11e8-899b-008cfa544064:1-64489,
e71e9ab3-955c-11e8-9853-008cfa542442:1-63898940,
e73c3486-cc66-11e8-9a2d-008cfa544064:1-20455,
e7a40c37-802a-11e8-b8a7-008cfa5440e4:1-66008,
e89247a8-8b29-11e8-8bd9-008cfa5440e4:1-48384,
e8d63214-c7af-11e8-98ad-008cfa542442:1-70420,
e925a59c-a838-11e8-804a-008cfa544064:1-73481,
e9f19b55-bbdb-11e8-9eff-008cfa544064:1-72214,
ea016138-ac2a-11e8-8ffc-008cfa542442:1-70463,
eab84da9-81bb-11e8-901f-008cfa5440e4:1-64169,
ebea70a2-8280-11e8-baab-008cfa542442:1-62500,
ec088b38-adb6-11e8-bd96-008cfa542442:1-75788,
ec33ed21-a69a-11e8-806a-008cfa544064:1-54975,
ecb19710-ca00-11e8-9ddb-008cfa542442:1-72558,
ecdc7efa-93d2-11e8-bc91-008cfa544064:1-6453202,
ed0ea64c-b98a-11e8-b3b4-008cfa542442:1-6166922,
ed9079fc-b26d-11e8-a09f-008cfa5440e4:1-67948,
eec2f336-bfcd-11e8-9598-008cfa5440e4:1-69904,
f017ac88-c092-11e8-8d1a-008cfa542442:1-69700,
f04eaa36-ab65-11e8-96b7-008cfa542442:1-74351,
f0ea060e-a8ff-11e8-a722-008cfa5440e4:1-60803,
f0fd1daa-80f6-11e8-96e1-008cfa544064:1-66326,
f18f4e3f-b65f-11e8-ab64-008cfa5440e4:1-70044,
f245be9a-d039-11e8-bbde-008cfa542442:1-69809,
f25b83f2-c876-11e8-8b4d-008cfa5440e4:1-68992,
f2bc40f0-8672-11e8-a529-008cfa5440e4:1-9356,
f41a0596-a83a-11e8-8b36-008cfa542442:1-64949,
f42f53b7-c484-11e8-a66b-008cfa542442:1-72794,
f4927af1-a6ae-11e8-aa15-008cfa542442:1-249137,
f4b24428-c3bf-11e8-89f2-008cfa5440e4:1-6173019,
f4d1f6c1-cc68-11e8-a145-008cfa542442:1-1954220,
f50652f5-c93b-11e8-aee6-008cfa5440e4:1-65821,
f539e7cd-bf08-11e8-a732-008cfa5440e4:1-71385,
f59c119b-d11f-11e8-9927-008cfa544064:1-69835,
f7af77a9-ba51-11e8-9996-008cfa5440e4:1-6159825,
f7b7e287-a923-11e8-8893-008cfa542442:1-58453,
f823eb7e-bb16-11e8-8c90-008cfa542442:1-72813,
f82a6c91-b59a-11e8-a16f-008cfa5440e4:1-66282,
f83fc50d-8e3b-11e8-a3f8-008cfa542442:1-45447,
f922b2cf-c2fa-11e8-bf3b-008cfa542442:1-55629,
f9898d0b-c159-11e8-b974-008cfa542442:1-68487,
fa2c987d-ac2c-11e8-a9a1-008cfa544064:1-74559,
fa77ec3c-b1a8-11e8-b6ca-008cfa544064:1-70835,
faeced35-8739-11e8-9771-008cfa5440e4:1-9344,
fafde344-b0e3-11e8-bac0-008cfa5440e4:1-70828,
fb2d3df3-8a64-11e8-99a9-008cfa5440e4:1-62134,
fb6d7cc7-8282-11e8-8f48-008cfa544064:1-68014,
fced661b-5cf0-11e8-9b43-008cfa544064:1-34661946,
fd11ec6e-acf1-11e8-992c-008cfa5440e4:1-73512,
fd2a8a29-bca2-11e8-8406-008cfa544064:1-73243,
fe156d76-be43-11e8-958b-008cfa544064:1-72716,
fe804eb5-8fde-11e8-a2f4-008cfa5440e4:1-1059033,
fe8f6bb1-ca02-11e8-a059-008cfa544064:1-73859,
febed7be-c610-11e8-b337-008cfa544064:1-71551,
fef321ba-b4d5-11e8-9ed1-008cfa542442:1-72082,
ffbe008f-85ad-11e8-8f11-008cfa542442:1-62782,
ffefe288-b01e-11e8-b852-008cfa5440e4:1-70914,
fff4e4f9-8347-11e8-a135-008cfa542442:1-67163

There is total 561 lines and I've in fact had to remove some so as to be able to submit this report.

gtid_purged is similarly bloated.

Operations such as SHOW MASTER STATUS, SHOW SLAVE STATUS yield with hundreds of lines of text. We suspect there may be a scenario where gtid_executed overflows some text capacity.

The vast majority of UUID entries in the above belong to servers no longer in existence. Servers that used to be masters, failed over and got reprovisioned. This kind of operation is increasingly popular with kubernetes deployments and I suspect other users are similarly impacted, and certainly more will come and with greater urgency.

How to repeat:
Keep failing over your master, reprovision it with a new UUID, and gitd_executed will grow.

Suggested fix:
Requesting to add functionality to allow a DBA to prune gtid_executed and gtid_purged.

One suggestion by Mark Leith is to provide a new function in the form:

GTID_FORGET("<gtid_set>")

Such that I could `SELECT GTID_FORGET("fafde344-b0e3-11e8-bac0-008cfa5440e4:1-70828")`
Return value could be 0/1 for "found and deleted" and "not found so did nothing".

Alternatively, a GTID_FORGET_UUID('<uuid>'), such that  `SELECT GTID_FORGET("fafde344-b0e3-11e8-bac0-008cfa5440e4")` would delete any entries for given UUID and range need not be provided.

It will be the user's responsibility and risk to only purge actual stale information, such as UUIDs of servers no longer in existence, or ranges known to have been purged by all servers.
[18 Oct 2018 8:01] Shlomi Noach
actual list of gtid_executed in semi-production environment

Attachment: gtid_executed.txt (text/plain), 25.37 KiB.

[22 Oct 2018 5:52] MySQL Verification Team
Hello Shlomi,

Thank you for the feature request!

regards,
Umesh
[22 Oct 2018 7:40] Sven Sandberg
Thanks for the bug report. This makes sense and is getting more important over time (since your sets are growing).

The use case for having an operation that removes a whole UUID (with all the ranges) is clear:
- After a server has been decommissioned, all the ranges become unneeded
  at once.
- By removing the full set of ranges for a UUID, it reduces the number of lines
  of @@global.gtid_executed and @@global.gtid_purged.

But what is the use case for removing only some of the ranges for a UUID, and keep others having the same UUID?
[6 Nov 2018 13:25] Shlomi Noach
> But what is the use case for removing only some of the ranges for a UUID, and keep others having the same UUID?

The use case is a bit different, and perhaps justifies a different bug report: to be able to remove an errant transaction on a replica without removing the binary logs.

To elaborate, an accidental query on a replica will generate a GTID transaction. The user can verify that that transaction was non offensive to data (e.g. `flush error logs`). But the replica is now "contaminated". Today there are two options: generate an empty transaction on the master with same GTID, which now makes the GTID set on the master even more complex, or to `RESET MASTER` on the replica and update `gtid_executed` and `gtid_purged`. But now we lose the binary logs. This is a problem if the replica is an intermediate master, or if it's a backup server, or some app is tailing its binary logs.

If we can just "forget" that GTID transaction that would simplify the problem.

Again, this is a different use case. I think if the solution to dropping GTID entries happens to be different for "complete UUID drop" and for "partial GTID set drop" then it's worth separating the problem into two bugs.
[21 Nov 2018 13:04] Sven Sandberg
Thanks for the clarification, Shlomi! Yes, the solutions would be quite different. Would you mind filing a new bug for removing errant transactions?

FWIW, the difference is that servers that the UUID of decommissioned servers is not in use any more, which probably makes it easier to remove the UUID (although it is far from trivial, since many parts of the server assume that gtid_executed is monotonically increasing, including e.g. the GTID auto-position protocol).

OTOH, the originating server of an errant transaction may still exist in the topology. Therefore, removing the GTID poses several hard problems:

 1. Once the removal gets applied on the server that created the errant transaction, that GTID would be removed from gtid_executed. That makes it 'free', so next transaction that commits on that server will be assigned the removed GTID. Then we have a situation where the binary log contains two transactions with the same GTID, the first of which is supposed to be removed. Then it's probably hard to distinguish those two transactions.

 2. It will be possible to remove the GTID from one node before it has been replicated to all other nodes. Then the transaction and the removal of the transaction are going to perform a race: If the transaction propagates fast, it may reach all nodes in the topology before the removal. But it is also possible that some node applies the removal before it propagates the transaction downstream the topology, in which case the removal will prevent the propagation. So there seems to be inherent nondeterminism / race-conditions in this feature.

#1 probably means that it would be more sane to replace the transaction by an empty transaction, than to completely remove its GTID.

I don't know if there is a solution to #2. Maybe there is no way to avoid a race. If that is the case, the feature can be quite hard to use correctly, and we would probably have to at least provide ways to monitor and track which transactions were removed and not.
[22 Dec 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[12 Apr 2019 12:30] Simon Mudd
I also bumped into this and on several systems have gradually increasing GTID_EXECUTED values.

Having a way to purge "old" <uuid>:<range>values would be good, but I think this needs to be handled in a way that slaves connecting to masters and requesting an old position should see the older (larger) view and slaves requesting after the "forget event" should only see the newer (smaller) view of gtid_executed values.

Not doing this makes the management of a system with multi-tiered replication and delayed slaves much more problematic.
[25 Feb 2020 0:14] Joshua Varner
It's not clear to me what feedback this bug is waiting on. Can it be reopened?
[13 Jul 2020 5:16] MySQL Verification Team
Bug #100190 is set to be a duplicate of this bug
[13 Jul 2020 5:36] Shlomi Noach
Per comment in https://bugs.mysql.com/bug.php?id=100190, this issue is begin triaged. Thank you. Can you please clarify:

- Where should we continue the discussion?
- Whether the solution is going to be binlog event based (desired) or other (undesired and unuseful). I have no visibility into your thought process and whether suggested solutions are considered. What I fear is, that with lack of discussion, the team will come up with a solution that does not solve the problem in real production systems, so visibility is appreciated.

Thank you!
[13 Jul 2020 7:43] Simon Mudd
See also:
* https://bugs.mysql.com/bug.php?id=84994
[16 Nov 2021 7:59] Hugo Dubois
Any news about this feature request ? I have the same issue and a GTID_FORGET function would be very useful to remove old GTID uuid.
[11 Jul 2022 10:33] Shlomi Noach
Very curious to hear any thoughts, almost 4 years have passed; are engineers still working on this or is this stale? (I might want to contribute a fix if no one is working on this).