Description:
https://dev.mysql.com/blog-archive/efficient-json-replication-in-mysql-8-0/ discusses how to send JSON over a binlog stream more efficiently than shipping the full JSON blob on every change. In a lot of binlog traffic between replicating servers I see increasingly large JSON "blobs" being sent in full, even when the amount of JSON data that actually changed is very small.
MySQL had a similar problem with non-JSON data in replication and solved it with the setting binlog_row_image=minimal, in addition to the original binlog_format=ROW.
The blog post REQUIRES the user to change the SQL so that the normal UPDATE statement is expressed via JSON_REPLACE().
This requires developer code changes or ORM code changes, and there is no generic solution to this problem which is transparent to the user.
So as binlog sizes keep growing, and despite the possibility of using binlog compression (which helps), the data may not be transmitted as efficiently as it could be, and that has a negative impact. Disk space is a problem: I'd rather use it for storing my data, not for storing binlogs.
How to repeat:
See description above.
Check DML changes on your system using JSON columns and check the size of binlogs even if you use:
binlog_row_image=minimal, and binlog_format=ROW
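For reference, the settings above (plus the partial-JSON option mentioned below) can be applied like this; a sketch only, assuming MySQL 8.0+ and the privileges SET PERSIST needs:

```sql
-- Log only changed columns in the row image:
SET PERSIST binlog_row_image = 'minimal';
SET PERSIST binlog_format = 'ROW';
-- Enable partial JSON updates in the binlog:
SET PERSIST binlog_row_value_options = 'PARTIAL_JSON';
```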
With large JSON blobs the ratio of data sent to data actually changed may be considered too high, despite the settings above.
Note that even if you use binlog_row_value_options=PARTIAL_JSON, not all JSON changes will be handled efficiently, only those expressed through JSON_SET(), JSON_REPLACE() or JSON_REMOVE() in the SQL.
- It may not be possible to change the app.
- It may not be possible to change the ORM.
- Multiple languages are often involved.
- The app may simply be storing a value received from an external system, so it has to write the full JSON as it does not know whether anything changed.
- So pushing the burden onto the app is not going to solve this problem efficiently.
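To illustrate the restriction: with binlog_row_value_options=PARTIAL_JSON, only the first form below is eligible for a partial binlog diff; the second, which most ORMs generate, still logs the complete blob (table and column names are illustrative):

```sql
-- Eligible for a partial binlog diff:
UPDATE t SET data = JSON_SET(data, '$.status', 'shipped')
WHERE id = 42;

-- Logs the complete new JSON value, however small the real change:
UPDATE t SET data = '{"status": "shipped", ... the whole document ...}'
WHERE id = 42;
```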
Suggested fix:
Provide an option, disabled by default (?), to handle a JSON column change as if it were wrapped in the logical equivalent of:
UPDATE t SET data = JSON_SET(JSON_REPLACE(data, ... changes here ...), ...) WHERE we_update_the_row
i.e. the server already has the code to handle this; make it possible to apply it generically to the whole JSON blob being changed.
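To sketch the kind of "diff" computation being requested, here is a minimal Python illustration (not MySQL code; the server would diff its binary JSON format, and every name here is hypothetical) that turns an old/new document pair into JSON_SET/JSON_REMOVE-style paths:

```python
def json_diff(old, new, path="$"):
    """Return (sets, removes): JSON paths the server could feed into the
    logical equivalent of JSON_SET() / JSON_REMOVE() to turn old into new.
    Non-object values (scalars, arrays) are replaced wholesale."""
    sets, removes = [], []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old:
            sub = f'{path}."{key}"'
            if key not in new:
                removes.append(sub)          # key deleted
            elif old[key] != new[key]:
                s, r = json_diff(old[key], new[key], sub)
                sets += s
                removes += r
        for key in new:
            if key not in old:
                sets.append((f'{path}."{key}"', new[key]))  # key added
    else:
        sets.append((path, new))             # leaf value replaced
    return sets, removes

old = {"user": {"name": "Ann", "tags": ["a"], "tmp": 1}, "v": 1}
new = {"user": {"name": "Ann", "tags": ["a", "b"], "tmp": 1}, "v": 2}
sets, removes = json_diff(old, new)
```

For this pair, only the two changed paths are emitted, which is exactly the property that keeps the binlog event small.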
There may be some concern over how to generate the "diff", and over how much extra time, CPU, or memory this change may trigger. So also provide JSON handling metrics, so that I can measure the latency, memory, and original vs. final binlog "diff" size impact over time with this feature turned off or on. If the impact is too high the feature can be left off; if it is not significant we can turn it on, and the metrics will tell us exactly the benefit of the change.
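Such metrics could be exposed as status variables, for example (names purely hypothetical, just to illustrate the ask):

```sql
SHOW GLOBAL STATUS LIKE 'Binlog_json_partial%';
-- Hypothetical counters this feature could maintain:
--   Binlog_json_partial_diff_time_ms    time spent computing JSON diffs
--   Binlog_json_partial_bytes_original  bytes the full blobs would have used
--   Binlog_json_partial_bytes_logged    bytes actually written as diffs
```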