Description:
https://dev.mysql.com/blog-archive/efficient-json-replication-in-mysql-8-0/ discusses how to send JSON over a binlog stream more efficiently than shipping the full JSON blob on every change. In a lot of binlog traffic between replicating servers I see increasingly large JSON "blobs" being sent in full, even when the amount of JSON data that actually changed is very small.
MySQL had a similar problem with non-JSON data in replication and solved it with the setting binlog_row_image=minimal, in addition to the original binlog_format=ROW.
The blog post REQUIRES the user to change the SQL so that the normal UPDATE statement is expressed via JSON_REPLACE().
This requires developer code changes or ORM code changes, and there is no generic solution to this problem which is transparent to the user.
So as binlog sizes keep growing, and despite the possibility of using binlog compression (which helps), the data may not be transmitted as efficiently as it could be, and that has a negative impact. Disk space is a problem: I'd rather use it for storing my data, not for storing binlogs.
How to repeat:
See description above.
Check DML changes on your system using JSON columns and check the size of binlogs even if you use:
binlog_row_image=minimal, and binlog_format=ROW
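For reference, the settings above (plus the partial-JSON option mentioned below) can be applied like this; a sketch only, assuming MySQL 8.0+ and the privileges SET PERSIST needs:

```sql
-- Log only changed columns in the row image:
SET PERSIST binlog_row_image = 'minimal';
SET PERSIST binlog_format = 'ROW';
-- Enable partial JSON updates in the binlog:
SET PERSIST binlog_row_value_options = 'PARTIAL_JSON';
```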
With large JSON blobs the ratio of data sent to data actually changed may be considered too high, despite the settings above.
Note that even if you use binlog_row_value_options=PARTIAL_JSON, not all JSON changes will be handled efficiently, only those expressed through JSON_SET(), JSON_REPLACE() or JSON_REMOVE() in the SQL.
- It may not be possible to change the app.
- It may not be possible to change the ORM.
- Multiple languages are often involved.
- The app may simply be storing a value received from an external system, so it has to write the full JSON as it does not know whether anything changed.
- So pushing the burden onto the app is not going to solve this problem efficiently.
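To illustrate the restriction: with binlog_row_value_options=PARTIAL_JSON, only the first form below is eligible for a partial binlog diff; the second, which most ORMs generate, still logs the complete blob (table and column names are illustrative):

```sql
-- Eligible for a partial binlog diff:
UPDATE t SET data = JSON_SET(data, '$.status', 'shipped')
WHERE id = 42;

-- Logs the complete new JSON value, however small the real change:
UPDATE t SET data = '{"status": "shipped", ... the whole document ...}'
WHERE id = 42;
```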
Suggested fix:
Provide an option, disabled by default (?), to handle a JSON column change as if it were wrapped in the logical equivalent of:
UPDATE t SET data = JSON_SET(JSON_REPLACE(data, ... changes here ...), ...) WHERE we_update_the_row
i.e. the server already has the code to handle this; make it possible to apply it generically to the whole JSON blob being changed.
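To sketch the kind of "diff" computation being requested, here is a minimal Python illustration (not MySQL code; the server would diff its binary JSON format, and every name here is hypothetical) that turns an old/new document pair into JSON_SET/JSON_REMOVE-style paths:

```python
def json_diff(old, new, path="$"):
    """Return (sets, removes): JSON paths the server could feed into the
    logical equivalent of JSON_SET() / JSON_REMOVE() to turn old into new.
    Non-object values (scalars, arrays) are replaced wholesale."""
    sets, removes = [], []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old:
            sub = f'{path}."{key}"'
            if key not in new:
                removes.append(sub)          # key deleted
            elif old[key] != new[key]:
                s, r = json_diff(old[key], new[key], sub)
                sets += s
                removes += r
        for key in new:
            if key not in old:
                sets.append((f'{path}."{key}"', new[key]))  # key added
    else:
        sets.append((path, new))             # leaf value replaced
    return sets, removes

old = {"user": {"name": "Ann", "tags": ["a"], "tmp": 1}, "v": 1}
new = {"user": {"name": "Ann", "tags": ["a", "b"], "tmp": 1}, "v": 2}
sets, removes = json_diff(old, new)
```

For this pair, only the two changed paths are emitted, which is exactly the property that keeps the binlog event small.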
There may be some concern over how to generate the "diff", and over how much extra time, CPU, or memory this change may trigger. So also provide JSON handling metrics, so that I can measure the latency, memory, and original vs. final binlog "diff" size impact over time with this feature turned off or on. If the impact is too high the feature can be left off; if it is not significant we can turn it on, and the metrics will tell us exactly the benefit of the change.
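Such metrics could be exposed as status variables, for example (names purely hypothetical, just to illustrate the ask):

```sql
SHOW GLOBAL STATUS LIKE 'Binlog_json_partial%';
-- Hypothetical counters this feature could maintain:
--   Binlog_json_partial_diff_time_ms    time spent computing JSON diffs
--   Binlog_json_partial_bytes_original  bytes the full blobs would have used
--   Binlog_json_partial_bytes_logged    bytes actually written as diffs
```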