Bug #109447 Improve mysql-shell support when loading / dumping large tables and docs
Submitted: 21 Dec 2022 9:09
Modified: 21 Dec 2022 12:10
Reporter: Simon Mudd (OCA)
Status: Verified
Category: Shell Dump & Load
Severity: S4 (Feature request)
Version: 8.0.31
OS: Any
Assigned to:
CPU Architecture: Any
Tags: performance, shell, windmill

[21 Dec 2022 9:09] Simon Mudd
Description:
I want to dump and load a single table consistently without affecting the server it's running on (the table can be "stopped" / "not accessed" for the period this is happening). The table may be very large, e.g. 1 TB, and I want to do this as quickly as possible.

One reason for doing this: I have deleted a lot of data but the .ibd file is still large, and there is insufficient space to do an in-place ALTER TABLE xxx ENGINE=INNODB, which would rebuild the table.

This sort of problem comes up from time to time and requires working around space issues. Other use cases for dumping / loading large tables may also apply.

The documentation provides some basic examples but does not discuss how to optimise this process for larger tables where speed may be important, or how to maintain consistency on the server(s) where this is happening.

How to repeat:
See above.

Suggested fix:
I would like to see better documentation:
* describe cases like the one mentioned and how to solve it
* take into account GTID consistency concerns if the table is dumped & loaded back to the same system.
* discuss the potential space savings the compression options may provide. ZSTD seems to allow a significant space saving vs the original .ibd file.
* document that when loading you may need to turn binary logging off, e.g. sessionInitSQL: ["SET SESSION sql_log_bin=0;"] (see the sketch after this list)
* for consistency: mention that replication may need to be stopped, or provide a mechanism to take a consistent dump and record the GTIDs if needed.
* compare load times with the shell vs dumping with mysqldump and loading the SQL directly. Many people are unaware of how much time/space you can save here.
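
For the binlogging point above, a minimal sketch of what I mean (assuming the dump already exists in a directory called "out"; names are placeholders, and the sessionInitSQL option needs a recent shell version):

  // Sketch only: load a dump back into the same server without writing
  // the loaded rows to the binary log. "out" is a placeholder directory.
  util.loadDump("out", {
    threads: 8,                                     // parallel load threads
    sessionInitSQL: ["SET SESSION sql_log_bin=0;"]  // run in each load session
  });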

I would like to see better functionality:

* potentially record the gtid_executed value when dumping, as otherwise it's impossible to know the "state" of the dump and when it happened (in terms of gtid_executed)
* current usage with util.exportTable() and util.importTable() does not allow the dump to be split into "chunks" (e.g. multiple files) written by multiple threads, so you become limited by the single-threaded performance of the shell or of the MySQL server dumping the data. Consider ways to allow the dump and load processes to be chunked and run in parallel to improve performance (the current single-file usage is sketched below).
* It was indicated to me that I could use: sessionInitSQL: ["SET SESSION innodb_ddl_threads=8;"]. Does this apply when loading the data into MySQL? It won't help for dumping data.

These are a few things that look useful to me.
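
For reference, the current single-file flow mentioned above looks roughly like this (a sketch; schema, table and file names are placeholders):

  // Dump one table to a single compressed file (written by one thread).
  util.exportTable("mydb.big_table", "big_table.tsv.zst", {
    compression: "zstd"
  });

  // Load it back; with a compressed input file this also runs single-threaded.
  util.importTable("big_table.tsv.zst", {
    schema: "mydb",
    table: "big_table"
  });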
[21 Dec 2022 12:10] MySQL Verification Team
Hello Simon,

Thank you for the feature request!

regards,
Umesh
[21 Dec 2022 13:12] Pawel Andruszkiewicz
Posted by developer:
 
Regarding the functionality requests: util.exportTable() is designed to dump into a single output file using a single thread; util.importTable() is able to load a single file using multiple threads, but only if the input file is not compressed.

A single table can be dumped into multiple files in parallel using e.g. `util.dumpTables('s', ['t'], 'out')`, and then loaded (in parallel) using `util.loadDump('out')`. All the util.dump*() utilities store the value of the gtid_executed server variable in the `@.json` metadata file, which can later be restored automatically using the `updateGtidSet` load option.
A single table can be dumped into multiple files in parallel using i.e. `util.dumpTables('s', ['t'], 'out')`, and then loaded (in parallel) using `util.loadDump('out')`. All the util.dump*() utilities are storing the value of gtid_executed server variable in the `@.json` metadata file, which can be later restored automatically using the `updateGtidSet` load option.