Description:
The current clone implementation synchronizes InnoDB with the binary log. Other storage engines (SEs) have no way to participate in this synchronization.
How to repeat:
The current clone synchronization happens in the middle of innodb_clone_copy and involves only InnoDB and the binlog. It takes place long after the clone_begin handlerton call for the other SEs, and before their clone_copy calls.
Suggested fix:
First, observe that MySQL already has the means to synchronize the binary log with any transactional storage engine that wishes to participate: the performance_schema.log_status table. Second, observe that cross-engine synchronization is something that should be done at the plugin layer, even if InnoDB gets to choose the point at which it is done (because of its redo log archiver).
Switching to that allows removing a lot of code with no ill effect (I could do InnoDB/binlog clones without other SEs under pfs.log_status synchronization and was unable to find any drawbacks, but maybe I was not looking hard enough).
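For reference, a single log_status row already carries everything this synchronization needs, collected under one cross-engine consistency point. The column names below are the real ones; the payload is an abridged, illustrative example (values invented):

  SELECT LOCAL, STORAGE_ENGINES FROM performance_schema.log_status\G
  LOCAL:           {"gtid_executed": "...",
                    "binary_log_file": "binlog.000002",
                    "binary_log_position": 3518}
  STORAGE_ENGINES: {"InnoDB": {"LSN": 19058505, "LSN_checkpoint": 19051630}}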
Thus, a possible implementation is as follows (I will upload a hand-edited patch which will not apply but will show the relevant code changes; let me know if you want me to clean it up and upload a proper patch):
- Add a table_log_status::get_row method to allow running the log_status query from inside the server (sketched after this list).
- Handlerton: introduce a new method, set_log_stop, which provides the SE with the clone stop position information (sketched after this list).
- Plugin: introduce Ha_clone_common_cbk, which derives from Ha_clone_cbk and from which all the concrete callback classes derive: Client_Cbk, Local_Cbk & Server_Cbk.
- Ha_clone_common_cbk contains the synchronize_engines method which, to be called from the main clone thread only, performs the performance_schema.LOG_STATUS query, splits its result per SE, and calls the set_log_stop handlerton method for all SEs (sketched after this list).
- The pfs.log_status query must still be triggered by InnoDB at a certain point in clone. That means InnoDB remains a special SE in that it must always come first in the clone locator vectors. This also handles donor waits on clone restarts, as those are implemented in InnoDB-specific code. To ensure the ordering, add a helper, clone_hton.cc:make_innodb_first, which shuffles the vectors as needed (sketched after this list). Call it in all the clone-starting code paths, and add asserts that InnoDB comes first.
- Since hton_clone_copy calls into InnoDB clone copy, which at a certain point synchronizes the engines, calling back into the plugin and then into the SEs again, introduce a backup locator vector, Ha_clone_cbk::all_locators, with its setters/getters set_all_locators, reset_all_locators & get_all_locators. Use them in the re-entrant hton_clone_copy code (sketched after this list).
- On the stage transition from the page copy to the redo log copy, call synchronize_engines.
- Implement innodb_clone_set_log_stop by parsing the desired clone stop LSN out of the log_status JSON it is passed (sketched after this list).
- Make InnoDB redo log archiving stop at that LSN instead of at the end of the redo log.
- Support stopping redo log archiving in the middle of a log block (sketched after this list). The old implementation could encounter this case only at the end of the redo log, so it took that block from the log buffer in memory. This no longer works with the new implementation, where the needed block might reside on disk only. Address this by forcing the whole block containing the target LSN to disk, reading it back from there, and patching the in-memory copy to stop at the required LSN before handing it to clone to copy.
- To implement redo log block reading on a running server, adapt the crash recovery function recv_read_log_seg to work both during crash recovery and on a running server. A better implementation might repurpose some of the log archiving code instead.
- Make sure not to archive not-yet-written log: the pfs.log_status query returns the log system LSN, which is the value of the highest log reservation, not necessarily of written log. Thus, in an archiving iteration, archive only up to the lower of the log archiving stop LSN and write_lsn (sketched after this list). Set the log archiving stop LSN to LSN_MAX at clone initialization, so that clone archiving may proceed until the actual stop LSN is set.
- The above implementation results in the cloned LSN differing from the log end LSN on the cloned instance, so adjust crash recovery to work with that. The existing crash recovery throws away any crash-recovered dynamic metadata and re-reads its last snapshot from the data files, but that snapshot was not written in any consistency with the clone LSN. Fix this by no longer throwing away the crash-recovered InnoDB dynamic metadata and by not re-reading the last metadata snapshot. This also forces not re-serializing the metadata during crash recovery, as that writes redo log, which must be avoided during clone crash recovery.
- Additionally, at the start of the redo log copy, the old implementation requested that all InnoDB updates be flushed to disk immediately, to avoid clone recovery generating new redo log. This is also disabled, as it becomes redundant with the above fix. The issue is similar to Percona XtraBackup https://jira.percona.com/browse/PXB-1634
- Disable the clone.local_xa test, because the new synchronization no longer blocks on XA transactions.
- Re-record two clone-related tests in the component_keyring_file MTR suite.
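What follows are hand-written sketches of the pieces above. They are illustrative only: identifiers not named in this report are hypothetical stand-ins, and the real patch may differ.

First, exposing the LOG_STATUS row to in-server callers; the class exists in storage/perfschema/table_log_status.h, get_row is the addition:

  /* Sketch: let server-internal code run the log_status query without
     going through the SQL layer. */
  class table_log_status : public PFS_engine_table {
   public:
    /* Collect a fresh LOG_STATUS snapshot into the materialized row,
       reusing the same code path that serves SELECTs; returns 0 on
       success. */
    int get_row();
    /* ... existing members, including the materialized row ... */
  };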
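A minimal sketch of the new handlerton member (sql/handler.h); the parameter type is my assumption, the server's own Json_dom being a natural fit:

  /* Sketch: the SE receives its slice of the LOG_STATUS STORAGE_ENGINES
     JSON object and derives its clone stop position from it. */
  using set_log_stop_t = void (*)(const Json_dom &se_log_status);

  struct handlerton {
    /* ... existing members ... */
    /* nullptr if the SE does not participate in cross-engine clone
       synchronization. */
    set_log_stop_t set_log_stop;
  };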
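The common callback layer and the synchronization itself (plugin/clone); the method body is pseudocode in comments:

  /* Sketch: Ha_clone_cbk is the existing abstract callback in
     sql/handler.h; the classes below live in the clone plugin. */
  class Ha_clone_common_cbk : public Ha_clone_cbk {
   public:
    /* To be called from the main clone thread only. */
    int synchronize_engines();
  };

  class Client_Cbk : public Ha_clone_common_cbk { /* ... */ };
  class Local_Cbk : public Ha_clone_common_cbk { /* ... */ };
  class Server_Cbk : public Ha_clone_common_cbk { /* ... */ };

  int Ha_clone_common_cbk::synchronize_engines() {
    /* 1. Materialize the LOG_STATUS row via table_log_status::get_row().
       2. Split the STORAGE_ENGINES JSON object into one sub-object per
          SE name ("InnoDB", ...).
       3. For every participating SE whose handlerton has a non-null
          set_log_stop, pass it its sub-object.
       4. Return 0 on success, an error code otherwise. */
    return 0; /* sketch; the real body is the steps above */
  }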
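The locator reordering helper (clone_hton.cc); Locator is a simplified stand-in for the element type clone actually stores:

  #include <algorithm>
  #include <cassert>
  #include <vector>

  struct Locator {
    handlerton *m_hton;
    const uchar *m_loc;
    uint m_loc_len;
  };

  /* Move InnoDB to the front of the vector, preserving the relative
     order of the remaining engines. */
  void make_innodb_first(std::vector<Locator> &engines) {
    auto innodb = std::find_if(
        engines.begin(), engines.end(), [](const Locator &e) {
          return e.m_hton->db_type == DB_TYPE_INNODB;
        });
    assert(innodb != engines.end());
    /* Left-rotate [begin, innodb] so that InnoDB lands in front. */
    std::rotate(engines.begin(), innodb, std::next(innodb));
  }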
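The re-entrant copy path; signatures are simplified (the real clone_copy handlerton call also takes a task identifier):

  /* Sketch: keep the full locator set reachable from the nested
     synchronize_engines() call that InnoDB's clone_copy triggers. */
  int hton_clone_copy(THD *thd, Ha_clone_cbk *cbk,
                      std::vector<Locator> &locators) {
    cbk->set_all_locators(locators);
    int err = 0;
    for (auto &loc : locators) { /* InnoDB first, per make_innodb_first() */
      err = loc.m_hton->clone_interface.clone_copy(
          loc.m_hton, thd, loc.m_loc, loc.m_loc_len, /* task_id */ 0, cbk);
      if (err != 0) break;
    }
    cbk->reset_all_locators();
    return err;
  }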
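The InnoDB implementation of the new hook; the JSON accessors follow the server's Json_dom API from memory (treat them as pseudocode if the exact calls differ), and clone_arch_set_stop_lsn() is a hypothetical setter on the redo log archiver:

  static void innodb_clone_set_log_stop(const Json_dom &se_json) {
    /* The InnoDB slice of STORAGE_ENGINES looks like
       {"LSN": <lsn>, "LSN_checkpoint": <lsn>}; redo log archiving must
       stop exactly at "LSN". */
    const auto &obj = down_cast<const Json_object &>(se_json);
    const auto *lsn_dom = down_cast<const Json_int *>(obj.get("LSN"));
    clone_arch_set_stop_lsn(static_cast<lsn_t>(lsn_dom->value()));
  }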
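Stopping in the middle of a redo block; the log_block_* helpers, log_write_up_to and OS_FILE_LOG_BLOCK_SIZE are existing InnoDB primitives, while read_log_block_from_disk() stands for the reader built on the adapted recv_read_log_seg:

  static void clone_arch_final_block(lsn_t stop_lsn, byte *block) {
    const lsn_t block_start =
        ut_uint64_align_down(stop_lsn, OS_FILE_LOG_BLOCK_SIZE);

    /* The block containing stop_lsn may no longer be in the log buffer;
       force it to disk and read it back from there. */
    log_write_up_to(*log_sys, stop_lsn, true);
    read_log_block_from_disk(block_start, block);

    /* Patch the in-memory copy to end exactly at stop_lsn ... */
    log_block_set_data_len(block, static_cast<ulint>(stop_lsn - block_start));
    /* ... and keep the checksum consistent with the shortened length. */
    log_block_set_checksum(block, log_block_calc_checksum_crc32(block));
  }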
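Finally, the per-iteration archiving bound; m_stop_lsn is the archiver's stop LSN (LSN_MAX from clone initialization until set_log_stop runs), and log_sys->write_lsn is InnoDB's existing written-log watermark:

  lsn_t clone_arch_iteration_end() const {
    /* log_status reports the log system LSN, i.e. the highest
       reservation; bytes past write_lsn may not be in the log file or
       even fully written to the log buffer yet, so never archive past
       it. */
    return std::min(m_stop_lsn.load(), log_sys->write_lsn.load());
  }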
Note that this change breaks some of the clone-related group replication MTR tests; fixing those was outside the scope of my work.