Description:
A DDL transaction stays stuck in "Waiting for clone PAGE_COPY state" for a long time.
The clone progress and clone status tables show no unfinished tasks. The log shows that an earlier clone hit a corrupt page error and exited abnormally. The suspicion is that this failure left a residual task in clone_sys in the PAGE COPY state.
gdb analysis of the waiting transaction shows a clone handle object in clone_sys that was never cleaned up.
Why wasn't the clone handle object cleaned up?
The innodb_clone_end code shows that drop clone cleans up the clone handle object only if the handle is no longer referenced. Here, the waiting transaction acquired a reference to clone_hdl through Acquire_clone before it started waiting, so the handle is still referenced.
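The reference-holding behaviour described above can be illustrated with a small model. This is only a sketch under stated assumptions: ModelCloneHandle, model_acquire and model_release are hypothetical names invented for illustration and are not InnoDB code; the only thing taken from the report is that the handle is cleaned up once nothing references it.

```cpp
#include <cassert>

// Hypothetical stand-in for a clone handle slot in clone_sys (not InnoDB code).
struct ModelCloneHandle {
  int ref_count = 1;     // reference held by the clone task itself
  bool dropped = false;  // set once the handle has been cleaned up
};

// Models a DDL thread attaching to the handle before it starts waiting.
inline void model_acquire(ModelCloneHandle &hdl) { ++hdl.ref_count; }

// Models releasing one reference. The handle is cleaned up only when the
// last reference goes away; returns true if this call cleaned it up.
inline bool model_release(ModelCloneHandle &hdl) {
  assert(hdl.ref_count > 0);
  if (--hdl.ref_count == 0) {
    hdl.dropped = true;
  }
  return hdl.dropped;
}
```

In this model, if the DDL thread calls model_acquire and never returns from its wait, the release by the ending clone leaves ref_count at 1 and the handle stays in place, which matches what gdb shows.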
Why hasn't the waiting ended?
The transaction is waiting for the Clone_Snapshot state CLONE_SNAPSHOT_PAGE_COPY to end or for the snapshot to be marked aborted, but innodb_clone_end does not set Clone_Snapshot to aborted in this case.
int Clone_Snapshot::wait(Wait_type wait_type, const Clone_file_ctx *ctx,
                         bool no_wait, bool check_intr) {
  ....
    case Wait_type::STATE_END_PAGE_COPY:
      /* If clone has aborted, don't wait for state to end. */
      wait = !is_aborted() && (get_state() == CLONE_SNAPSHOT_PAGE_COPY);
      DBUG_EXECUTE_IF("clone_ddl_abort_wait_page_copy", {
        if (wait) {
          my_error(ER_INTERNAL_ERROR, MYF(0), "Simulated Clone DDL error");
          return ER_INTERNAL_ERROR;
        }
      });
      break;
  ....
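The wait condition in the excerpt can be reduced to a small predicate to show why the DDL thread never wakes up. This is a simplified model, not the InnoDB implementation: ModelSnapshot and model_must_wait are hypothetical names, and the state list is an illustrative assumption. Only the boolean expression mirrors the quoted code.

```cpp
#include <cassert>

// Hypothetical, simplified model of the snapshot fields the wait depends on.
struct ModelSnapshot {
  enum State { FILE_COPY, PAGE_COPY, REDO_COPY, DONE };
  State state = PAGE_COPY;
  bool aborted = false;
};

// Mirrors the quoted condition:
//   wait = !is_aborted() && (get_state() == CLONE_SNAPSHOT_PAGE_COPY);
inline bool model_must_wait(const ModelSnapshot &snapshot) {
  return !snapshot.aborted && snapshot.state == ModelSnapshot::PAGE_COPY;
}
```

If the clone exits while the state is still PAGE_COPY and the aborted flag is never set, model_must_wait keeps returning true, so the waiter has no exit condition.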
How to repeat:
1. The donor starts the clone process and enters the data file copy stage.
2. A DDL transaction starts to modify a table, for example by adding an index.
3. The DDL transaction modifies the table and enters the final DDL cleanup process first.
4. Then the clone process enters the PAGE COPY stage and begins to copy the data leaf
5. Meanwhile, the DDL cleanup process enters the begin DDL state and finds the clone in the PAGE COPY state. The thread displays "Waiting for clone PAGE_COPY state".
6. A clone task thread then encounters a corrupt page during the data page copy stage, which triggers an abnormal termination of the clone.
Suggested fix:
In innodb_clone_end, in addition to checking in_err, also check m_saved_error when deciding whether the snapshot needs to be set to aborted.
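A minimal sketch of the suggested decision follows. It is not a patch against InnoDB: ModelCloneEnd, model_should_abort and model_clone_end are hypothetical names, and treating m_saved_error as an integer error code where 0 means "no error" is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical model of the state visible when the clone ends.
struct ModelCloneEnd {
  bool in_err = false;        // error passed in by the caller
  int saved_error = 0;        // models m_saved_error; 0 assumed to mean none
  bool snapshot_aborted = false;
};

// Suggested rule: abort if either the caller reports an error or an
// error was saved earlier by a clone task.
inline bool model_should_abort(const ModelCloneEnd &ctx) {
  return ctx.in_err || ctx.saved_error != 0;
}

// Models the end-of-clone path marking the snapshot aborted when needed.
inline void model_clone_end(ModelCloneEnd &ctx) {
  if (model_should_abort(ctx)) {
    ctx.snapshot_aborted = true;
  }
}
```

With this rule, a clone that ends with in_err unset but with a saved corrupt page error still marks the snapshot aborted, so a waiter checking the aborted flag would stop waiting.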