Bug #108876 CLONE operation results in mysqld server hang on startup
Submitted: 25 Oct 2022 23:00 Modified: 27 Oct 2022 6:43
Reporter: Andrew Ernst Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:8.0.28 OS:CentOS ( 7.9.2009)
Assigned to: MySQL Verification Team CPU Architecture:x86

[25 Oct 2022 23:00] Andrew Ernst
Description:
Performing a CLONE INSTANCE FROM, or joining an empty member into a group replication cluster, I'm experiencing what looks like success until the MySQL service is restarted.

Once restarted, all of the CLONE log rolling forward operations occur, and we reach the step where all ib_logfile* files are created.  Then, all disk I/O activity stops in my data directory, and mysqld appears idle.  An strace on the process shows `futex(0x7f3ac5b179d0, FUTEX_WAIT, 30683, NULL` (though that's likely not a reliable indicator of inactivity)

I'll upload the output of the mysqld.log after the clone process completes and mysqld is started for the first time.

Killing the process will ultimately result in an unstartable mysqld service:

2022-10-25T22:35:42.976290Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.28) starting as process 32386
2022-10-25T22:35:42.981377Z 0 [Note] [MY-012366] [InnoDB] Using Linux native AIO
2022-10-25T22:35:42.981529Z 0 [Warning] [MY-012358] [InnoDB] Option innodb_dedicated_server is ignored for innodb_buffer_pool_size because innodb_buffer_pool_size=87882203136 is specified explicitly.
2022-10-25T22:35:42.982275Z 0 [Note] [MY-010747] [Server] Plugin 'FEDERATED' is disabled.
2022-10-25T22:35:42.983695Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-10-25T22:35:42.983752Z 1 [Note] [MY-013546] [InnoDB] Atomic write enabled
2022-10-25T22:35:42.983798Z 1 [Note] [MY-012932] [InnoDB] PUNCH HOLE support available
2022-10-25T22:35:42.983827Z 1 [Note] [MY-012944] [InnoDB] Uses event mutexes
2022-10-25T22:35:42.983852Z 1 [Note] [MY-012945] [InnoDB] GCC builtin __atomic_thread_fence() is used for memory barrier
2022-10-25T22:35:42.983872Z 1 [Note] [MY-012948] [InnoDB] Compressed tables use zlib 1.2.11
2022-10-25T22:35:42.985689Z 1 [Note] [MY-013251] [InnoDB] Number of pools: 1
2022-10-25T22:35:42.985804Z 1 [Note] [MY-012951] [InnoDB] Using hardware accelerated crc32 and polynomial multiplication.
2022-10-25T22:35:42.986233Z 1 [Note] [MY-012203] [InnoDB] Directories to scan './'
2022-10-25T22:35:42.986372Z 1 [ERROR] [MY-011976] [InnoDB] [FATAL] Clone File Roll Forward: Invalid File State: 0
2022-10-25T22:35:42.986397Z 1 [ERROR] [MY-013183] [InnoDB] Assertion failure: clone0api.cc:1090:ib::fatal triggered thread 140632836507392

How to repeat:
In my case, this is just one instance out of a fleet of multiple Group Replication clusters (all running 8.0.28).

I haven't been able to repeat this in any other clusters, but my process has been either:

Method 1: On an existing MySQL server in a GR topology
======================================================
> stop group_replication;
> reset master;
> start group_replication; -- triggers clone operation

Method 2: On an empty server
=============================
> CLONE INSTANCE FROM 'replicationuser'@'donorhost':3306 IDENTIFIED BY '[redacted-password]';

In both cases I receive the same result.
[26 Oct 2022 18:04] MySQL Verification Team
Hi,

I cannot reproduce this with 8.0.31. Can you retry this with 8.0.31. 

Thanks
[27 Oct 2022 6:43] Andrew Ernst
I can attempt to reproduce on 8.0.31, though the instances I experienced this on are currently locked on 8.0.28 because we leverage Xtrabackup from Percona and can't upgrade to 8.0.30+ yet.
[27 Oct 2022 17:22] MySQL Verification Team
Hi,

If you are using binary from percona that might be the issue, as I cannot reproduce this with our 8.0.28 neither but I cannot confirm what the issue is.

Please try to reproduce with 8.0.31 when you can