Bug #101274 Clone operation succeeds but node fails to join GR when relay logs are present
Submitted: 22 Oct 2020 11:45 Modified: 24 Nov 2020 11:58
Reporter: Marco Tusa
Status: No Feedback
Category: MySQL Server: Group Replication Severity: S2 (Serious)
Version: 8.0.22 OS: Any
Assigned to: MySQL Verification Team CPU Architecture: Any
Tags: Clone, group replication, relay logs, replication

[22 Oct 2020 11:45] Marco Tusa
Description:
Using asynchronous replication to connect a PXC cluster (PXC 8.0.20) to a MySQL Group Replication cluster (8.0.22), where PXC is the primary and GR is the replica.

If the GR primary node fails and, when it comes back, requires a clone operation, the clone operation succeeds but the node is not able to rejoin the Group Replication cluster.
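
For context, the asynchronous link between the two clusters is a named replication channel ('dc1_to_dc2', as the relay log file names below show). A minimal sketch of how such a channel is typically set up on the GR primary follows; the PXC host name and the replication credentials are assumptions, not values from this report:

    -- On gr1 (the GR primary at the time), assuming a PXC node 'pxc1'
    -- and a replication user 'repl' created for this purpose.
    CHANGE MASTER TO
        MASTER_HOST = 'pxc1',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'repl_password',
        MASTER_AUTO_POSITION = 1
        FOR CHANNEL 'dc1_to_dc2';
    START SLAVE FOR CHANNEL 'dc1_to_dc2';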

How to repeat:
Scenario
 - PXC 8.0.20 cluster in DC1 (3 nodes)
 - Group Replication cluster in DC2 (4 nodes, MySQL 8.0.22)
 
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
    | group_replication_applier | 38809616-e149-11ea-b0cc-080027d0603e | gr1         |        3306 | ONLINE       | PRIMARY     | 8.0.22         |
    | group_replication_applier | 959cc074-e14b-11ea-b90c-080027d0603e | gr2         |        3306 | ONLINE       | SECONDARY   | 8.0.22         |
    | group_replication_applier | a56b38ec-e14b-11ea-b9ce-080027d0603e | gr3         |        3306 | ONLINE       | SECONDARY   | 8.0.22         |
    | group_replication_applier | b7039e3b-f744-11ea-854c-08002734ed50 | gr4         |        3306 | ONLINE       | SECONDARY   | 8.0.22         |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
 
 The Group Replication cluster primary 'gr1' connects as a replica to one of the nodes of the PXC cluster.
 Node gr1 fails and node gr2 becomes the new primary.
 
 I restart gr1 and force cloning with:
    SET GLOBAL group_replication_clone_threshold = 1;
 Then:
    START GROUP_REPLICATION;
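
 For reference, this is a minimal sketch of the force-clone sequence, including a preliminary check; it assumes the clone plugin is already active on both the joiner and the donors, which is a prerequisite for clone-based distributed recovery:

    -- Verify the clone plugin is active on the joiner (and on the donors).
    SELECT PLUGIN_NAME, PLUGIN_STATUS
      FROM INFORMATION_SCHEMA.PLUGINS
     WHERE PLUGIN_NAME = 'clone';

    -- Force remote cloning by lowering the threshold below the GTID gap.
    SET GLOBAL group_replication_clone_threshold = 1;

    -- Start Group Replication; distributed recovery should choose clone.
    START GROUP_REPLICATION;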
 
 The node starts cloning from node gr3:
 >SELECT * FROM performance_schema.clone_status;
+------+------+-------------+-------------------------+----------+----------+----------------+----------+---------------+-------------+-----------------+---------------+
| ID   | PID  | STATE       | BEGIN_TIME              | END_TIME | SOURCE   | DESTINATION    | ERROR_NO | ERROR_MESSAGE | BINLOG_FILE | BINLOG_POSITION | GTID_EXECUTED |
+------+------+-------------+-------------------------+----------+----------+----------------+----------+---------------+-------------+-----------------+---------------+
|    1 |   27 | In Progress | 2020-10-21 16:53:48.826 | NULL     | gr3:3306 | LOCAL INSTANCE |        0 |               |             |               0 |               |
+------+------+-------------+-------------------------+----------+----------+----------------+----------+---------------+-------------+-----------------+---------------+
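
While the clone is running, per-stage progress can also be followed; this companion query is not in the original report but uses the standard clone instrumentation:

    -- Per-stage progress of the same clone operation.
    SELECT STAGE, STATE, ESTIMATE, DATA
      FROM performance_schema.clone_progress;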

The node completes cloning:
(root@localhost) [(none)]>SELECT * FROM performance_schema.clone_status\G
*************************** 1. row ***************************
             ID: 1
            PID: 0
          STATE: Completed
     BEGIN_TIME: 2020-10-21 16:53:48.826
       END_TIME: 2020-10-21 16:59:40.747
         SOURCE: gr3:3306
    DESTINATION: LOCAL INSTANCE
       ERROR_NO: 0
  ERROR_MESSAGE: 
    BINLOG_FILE: binlog.000015
BINLOG_POSITION: 39239797
  GTID_EXECUTED: 06779870-54cb-11ea-866d-08002734ed50:1-5,
7c695844-cb3f-11ea-b252-b35802f30c19:1-378,
8f98109a-12c3-11eb-8f47-339bdc860e71:1-12810,
a1e2f3a7-54cd-11ea-90b7-37299557067c:1-200118,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-125674,
be97dbde-a7c6-11ea-822b-08002734ed50:1-5306,
caf70b4b-8885-11ea-9c80-17253fa4edc0:1-109944,
e314189e-b0bf-11ea-adc3-cb4a792f4c60:1-64950,
f9e250ad-a7c7-11ea-8ee2-0a670dbc5818:1-133047

The node is now ready to have Group Replication started.

BUT:
2020-10-21T16:00:05.688835Z 0 [System] [MY-010931] [Server] /opt/mysql_templates/mysql-8P/bin/mysqld: ready for connections. Version: '8.0.22'  socket: '/opt/mysql_instances/gr8/mysql.sock'  port: 3306  MySQL Community Server - GPL.
2020-10-21T16:00:57.326476Z 12 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2020-10-21T16:00:57.329043Z 13 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2020-10-21T16:00:57.432364Z 14 [ERROR] [MY-010544] [Repl] Failed to open the relay log '/opt/mysql_instances/gr8/logs/relay-group_replication_applier.000020' (relay_log_pos 39235826). <------------------| File is not in the receiver file system 
2020-10-21T16:00:57.432484Z 14 [ERROR] [MY-011059] [Repl] Could not find target log file mentioned in relay log info in the index file '/opt/mysql_instances/gr8/logs/relay-group_replication_applier.index' during relay log initialization.
2020-10-21T16:00:57.435654Z 14 [ERROR] [MY-011533] [Repl] Plugin group_replication reported: 'Failed to setup the group replication applier thread.'
2020-10-21T16:00:57.436117Z 12 [ERROR] [MY-011669] [Repl] Plugin group_replication reported: 'Unable to initialize the Group Replication applier module.'
2020-10-21T16:00:57.436627Z 12 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2020-10-21T16:00:57.438453Z 13 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
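
At this point the member never reaches the group. A check such as the following (not part of the original log; the expected output is my assumption) should show gr1 reporting itself as OFFLINE, since the applier module never initialized:

    -- Run on gr1 after the failed START GROUP_REPLICATION.
    SELECT MEMBER_HOST, MEMBER_STATE
      FROM performance_schema.replication_group_members;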

Check mysql.slave_relay_log_info on node gr1:
(root@localhost) [(none)]>select * from mysql.slave_relay_log_info\G
*************************** 1. row ***************************
                Number_of_lines: 12
                 Relay_log_name: /opt/mysql_instances/gr8/logs/relay-group_replication_applier.000020
                  Relay_log_pos: 39235826
                Master_log_name: 
                 Master_log_pos: 0
                      Sql_delay: 0
              Number_of_workers: 4
                             Id: 1
                   Channel_name: group_replication_applier
      Privilege_checks_username: NULL
      Privilege_checks_hostname: NULL
             Require_row_format: 1
Require_table_primary_key_check: STREAM
1 row in set (0.00 sec)

This correctly matches the value on gr3.

But the relay log files ARE NOT copied over from gr3.
Files in the logs directory on node gr1:
[root@gr1 gr8]# ll logs/
total 1536048
-rw-r----- 1 mysql mysql       171 Oct 21 11:59 binlog.000001
-rw-r----- 1 mysql mysql       152 Oct 21 12:00 binlog.000002
-rw-r----- 1 mysql mysql        88 Oct 21 12:00 binlog.index
-rw-r----- 1 mysql mysql 524288000 Oct 21 12:52 ib_logfile0
-rw-r----- 1 mysql mysql 524288000 Oct 21 11:59 ib_logfile1
-rw-r----- 1 mysql mysql 524288000 Oct 21 12:52 ib_logfile2
-rw-r----- 1 mysql mysql       171 Oct 21 11:39 relay-dc1_to_dc2.000001
-rw-r----- 1 mysql mysql        54 Oct 21 10:38 relay-dc1_to_dc2.index
-rw-r----- 1 mysql mysql       171 Oct 21 11:52 relay-group_replication_applier.000001
-rw-r----- 1 mysql mysql       217 Oct 21 11:53 relay-group_replication_applier.000002
-rw-r----- 1 mysql mysql       292 Oct 21 11:59 relay-group_replication_applier.000003
-rw-r----- 1 mysql mysql       171 Oct 21 12:00 relay-group_replication_applier.000004
-rw-r----- 1 mysql mysql       276 Oct 21 12:00 relay-group_replication_applier.index
-rw-r----- 1 mysql mysql       171 Oct 21 11:59 relay-group_replication_recovery.000001
-rw-r----- 1 mysql mysql        70 Oct 21 11:53 relay-group_replication_recovery.index

Files in the logs directory on node gr3:
[root@gr3 gr8]#  ll logs/
total 2059292
-rw-r----- 1 mysql mysql 457323188 Oct  6 09:38 binlog.000013
-rw-r----- 1 mysql mysql      1966 Oct 20 12:48 binlog.000014
-rw-r----- 1 mysql mysql  39239797 Oct 21 11:53 binlog.000015
-rw-r----- 1 mysql mysql       132 Oct 20 12:49 binlog.index
-rw-r----- 1 mysql mysql 524288000 Oct 21 12:01 ib_logfile0
-rw-r----- 1 mysql mysql 524288000 Sep 17 11:49 ib_logfile1
-rw-r----- 1 mysql mysql 524288000 Oct 21 12:01 ib_logfile2
-rw-r----- 1 mysql mysql       152 Oct 21 08:43 relay-dc1_to_dc2.000001
-rw-r----- 1 mysql mysql        54 Oct 21 08:43 relay-dc1_to_dc2.index
-rw-r----- 1 mysql mysql      3048 Oct 21 09:13 relay-group_replication_applier.000019
-rw-r----- 1 mysql mysql  39235826 Oct 21 11:53 relay-group_replication_applier.000020
-rw-r----- 1 mysql mysql       138 Oct 21 09:13 relay-group_replication_applier.index
-rw-r----- 1 mysql mysql       218 Oct 21 09:14 relay-group_replication_recovery.000001
-rw-r----- 1 mysql mysql       273 Oct 21 09:14 relay-group_replication_recovery.000002
-rw-r----- 1 mysql mysql       140 Oct 21 09:14 relay-group_replication_recovery.index

Suggested fix:
This is not a fix, but a workaround to get the node to join:
In order to have node gr1 start correctly and then join the GR cluster, I had to copy over the relay logs and clean up any reference to the other channel 'dc1_to_dc2' from the mysql schema.

delete from mysql.slave_master_info where Channel_name = 'dc1_to_dc2';
delete from mysql.slave_relay_log_info where Channel_name = 'dc1_to_dc2';

And of course remove the 'dc1_to_dc2' relay log files from the logs directory on the file system as well.
Then restart.
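
Putting the workaround together, a sketch of the full sequence (paths and host names as in this report; the RESET SLAVE alternative is untested here and is only a suggestion):

    -- 1) Copy relay-group_replication_applier.* from gr3's logs directory
    --    into /opt/mysql_instances/gr8/logs/ on gr1 (done at the OS level).
    -- 2) Remove the stale 'dc1_to_dc2' channel metadata cloned from gr3:
    DELETE FROM mysql.slave_master_info    WHERE Channel_name = 'dc1_to_dc2';
    DELETE FROM mysql.slave_relay_log_info WHERE Channel_name = 'dc1_to_dc2';
    -- (An untested alternative that also purges that channel's relay logs:
    --  RESET SLAVE ALL FOR CHANNEL 'dc1_to_dc2';)
    -- 3) Delete the relay-dc1_to_dc2.* files from gr1's logs directory.
    -- 4) Restart mysqld and run START GROUP_REPLICATION again.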
[22 Oct 2020 19:10] MySQL Verification Team
Hi Marco,

I tried reproducing this without PXC, using a single MySQL Server as the asynchronous replication source, and I did not have a problem.

Can you reproduce this with MySQL Server (without using PXC)?

Thanks
Bogdan
[23 Oct 2020 14:08] Marco Tusa
Hi Bogdan,
I will try, but I cannot do it right now because I am on another activity and this was an incident that came up while testing an architecture.
But I will include it in the schedule and come back to you.
[25 Nov 2020 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".