Bug #91356 Member state is Recovering, but group_replication_recovery thread is not running
Submitted: 21 Jun 2018 11:51 Modified: 9 Jul 2018 7:58
Reporter: Bingwei Wang Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:8.0.11 OS:Any
Assigned to: CPU Architecture:Any

[21 Jun 2018 11:51] Bingwei Wang
Description:
The problem I encounter is: one node in group shows a Recovering Member State, but the group_replication_recovery is not running.

mysql> select * from replication_applier_status;
+----------------------------+---------------+-----------------+----------------------------+
| CHANNEL_NAME               | SERVICE_STATE | REMAINING_DELAY | COUNT_TRANSACTIONS_RETRIES |
+----------------------------+---------------+-----------------+----------------------------+
| group_replication_applier  | ON            |            NULL |                          0 |
| group_replication_recovery | OFF           |            NULL |                          0 |
+----------------------------+---------------+-----------------+----------------------------+
2 rows in set (0.01 sec)

mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST     | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
| group_replication_applier | ec5e51ae-68a0-11e8-8571-52540017b589 | 192.168.216.145 |        3316 | ONLINE       | PRIMARY     | 8.0.11         |
| group_replication_applier | ec725fcd-692c-11e8-bb5f-525400f9f72b | 192.168.216.147 |        3316 | RECOVERING   | PRIMARY     | 8.0.11         |
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
2 rows in set (0.02 sec)

Actually, the applier thread is running:

mysql> show processlist;
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
| Id  | User            | Host      | db                 | Command | Time    | State                                             | Info             |
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
|   4 | event_scheduler | localhost | NULL               | Daemon  | 1316371 | Waiting on empty queue                            | NULL             |
| 256 | system user     |           | NULL               | Connect |    1039 | Suspending                                        | NULL             |
| 259 | system user     |           | NULL               | Query   |       0 | Waiting for slave workers to process their queues | NULL             |
| 260 | system user     |           | NULL               | Query   |    2770 | Executing event                                   | NULL             |
| 261 | system user     |           | NULL               | Query   |    2770 | Executing event                                   | NULL             |
| 266 | root            | localhost | performance_schema | Query   |       0 | starting                                          | show processlist |
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
6 rows in set (0.10 sec)

I also checked the back-log in applier queue:

mysql> select Count_transactions_remote_in_applier_queue from performance_schema.replication_group_member_stats where member_id=@@server_uuid;
+--------------------------------------------+
| Count_transactions_remote_in_applier_queue |
+--------------------------------------------+
|                                          0 |
+--------------------------------------------+
1 row in set (0.05 sec)

It shows 0, apparently not right. As I just finished a load test using sysbench, it lagged behind a lot.

How to repeat:
1. Set up GR with two nodes using multi-primary mode (three nodes should produce the same result.), and using MTS with 2 applier threads.

2. Using sysbench to generate some write load to one node (Node1), to make the other node (Node2)lag behind.

3. On Node2, restart group_replication.

4. Check member status by querying table replication_group_members 

5. Check applier status by querying table replication_applier_status

6 Check the back-log queue of the applier by querying replication_group_member_stats

Suggested fix:
Well ... it's hard to say. Actually , If recovering state is right for node 2 ,  the recovery thread should be used to handle the back-log;

If it is really necessary to  use the applier thread, the member state should be marked as online and correct the value of column Count_transactions_remote_in_applier_queue in table replication_group_member_stats . 

Both  will cause less confusing
[9 Jul 2018 7:58] Bingwei Wang
The design of distributed recovery  uses a 2-phase recovery mechanism.  In this case, phase 1 cannot be started until all the back-log in applier binlog are consumed. So this is a phase prior to any of the 2-phase mentioned in the distributed recovery. However the joiner is marked as read only before its state is changed to ONLINE, it doesn't cause any critical issue. Maybe it is not a bug. ^-^