Description:
The problem I encountered is: one node in the group shows a RECOVERING member state, but the group_replication_recovery channel is not running.
mysql> select * from replication_applier_status;
+----------------------------+---------------+-----------------+----------------------------+
| CHANNEL_NAME               | SERVICE_STATE | REMAINING_DELAY | COUNT_TRANSACTIONS_RETRIES |
+----------------------------+---------------+-----------------+----------------------------+
| group_replication_applier  | ON            | NULL            |                          0 |
| group_replication_recovery | OFF           | NULL            |                          0 |
+----------------------------+---------------+-----------------+----------------------------+
2 rows in set (0.01 sec)
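The per-worker state of the applier channel can be cross-checked as well; this is just the query (a sketch, I did not preserve that output):
select CHANNEL_NAME, WORKER_ID, THREAD_ID, SERVICE_STATE
from performance_schema.replication_applier_status_by_worker;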
mysql> select * from replication_group_members;
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST     | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
| group_replication_applier | ec5e51ae-68a0-11e8-8571-52540017b589 | 192.168.216.145 |        3316 | ONLINE       | PRIMARY     | 8.0.11         |
| group_replication_applier | ec725fcd-692c-11e8-bb5f-525400f9f72b | 192.168.216.147 |        3316 | RECOVERING   | PRIMARY     | 8.0.11         |
+---------------------------+--------------------------------------+-----------------+-------------+--------------+-------------+----------------+
2 rows in set (0.02 sec)
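To rule out a half-started recovery channel on the receiver side, replication_connection_status can be checked too (query only, as a suggestion):
select CHANNEL_NAME, SERVICE_STATE
from performance_schema.replication_connection_status;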
Actually, the applier threads are running:
mysql> show processlist;
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
| Id  | User            | Host      | db                 | Command | Time    | State                                             | Info             |
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
|   4 | event_scheduler | localhost | NULL               | Daemon  | 1316371 | Waiting on empty queue                            | NULL             |
| 256 | system user     |           | NULL               | Connect |    1039 | Suspending                                        | NULL             |
| 259 | system user     |           | NULL               | Query   |       0 | Waiting for slave workers to process their queues | NULL             |
| 260 | system user     |           | NULL               | Query   |    2770 | Executing event                                   | NULL             |
| 261 | system user     |           | NULL               | Query   |    2770 | Executing event                                   | NULL             |
| 266 | root            | localhost | performance_schema | Query   |       0 | starting                                          | show processlist |
+-----+-----------------+-----------+--------------------+---------+---------+---------------------------------------------------+------------------+
6 rows in set (0.10 sec)
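The same threads are also visible in performance_schema.threads; a sketch assuming the pre-8.0.26 thread naming still used in 8.0.11:
select THREAD_ID, NAME, PROCESSLIST_STATE
from performance_schema.threads
where NAME like 'thread/sql/slave%';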
I also checked the backlog in the applier queue:
mysql> select Count_transactions_remote_in_applier_queue from performance_schema.replication_group_member_stats where member_id=@@server_uuid;
+--------------------------------------------+
| Count_transactions_remote_in_applier_queue |
+--------------------------------------------+
|                                          0 |
+--------------------------------------------+
1 row in set (0.05 sec)
It shows 0, which is apparently not right: I had just finished a load test with sysbench, so this node lagged far behind.
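For reference, the lag itself can be confirmed by comparing the executed GTID sets of the two nodes ('<gtid_executed from Node1>' below is a placeholder for Node1's actual value):
select @@global.gtid_executed;
select GTID_SUBTRACT('<gtid_executed from Node1>', @@global.gtid_executed) as missing;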
How to repeat:
1. Set up Group Replication with two nodes in multi-primary mode (three nodes should produce the same result), using multi-threaded appliers (MTS) with 2 applier threads.
2. Use sysbench to generate write load on one node (Node1), so that the other node (Node2) lags behind (see the sketch after this list).
3. On Node2, restart Group Replication.
4. Check the member status by querying the table replication_group_members.
5. Check the applier status by querying the table replication_applier_status.
6. Check the backlog queue of the applier by querying replication_group_member_stats.
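A concrete sketch for steps 1-3; the host, port, credentials, and sysbench options are just example values from my test setup (sysbench 1.0, oltp_write_only), adjust as needed:
# my.cnf fragment on both nodes for the MTS part of step 1 (variable names as of 8.0.11)
slave_parallel_type         = LOGICAL_CLOCK
slave_parallel_workers      = 2
slave_preserve_commit_order = ON
# step 2: write load against Node1
sysbench oltp_write_only --mysql-host=192.168.216.145 --mysql-port=3316 \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest \
  --tables=8 --table-size=100000 --threads=16 --time=300 prepare
sysbench oltp_write_only --mysql-host=192.168.216.145 --mysql-port=3316 \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest \
  --tables=8 --table-size=100000 --threads=16 --time=300 run
# step 3: on Node2
mysql> STOP GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;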
Suggested fix:
Well ... it's hard to say. If RECOVERING is the right state for Node2, then the recovery thread should be the one handling the backlog.
If it is really necessary to use the applier thread for this, the member state should be marked ONLINE instead, and the value of the Count_transactions_remote_in_applier_queue column in replication_group_member_stats should be corrected.
Either way would be less confusing.