Bug #88414 After online node is restored, the state has not changed to mysql5.7.20
Submitted: 9 Nov 2017 7:42 Modified: 15 Jan 2018 12:40
Reporter: lu kong Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:mysql5.7.20 OS:Red Hat (7.3)
Assigned to: CPU Architecture:Any

[9 Nov 2017 7:42] lu kong
Description:
1.mysql5.7.20 multi master environment, after abnormal recovery, after data synchronization, the state has been RECOVERING

Prerequisite:
Multi master state, using sqltest script service
10.46.178.140  
10.46.178.19
10.46.178.152

Operation steps:
178.140:
mysql> STop GROUP_REPLICATION;
Query OK, 0 rows affected (1.00 sec)

mysql> SET GLOBAL group_replication_bootstrap_group=ON; 
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION;     
Query OK, 0 rows affected (2.01 sec)

mysql>

mysql> SET GLOBAL group_replication_bootstrap_group=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql

178.19:
mysql> STop GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> START GROUP_REPLICATION;                       
SET GLOBAL group_replication_bootstrap_group=OFF;

ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.
mysql> SET GLOBAL group_replication_bootstrap_group=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> 
mysql> 
mysql> 
mysql> set  global  group_replication_allow_local_disjoint_gtids_join=ON;  
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION; 
Query OK, 0 rows affected (3.32 sec)

mysql> 
mysql> 
mysql> select count(*) from test.TBL_TRANSACTION_TRACE_1;
+----------+
| count(*) |
+----------+
|     3803 |
+----------+
1 row in set (0.01 sec)

mysql>

178.152:
mysql> 
mysql> set  global  group_replication_allow_local_disjoint_gtids_join=ON;  
Query OK, 0 rows affected (0.00 sec)

mysql> START GROUP_REPLICATION; 
Query OK, 0 rows affected (6.04 sec)

mysql>select count(*) from test.TBL_TRANSACTION_TRACE_1;
+----------+
| count(*) |
+----------+
|    22846 |
+----------+
1 row in set (0.02 sec)

mysql>

178.19 after the environment starts, it's still in sync
mysql> select count(*) from test.TBL_TRANSACTION_TRACE_1;
+----------+
| count(*) |
+----------+
|    16533 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from test.TBL_TRANSACTION_TRACE_1;
+----------+
| count(*) |
+----------+
|    16560 |
+----------+
1 row in set (0.02 sec)

mysql> 
mysql> select count(*) from test.TBL_TRANSACTION_TRACE_1;
+----------+
| count(*) |
+----------+
|    22846 |
+----------+
1 row in set (0.03 sec)

View group replication status:
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 142076a7-becb-11e7-87e1-00e081d4b651 | centos19    |        3308 | RECOVERING   |
| group_replication_applier | 243880d9-becb-11e7-ad9f-e0071bf256bc | centos140   |        3308 | ONLINE       |
| group_replication_applier | 97484eff-bed3-11e7-b3cb-e0071bf246f4 | redhat152   |        3308 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

mysql> 

Test result:
178.19 after synchronization, the state has been RECOVERING
178.152 and 178.140 updated successfully
152:
mysql> create database a1;
Query OK, 1 row affected (0.00 sec)

mysql> 
140:
mysql> create database a3;
Query OK, 1 row affected (0.01 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| a1                 |
| a3                 |
| abc                |
| mysql              |
| performance_schema |
| sys                |
| test               |
| tt                 |
+--------------------+
9 rows in set (0.00 sec)

mysql> 

178.19 update failed:
mysql> create database a2;
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
mysql> 

You need to restart the 178.19 groups of replication services for recovery
Journal:
The 3 node prints the same log
2017-11-01T08:39:36.712874Z 0 [Warning] Plugin group_replication reported: 'The member with address centos140:3308 has already sent the stable set. Therefore discarding the second message.'
2017-11-01T08:39:53.328047Z 0 [Warning] Plugin group_replication reported: 'The member with address centos19:3308 has already sent the stable set. Therefore discarding the second message.'
2017-11-01T08:40:36.722983Z 0 [Warning] Plugin group_replication reported: 'The member with address centos140:3308 has already sent the stable set. Therefore discarding the second message.'
2017-11-01T08:40:53.345927Z 0 [Warning] Plugin group_replication reported: 'The member with address centos19:3308 has already sent the stable set. Therefore discarding the second message.'

How to repeat:
When the service is down, the down drops the network card to boot the host (the service is guided by the main down pressure)

Suggested fix:

After synchronization, the status is updated to onlie
[15 Jan 2018 9:38] Bogdan Kecman
Hi,
I reproduced the issue but would still like to better understand your use case.. the 

"When the service is down, the down drops the network card to boot the host (the service is guided by the main down pressure)"

what do you mean by that?

The bug is IMO not in the fact that node is showing "recovering" as the node obviously is not recovered (update don't work for e.g.), but the bug is in fact that node "never recovers" unless you restart it.

All best
Bogdan
[15 Jan 2018 12:40] Nuno Carvalho
Posted by developer:
 
Hi,

The problem is here:

178.19:
mysql> STop GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> START GROUP_REPLICATION;                       
SET GLOBAL group_replication_bootstrap_group=OFF;

ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.

This ^ START failed because 178.19 did not have a subset of 178.140 data, that is, it had transactions not present on 178.140.
The detailed log message is on the error log.

Then you did use:
mysql> set  global  group_replication_allow_local_disjoint_gtids_join=ON;  
Query OK, 0 rows affected (0.00 sec)

Which allowed 178.19 to join, but it was not able to synchronize with the group.

The option group_replication_allow_local_disjoint_gtids_join will be deprecated on 5.7.21 and removed on 8.0 to avoid these scenarios.
The manual already explains that on
https://dev.mysql.com/doc/refman/5.7/en/group-replication-options.html#sysvar_group_replic...

Please start your group servers from the same data.

Best regards,
Nuno Carvalho