Bug #91433 when a member goes to ERROR, other members see it UNREACHABLE
Submitted: 27 Jun 2018 8:12
Modified: 6 Sep 2018 16:01
Reporter: Dhruthi Komarlu Vasudeva Murthy
Status: Closed
Category: MySQL Server: Group Replication
Severity: S3 (Non-critical)
Version: 8.0.13-wl11570
OS: Any
Assigned to:
CPU Architecture: Any

[27 Jun 2018 8:12] Dhruthi Komarlu Vasudeva Murthy
Description:
When a member (M) fails to process messages from GCS and goes to ERROR state, the other
members of the group consider M to be UNREACHABLE and wait for it to return.
This is easy to hit with WL#11570.

Scenario:
Consider a group of 5 members (e.g. with gr_member_expel_timeout=2000s):
1. Start a sysbench load on the 5 members.
2. Drop the network on M5.
3. Wait for ~1800s and restore the network on M5.
4. M5 goes to ERROR state.
   The other members of the group see M5 as UNREACHABLE and wait for it again for gr_member_expel_timeout seconds.
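
The timeline of this scenario can be sketched with a little arithmetic. This is a simplified model, not Group Replication code: the function name `eviction_time` is hypothetical, M5 is assumed to enter ERROR at the moment the network is restored, and the short failure-detection delay before a suspicion is raised is ignored.

```python
def eviction_time(expel_timeout_s, outage_s):
    """Return (evicted_at_s, extra_wait_s) for the scenario above.

    Simplified model: the member is cut off at t=0, reconnects at
    t=outage_s, immediately enters ERROR, and is then suspected afresh.
    """
    if outage_s >= expel_timeout_s:
        # The first suspicion expires during the outage, so the member
        # is expelled before it ever reconnects.
        return expel_timeout_s, 0
    # The first suspicion is cancelled on reconnect; the second one
    # runs for a full expel timeout before the member is evicted.
    return outage_s + expel_timeout_s, expel_timeout_s


evicted_at, extra_wait = eviction_time(expel_timeout_s=2000, outage_s=1800)
print(f"M5 evicted at ~{evicted_at}s, {extra_wait}s after it entered ERROR")
```

Under these assumptions the group carries a member that will never return for another full timeout period, which is the behaviour this report is about.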

From M5's errorlog:
...
2018-06-26T17:43:11.698016Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage08:30005 is reachable again.'
2018-06-26T17:43:11.698072Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage09:30007 is reachable again.'
2018-06-26T17:43:12.237145Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Node 4 is unable to get messages, since the group is too far ahead. Node will now exit.'
2018-06-26T17:43:12.305510Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The node has missed messages that can no longer be recovered from the other nodes' caches. GCS will now terminate.'

From M1's errorlog:
...
2018-06-26T17:43:09.259382Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage17:30009 is reachable again.'
::xcom_receive_local_view invoked! CV Nodes 5 LV Nodes 5
::xcom_receive_local_view: Local view has node brage06:30101
::xcom_receive_local_view: Local view has node brage07:30103
::xcom_receive_local_view: Local view has node brage08:30105
::xcom_receive_local_view: Local view has node brage09:30107
::xcom_receive_local_view: Local view has node brage17:30109
::xcom_receive_local_view: Local view has suspected member: brage17:30109
2018-06-26T17:43:17.257756Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address brage17:30009 has become unreachable.'
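
To relate the two logs, the gap between M5's GCS termination and M1 marking it unreachable can be computed from the timestamps. This is a minimal sketch: the helper `log_time` is hypothetical, the log lines are truncated to their leading fields, and the comparison assumes the clocks on both hosts are synchronized.

```python
from datetime import datetime


def log_time(line):
    """Parse the leading UTC timestamp of an error log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


# Lines shortened for the example; only the timestamp is used.
m5_gcs_exit = "2018-06-26T17:43:12.305510Z 0 [ERROR] [MY-011735]"
m1_unreachable = "2018-06-26T17:43:17.257756Z 0 [Warning] [MY-011493]"

gap = (log_time(m1_unreachable) - log_time(m5_gcs_exit)).total_seconds()
print(f"M1 marked M5 unreachable {gap:.3f}s after M5's GCS terminated")
```

The gap is just under five seconds, so M1 learned of M5's exit only through failure detection rather than from M5 itself.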

Note: In case of an applier error, if a member goes to ERROR state, the other members
detect it and evict the member immediately.

How to repeat:
See description.

Suggested fix:
When a member goes to ERROR state, instead of waiting for it, the other members should be able to detect this and evict the member right away (as happens in the case of an applier error). There is no point in waiting for a member which will not return.

[6 Sep 2018 16:01] David Moss
Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 8.0.13 changelog:
When a group member resumes after being suspended for some time and is not able to process all pending messages, it enters the ERROR state. However, the remaining members see it as UNREACHABLE, and wait until the member's suspicion expires to evict it from the group. The behavior has now been modified and a member stopping due to some error tries to connect to a known peer to request its removal from the group, before installing the leave view.
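
The ordering described in the changelog entry can be illustrated with a toy model. This is only a sketch of the idea, not the server's implementation: `Group`, `leave_on_error`, and the `reachable` set are hypothetical names, and the real plugin performs this exchange through GCS/XCom rather than a shared object.

```python
class Group:
    """Toy stand-in for the group's membership list (hypothetical)."""

    def __init__(self, members):
        self.members = set(members)

    def remove(self, member):
        self.members.discard(member)


def leave_on_error(member, peers, reachable, group):
    """Ask the first reachable peer to remove `member` from the group.

    Returns the peer that handled the request, or None if no peer could
    be contacted, in which case the group still has to wait for the
    suspicion to expire.
    """
    handled_by = None
    for peer in peers:
        if peer != member and peer in reachable:
            group.remove(member)  # the contacted peer requests the removal
            handled_by = peer
            break
    # Only after this attempt would the failing member install its
    # own leave view locally.
    return handled_by


group = Group(["M1", "M2", "M3", "M4", "M5"])
peer = leave_on_error("M5", ["M1", "M2"], reachable={"M2"}, group=group)
print(peer, sorted(group.members))
```

In this model the remaining members drop M5 as soon as one peer is contacted, instead of holding it as UNREACHABLE for the full expel timeout.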