Description:
When a member (M) fails to process messages from GCS and goes to ERROR state, the other
members of the group consider M UNREACHABLE and wait for it to return.
This is easy to hit with WL#11570.
Scenario:
Consider a group of 5 members (e.g. with gr_member_expel_timeout=2000s).
1. Start sysbench load on 5 members.
2. Drop the network on M5.
3. Wait for ~1800s and restore network on M5.
4. M5 goes to ERROR state.
The other members of the group see M5 as UNREACHABLE and wait for it again for gr_member_expel_timeout seconds.
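To put a number on the scenario, here is a rough sketch of how long M5 can stay in the group's view after the initial network drop. It assumes a simplified model in which the wait restarts in full once M5 becomes reachable and then unreachable again; the function name and the model are illustrative only and are not taken from the plugin.

```python
def total_wait_seconds(outage_s, expel_timeout_s):
    """Seconds from the first network drop until the group expels the member.

    Assumption (simplified model, not plugin code): if the member comes back
    before the expel timeout and then goes to ERROR, the other members start
    a fresh wait of expel_timeout_s seconds.
    """
    if outage_s >= expel_timeout_s:
        # The member is expelled during the first outage.
        return expel_timeout_s
    # The member returned in time, went to ERROR, and is waited for again.
    return outage_s + expel_timeout_s


# Values from the scenario: network restored after ~1800s, timeout of 2000s.
print(total_wait_seconds(1800, 2000))
```

Under that assumption, the group keeps a member that will never return for roughly 1800 + 2000 seconds.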
From M5's errorlog:
...
2018-06-26T17:43:11.698016Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage08:30005 is reachable again.'
2018-06-26T17:43:11.698072Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage09:30007 is reachable again.'
2018-06-26T17:43:12.237145Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Node 4 is unable to get messages, since the group is too far ahead. Node will now exit.'
2018-06-26T17:43:12.305510Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The node has missed messages that can no longer be recovered from the other nodes' caches. GCS will now terminate.'
From M1's errorlog:
...
2018-06-26T17:43:09.259382Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address brage17:30009 is reachable again.'
::xcom_receive_local_view invoked! CV Nodes 5 LV Nodes 5
::xcom_receive_local_view: Local view has node brage06:30101
::xcom_receive_local_view: Local view has node brage07:30103
::xcom_receive_local_view: Local view has node brage08:30105
::xcom_receive_local_view: Local view has node brage09:30107
::xcom_receive_local_view: Local view has node brage17:30109
::xcom_receive_local_view: Local view has suspected member: brage17:30109
2018-06-26T17:43:17.257756Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address brage17:30009 has become unreachable.'
Note: In case of an applier error, if a member goes to ERROR state, the other members
detect it and evict the member immediately.
How to repeat:
See description.
Suggested fix:
When a member goes to ERROR state, the other members should be able to detect this and evict the member right away instead of waiting for it (as already happens in case of an applier error). There is no point in waiting for a member that will not return.
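The requested behaviour can be sketched as a small decision rule. This is only an illustration of the suggestion, not the plugin's implementation: MemberState and should_expel are hypothetical names, and the sketch assumes the group can learn that the member has entered ERROR state.

```python
from enum import Enum


class MemberState(Enum):
    # Hypothetical states used for illustration only.
    ONLINE = "ONLINE"
    UNREACHABLE = "UNREACHABLE"
    ERROR = "ERROR"


def should_expel(state, unreachable_for_s, expel_timeout_s):
    """Return True if the group should evict the member now."""
    if state is MemberState.ERROR:
        # Proposed change: a member in ERROR state will not return,
        # so evict it without waiting for the timeout.
        return True
    if state is MemberState.UNREACHABLE:
        # Existing behaviour: wait until the expel timeout has elapsed.
        return unreachable_for_s >= expel_timeout_s
    return False


print(should_expel(MemberState.ERROR, 0, 2000))
print(should_expel(MemberState.UNREACHABLE, 8, 2000))
```

With this rule, M5 from the scenario would be evicted as soon as it reports ERROR, while a member that is merely unreachable still gets the full gr_member_expel_timeout.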