Bug #91722 MGR may become a one-member-online system but still provide service
Submitted: 20 Jul 2018 3:08 Modified: 14 Aug 2018 9:21
Reporter: Fengchun Hua Email Updates:
Status: Verified Impact on me:
None 
Category: MySQL Server: Group Replication Severity: S1 (Critical)
Version: 5.7.22 OS: Any
Assigned to: CPU Architecture: Any

[20 Jul 2018 3:08] Fengchun Hua
Description:
When the network of all the secondaries is shut down and then restored, the group may become a one-member-online system: only one node's status is ONLINE, while the others' status is ERROR.
However, the one ONLINE node continues to provide service, which conflicts with the official documentation. This situation carries a risk of data loss.

How to repeat:
I run three nodes on the same machine, using iptables to block all the ports these three nodes use.
I use 33061, 33062 and 33063 as the mysqld service ports, and 33071, 33072 and 33073 as the MGR ports.
This situation is not easy to reproduce; you have to:
1. Shut down the network of all the nodes at the same time.
2. Wait 10-15 seconds.
3. Restore the network.
4. Check the MGR members' status.
Sometimes you will see the status below.

Node 1 (primary)
mysql: [Warning] Using a password on the command line interface can be insecure.
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST               | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| group_replication_applier | bbb47f53-8bc3-11e8-8419-fa163ead0538 | euler-mysql-hfc.novalocal |       33061 | ONLINE       |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+

Node 2 (secondary)
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST               | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| group_replication_applier | c07867a7-8bc3-11e8-9307-fa163ead0538 | euler-mysql-hfc.novalocal |       33062 | ERROR        |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
Node 3 (secondary)
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST               | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+
| group_replication_applier | c6379093-8bc3-11e8-bfbb-fa163ead0538 | euler-mysql-hfc.novalocal |       33063 | ERROR        |
+---------------------------+--------------------------------------+---------------------------+-------------+--------------+

Suggested fix:
Consider this situation:
1. Cut off the network connections of the two secondary nodes.
2. The detector task finds that node 3 is unreachable and sends a new view 1,1,0 (1 = online, 0 = not online).
3. This view is accepted by node 1 and node 2; the view message becomes a learned message waiting to be installed.
4. The detector task finds that node 2 is also unreachable. With only node 1 online, no decision can be made: node 1 cannot reach a majority of members, so it blocks transactions.
5. The network connections are restored.
6. Node 3 recovers first. Node 1 detects the view change and sends a new view 1,0,1; node 1 regains contact with a majority of members and unblocks transactions.
7. This view is accepted by node 1 and node 3; the view message becomes a learned message waiting to be installed.
8. Node 2 recovers.
9. The MGR group executes both views: the first view kicks node 3 out, and the second view kicks node 2 out.
10. Node 1 is the only member online but still provides service.
The point is that once the first view is installed, the second view is expired and should not be installed. This is the key to solving the problem.
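The race in steps 3-10 can be modeled in a few lines. This is a simplified sketch, not the actual Xcom code: a learned view is reduced to the set of node ids it still considers online, and installing the queued views one by one intersects each with the current online set.

```cpp
#include <set>
#include <vector>

// Minimal model of the race: a learned view is the set of node ids it
// still considers online.  Installing the queued views one by one
// intersects each with the current online set, mimicking step 9.
using View = std::set<int>;

View install_all(View online, const std::vector<View>& learned) {
    for (const View& v : learned) {
        View next;
        for (int n : online)
            if (v.count(n)) next.insert(n);
        online = next;
    }
    return online;
}
```

With the scenario's two learned views, `install_all({1,2,3}, {{1,2},{1,3}})` leaves only node 1 online, which is exactly the one-member-online state of step 10.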

How to solve
•	Add a field named base_view to the app_data struct, recording which view this new view is based on (i.e. the current view at the time the new view was generated).
•	When a view_msg is learned, compare the node number of the currently learned view (which may not yet be completely installed) with the node number of this view_msg's base_view.
•	If the view_msg's base node number is bigger than the current view's node number, the view is new and should be installed. Otherwise the view is expired and should be dropped.
I tried this, and it does solve the problem.
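A minimal sketch of this staleness check, assuming a monotonically increasing view node number. The struct and field names are illustrative, not the real app_data layout; the report says "bigger than", while this sketch uses >= so that a message based on the currently learned view still installs.

```cpp
// Sketch of the proposed check.  Each view message records
// base_view_nodeno, the node number of the view that was current when
// the message was generated; both field names are hypothetical.
struct ViewMsg {
    int view_nodeno;       // monotonically increasing view number
    int base_view_nodeno;  // view this message was derived from
};

// Install only a message based on a view at least as new as the one we
// have already learned; an older base means the message raced with
// another view that won, so it is expired and should be dropped.
bool should_install(const ViewMsg& msg, int current_learned_nodeno) {
    return msg.base_view_nodeno >= current_learned_nodeno;
}
```

In the scenario above, the second view (step 7) was generated from the pre-partition view, so its base is older than the already-learned first view and it would be dropped.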
[9 Aug 2018 8:39] Fengchun Hua
The suggestion in my last comment is not good enough. I have found a better way to solve it,
in bool Gcs_xcom_control::xcom_receive_global_view:

1. Record the last view and the current member_ids. This member_ids differs from current_members: it pretends that already-expelled members have been kicked out, for the purpose of checking view conflicts.

2. Refresh member_ids when last_view is empty or the last view does not equal the current view.

3. Pretend the expelled members have been kicked out: erase them from member_ids.

4. After this pretend kick, check the member_ids count.

5. Drop any view that would lead to only one member being online.
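The steps above can be sketched as follows. The class and method names are hypothetical, not the actual Gcs_xcom_control code, and member ids are reduced to strings.

```cpp
#include <set>
#include <string>

// Hypothetical sketch of the second proposal: keep member_ids, a copy of
// the membership that pretends already-expelled members are gone, and use
// it to reject a view whose delivery would shrink the group to one member.
class ViewConflictChecker {
    std::set<std::string> member_ids_;  // members minus pretend-expelled ones

public:
    // Step 2: refresh member_ids when there is no last view or it changed.
    void on_new_view(const std::set<std::string>& current_members) {
        member_ids_ = current_members;
    }

    // Steps 3-5: pretend the expelled members are kicked out, then drop
    // any view that would leave only one member online.
    bool should_deliver(const std::set<std::string>& expelled) {
        for (const auto& id : expelled) member_ids_.erase(id);
        return member_ids_.size() > 1;
    }
};
```

In the scenario above, the first view expels node 3 (two members remain, so it is delivered); the second view then tries to expel node 2, which would leave one member, so it is dropped.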
[10 Aug 2018 14:11] MySQL Verification Team
Hi,

I do understand what you did
 - start 3 nodes
 - kill network
 - only master node is alive, other ones are disconnected

but I don't understand what you consider to be the bug here.

I understand you might like different behavior, but I don't see how the current behavior is a bug.

thanks
Bogdan
[13 Aug 2018 8:47] Fengchun Hua
1. By analyzing the code, I found that this situation occurs because MGR applies an expired view message.
2. I think only a majority of members can make a decision. The fact is, the primary kicks both secondaries out.
3. I don't think one member alone should provide service, because when the two secondaries are unreachable, MGR blocks all transactions.

So I consider it a bug. I am reporting it to guard against data loss. Our team treats it as a bug and will fix it.
[13 Aug 2018 15:25] MySQL Verification Team
Hi,

Well, I did verify the behavior. I'm not 100% sure whether it's a bug, so I'll leave it to the replication team to decide whether they want to "fix" it or leave it as is. I will set it to "verified".

Thank you for your report!

kind regards
Bogdan
[14 Aug 2018 9:21] Fengchun Hua
Thanks!
[1 Nov 2018 8:40] Gosin Gu
This bug also affected me.