Bug #91465 network is ok, but GR node been unreachable
Submitted: 28 Jun 2018 12:20 Modified: 16 Jul 2018 11:28
Reporter: 冯 国强
Status: Can't repeat
Category: MySQL Server: Group Replication Severity: S3 (Non-critical)
Version: 5.7.20-log OS: Linux (Red Hat 4.8.5-16)
Assigned to: Bogdan Kecman CPU Architecture: x86

[28 Jun 2018 12:20] 冯 国强
Description:

Apologies for my poor English!

We have 3 Group Replication nodes: oms-mysql-01 (read/write node), oms-mysql-02 (backup node), and oms-mysql-03 (write node).

Sometimes Group Replication halts, like this:

oms-mysql-01: 
2018-06-28T16:15:37.287348+08:00 0 [Warning] Plugin group_replication reported: 'Member with address oms-mysql-03l:6603 has become unreachable.'
 
 
oms-mysql-03:
2018-06-28T16:15:41.293922+08:00 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.' 

However, a ping script runs as a scheduled task on oms-mysql-01, and here is its log (10.10.19.27 is oms-mysql-03):

ip=10.10.19.27 rtt=1.1 ms seq=190742 20180628161517
ip=10.10.19.27 rtt=0.3 ms seq=190743 20180628161518
ip=10.10.19.27 rtt=0.3 ms seq=190744 20180628161519
ip=10.10.19.27 rtt=0.3 ms seq=190745 20180628161520
ip=10.10.19.27 rtt=0.3 ms seq=190746 20180628161521
ip=10.10.19.27 rtt=0.5 ms seq=190747 20180628161522
ip=10.10.19.27 rtt=0.3 ms seq=190748 20180628161523
ip=10.10.19.27 rtt=0.3 ms seq=190749 20180628161524
ip=10.10.19.27 rtt=0.3 ms seq=190750 20180628161525
ip=10.10.19.27 rtt=0.3 ms seq=190751 20180628161526
ip=10.10.19.27 rtt=0.3 ms seq=190752 20180628161527
ip=10.10.19.27 rtt=0.3 ms seq=190753 20180628161529
ip=10.10.19.27 rtt=0.3 ms seq=190754 20180628161530
ip=10.10.19.27 rtt=1.5 ms seq=190755 20180628161531
ip=10.10.19.27 rtt=0.4 ms seq=190756 20180628161532
ip=10.10.19.27 rtt=0.3 ms seq=190757 20180628161533
ip=10.10.19.27 rtt=0.3 ms seq=190758 20180628161534
ip=10.10.19.27 rtt=0.3 ms seq=190759 20180628161535
ip=10.10.19.27 rtt=0.3 ms seq=190760 20180628161536
ip=10.10.19.27 rtt=0.3 ms seq=190761 20180628161537
ip=10.10.19.27 rtt=1.5 ms seq=190762 20180628161538
ip=10.10.19.27 rtt=7.4 ms seq=190763 20180628161539
ip=10.10.19.27 rtt=0.9 ms seq=190764 20180628161540
ip=10.10.19.27 rtt=0.3 ms seq=190765 20180628161541
ip=10.10.19.27 rtt=0.4 ms seq=190766 20180628161542
ip=10.10.19.27 rtt=0.4 ms seq=190767 20180628161544
ip=10.10.19.27 rtt=0.3 ms seq=190768 20180628161545
ip=10.10.19.27 rtt=0.3 ms seq=190769 20180628161546
ip=10.10.19.27 rtt=0.4 ms seq=190770 20180628161547
ip=10.10.19.27 rtt=0.5 ms seq=190771 20180628161548
ip=10.10.19.27 rtt=0.4 ms seq=190772 20180628161549
ip=10.10.19.27 rtt=0.4 ms seq=190773 20180628161550
ip=10.10.19.27 rtt=0.4 ms seq=190774 20180628161551

So why did oms-mysql-03 become unreachable?

(Member with address oms-mysql-03l:6603 has become unreachable)

Thanks!

 

How to repeat:
none
[28 Jun 2018 12:21] 冯 国强
log

Attachment: report.txt (text/plain), 8.86 KiB.

[16 Jul 2018 11:28] Bogdan Kecman
Hi,

Ping is not reliable proof that the network is OK, since it sends small, sparse packets. A flood ping with large packets would be more likely to expose a problem, but it loads both the port and the network, so it is not something you can run continuously. You can still use it to test your network.
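
To illustrate how little a once-per-second ping log can show, here is a minimal Python sketch that parses lines in the format posted in the description and flags two things: consecutive samples more than a second apart, and slow replies. The function names and both thresholds (`max_gap_s`, `rtt_limit_ms`) are assumptions made for this example, not part of the reporter's script. In the posted log the timestamps skip 16:15:28 and 16:15:43, and one reply at 16:15:39 took 7.4 ms. A skipped second may only reflect the script's own timing, so treat the output as a hint, not as proof of packet loss.

```python
import re
from datetime import datetime

# Matches lines in the format posted in the report, e.g.
# "ip=10.10.19.27 rtt=0.3 ms seq=190742 20180628161517"
LINE_RE = re.compile(r"ip=(\S+) rtt=([\d.]+) ms seq=(\d+) (\d{14})")


def parse_ping_log(text):
    """Return a list of (seq, rtt_ms, timestamp) tuples from the log text."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(4), "%Y%m%d%H%M%S")
            samples.append((int(m.group(3)), float(m.group(2)), ts))
    return samples


def find_anomalies(samples, max_gap_s=1.0, rtt_limit_ms=5.0):
    """Flag consecutive samples further apart than max_gap_s, and slow replies.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    gaps = []
    for prev, cur in zip(samples, samples[1:]):
        delta = (cur[2] - prev[2]).total_seconds()
        if delta > max_gap_s:
            gaps.append((prev[0], cur[0], delta))
    slow = [(seq, rtt) for seq, rtt, _ in samples if rtt > rtt_limit_ms]
    return gaps, slow


if __name__ == "__main__":
    # Three consecutive lines copied from the log in the description.
    sample = (
        "ip=10.10.19.27 rtt=0.3 ms seq=190752 20180628161527\n"
        "ip=10.10.19.27 rtt=0.3 ms seq=190753 20180628161529\n"
        "ip=10.10.19.27 rtt=0.3 ms seq=190754 20180628161530\n"
    )
    gaps, slow = find_anomalies(parse_ping_log(sample))
    print("gaps:", gaps)
    print("slow replies:", slow)
```

Even a clean result here says nothing about what happens between samples or with larger packets, which is the limitation described above.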

Anyhow, I can't reproduce this problem unless I deliberately introduce network issues, so I assume this is not a bug but that your network is somehow faulty.

thanks
Bogdan
[5 Nov 2018 12:06] test sdf
Same issue.
I have a master-slave setup and am trying to move to an InnoDB Cluster of 3 nodes, so one node acts as a slave that syncs from the master-slave setup.
Once a huge update (19M rows) tries to execute, the cluster throws:
2018-11-05T10:57:31.746778Z 0 [Warning] Plugin group_replication reported: 'Member with address db02:3306 has become unreachable.'
2018-11-05T10:57:31.753144Z 0 [Warning] Plugin group_replication reported: 'Member with address db03:3306 has become unreachable.'
2018-11-05T10:57:31.753942Z 0 [ERROR] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
2018-11-05T10:57:40.567265Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
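
For reference, the time between the first "unreachable" warning and the expulsion can be read straight from the timestamps in the lines above; it comes to roughly 8.8 seconds. The following Python sketch does that calculation for any error-log excerpt in the same format. The helper names are hypothetical, and the matching relies only on the message text shown in this thread.

```python
import re
from datetime import datetime

# Leading timestamp of an error-log line, as in the excerpts in this thread:
# either a trailing "Z" or a numeric offset such as "+08:00".
TS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6})(Z|[+-]\d{2}:\d{2})"
)


def log_time(line):
    """Parse the leading timestamp of a log line into an aware datetime."""
    m = TS_RE.match(line.strip())
    if not m:
        raise ValueError("no timestamp found in line")
    offset = "+00:00" if m.group(2) == "Z" else m.group(2)
    return datetime.fromisoformat(m.group(1) + offset)


def seconds_to_expel(lines):
    """Seconds from the first 'unreachable' warning to the first expulsion.

    Returns None if either message is missing from the excerpt.
    """
    first_unreachable = None
    expelled = None
    for line in lines:
        if "has become unreachable" in line and first_unreachable is None:
            first_unreachable = log_time(line)
        elif "was expelled from the group" in line and expelled is None:
            expelled = log_time(line)
    if first_unreachable is None or expelled is None:
        return None
    return (expelled - first_unreachable).total_seconds()
```

When the two messages come from different hosts, as in the original report, the result also includes any clock difference between those hosts.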

At this point the transaction is rolled back. This happens every time I restart the slave and execute the huge update.

Please note that with a simple Master->Slave setup the huge update passes, so my conclusion is that it's a bug in the group replication plugin.
[5 Nov 2018 13:50] Bogdan Kecman
I cannot reproduce this with a normal network, only on shady cloud setups where the network is problematic.

Look at the latest 8.0.13, where there is

group_replication_member_expel_timeout:
https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/group-replication-options.html#sysvar_gr...

Also check this link for more details
https://mysqlhighavailability.com/group-replication-coping-with-unreliable-failure-detecti...
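
As a rough illustration of applying that option, here is a small Python sketch that builds the `SET GLOBAL` statement for `group_replication_member_expel_timeout`. The helper name and the example value of 30 seconds are hypothetical, and the upper bound of 3600 seconds is an assumption based on the 8.0.13 manual; check the documentation for your server version before relying on it.

```python
# Upper bound as documented for MySQL 8.0.13. This is an assumption for the
# example; verify it against the manual for the server version in use.
MAX_EXPEL_TIMEOUT_S = 3600


def expel_timeout_statement(seconds):
    """Build the SET GLOBAL statement for group_replication_member_expel_timeout."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("seconds must be an integer")
    if not 0 <= seconds <= MAX_EXPEL_TIMEOUT_S:
        raise ValueError(
            "seconds must be between 0 and %d" % MAX_EXPEL_TIMEOUT_S
        )
    return "SET GLOBAL group_replication_member_expel_timeout = %d;" % seconds


if __name__ == "__main__":
    # 30 is an arbitrary example value, not a recommendation.
    print(expel_timeout_statement(30))
```

The resulting statement would be run on each member from a client session with sufficient privileges. A larger value gives a suspected member more time to recover before it is expelled, at the cost of the group waiting longer on a member that really is gone.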
[25 Jun 2019 4:45] Yoshihide Miyuki
Bogdan Kecman's answer applies to MySQL 8.0. Is there an equivalent for MySQL 5.7?
[25 Jun 2019 7:55] Bogdan Kecman
Hi,

No. Either use the latest 8.x or set up the system on a network that works properly.

I strongly suggest using a setup with proper network quality.