MySQL Bugs: #91870: ALL MEMBERS ARE EXITED FROM GR WHEN 1/4 NODE TRIED TO REJOIN AFTER network drop

Bug #91870	ALL MEMBERS ARE EXITED FROM GR WHEN 1/4 NODE TRIED TO REJOIN AFTER network drop
Submitted:	2 Aug 2018 11:33	Modified:	12 Dec 2018 14:20
Reporter:	Ramana Yeruva
Status:	Closed
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	8.0.13	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
All nodes leave Group Replication when the network connection of one node is dropped and later restored while the servers are under heavy load. The read-write node logs the following messages:

2018-08-02T10:27:13.247443Z 0 [Note] [MY-011501] [Repl] Plugin group_replication reported: 'Members joined the group: vale19:3308'
2018-08-02T10:27:13.247525Z 0 [Note] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to vale19:3306, vale19:3307, vale19:3308, vale19:3309 on view 15332050649063362:6.'
2018-08-02T10:30:20.002772Z 20 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel 'group_replication_applier': Event too big
2018-08-02T10:30:20.002833Z 20 [ERROR] [MY-013121] [Repl] Slave SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: MY-013121
2018-08-02T10:30:20.002860Z 20 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2018-08-02T10:30:20.002876Z 20 [ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2018-08-02T10:31:01.750784Z 17 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2018-08-02T10:31:01.754373Z 17 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
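
For triage, the excerpt above can be reduced to its ERROR entries. The following is a minimal sketch that assumes every line follows the layout seen in this report (timestamp, thread id, then bracketed level, error code and subsystem, then the message). The names `LOG_LINE`, `parse_log_line` and `error_codes` are hypothetical helpers, not part of any MySQL tool.

```python
import re

# Layout assumed from the lines in this report:
#   <timestamp> <thread> [<level>] [MY-<code>] [<subsystem>] <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<thread>\d+)\s+"
    r"\[(?P<level>[^\]]+)\]\s+\[(?P<code>MY-\d+)\]\s+\[(?P<subsystem>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Split one error-log line into named fields; None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def error_codes(lines):
    """Return the codes of ERROR-level entries, in order of appearance."""
    parsed = (parse_log_line(line) for line in lines)
    return [entry["code"] for entry in parsed if entry and entry["level"] == "ERROR"]
```

Feeding the eight lines above to `error_codes` would list the codes of the six ERROR entries and skip the two Note entries.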

How to repeat:
Steps:
1. Start 4 servers on the same host.
2. Set up the GR configuration.
3. Add the 4 nodes to the cluster using mysqlsh.
4. Run a heavy sysbench prepare with 16 tables of 10 million rows each.
5. Wait for some time, or until 2 million rows have been inserted into the 16 tables.
6. Bring down communication of node3 using kill -19 pid.
7. Wait for about 5 minutes.
8. Bring node3 communication back up using kill -18 pid.
9. Try to rejoin this node to the cluster. All nodes move out of Group Replication with the same errors shown under Description above.
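
Steps 6 to 8 can be scripted. The sketch below assumes that the report's kill -19 and kill -18 correspond to SIGSTOP and SIGCONT, as they do on common Linux platforms. The function name `suspend_for` and its `kill` and `sleep` parameters are hypothetical; the parameters exist only so the function can be exercised without touching a real process.

```python
import os
import signal
import time


def suspend_for(pid, seconds, kill=os.kill, sleep=time.sleep):
    """Freeze process `pid` for `seconds`, then let it continue."""
    kill(pid, signal.SIGSTOP)      # step 6: the process stops responding to the group
    try:
        sleep(seconds)             # step 7: keep it frozen (about 300 s in the report)
    finally:
        kill(pid, signal.SIGCONT)  # step 8: resume, even if the wait is interrupted
```

A call such as `suspend_for(node3_pid, 300)` would cover the pause described in the steps, where `node3_pid` stands for the process id of the third mysqld instance.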

detailed steps are added below:

Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 5.7.26 / 8.0.14 changelog:
When adding a new member to a group, if the certification information was too big to transmit, an event was generated that caused failures in all group members. To avoid this situation, now if the certification information is too large an error is generated which makes the joining member leave the group.
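
The behaviour described in the changelog entry amounts to a size check before transmission. The sketch below only illustrates that idea and is not MySQL's implementation: the exception `CertificationInfoTooLarge`, the function `pack_certification_info`, the key=value serialization and the `max_bytes` limit are all hypothetical.

```python
class CertificationInfoTooLarge(Exception):
    """Signals that the joining member should fail instead of the whole group."""


def pack_certification_info(cert_info, max_bytes):
    """Serialize `cert_info` (a hypothetical mapping) or refuse if it is too big."""
    payload = "\n".join(
        f"{key}={value}" for key, value in sorted(cert_info.items())
    ).encode("utf-8")
    if len(payload) > max_bytes:
        # Fail early on the sender side rather than emit an oversized event
        # that every receiving member would then fail to read.
        raise CertificationInfoTooLarge(
            f"certification info is {len(payload)} bytes, limit is {max_bytes}"
        )
    return payload
```

The point of the design is where the failure lands: the error is raised before anything is sent, so only the join attempt is aborted and existing members never see an event they cannot parse.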