Bug #85667 GR node is in RECOVERING state if binlog_checksum configured on running server
Submitted: 28 Mar 2017 9:47
Modified: 28 Jun 2017 8:55
Reporter: Ramana Yeruva
Status: Closed
Category: MySQL Server: Group Replication
Severity: S3 (Non-critical)
Version: 5.7.18
OS: Any
Assigned to:
CPU Architecture: Any

[28 Mar 2017 9:47] Ramana Yeruva
Description:
When a node in a GR setup has binlog_checksum='NONE' set on a running server, START GROUP_REPLICATION returns success but the node stays in RECOVERING state forever. Server shutdown also hangs at this point; the only option is to kill the server.

How to repeat:
initialize database:
./mysqld --no-defaults -uroot --basedir=../ --datadir=./data --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=1 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24901" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --loose-group_replication_enforce_update_everywhere_checks=TRUE --initialize-insecure &

start server:
./mysqld --no-defaults -uroot --basedir=../ --datadir=./data --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=1 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24901" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --loose-group_replication_enforce_update_everywhere_checks=TRUE &

connect to the server using the mysql client and run the commands below, which are essential for setting up a node in GR:
SET SQL_LOG_BIN=0;
CREATE USER rpl_user@'%';
GRANT REPLICATION SLAVE ON *.* TO rpl_user@'%' IDENTIFIED BY 'rpl_pass';
FLUSH PRIVILEGES;
SET SQL_LOG_BIN=1;
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;

observe the node in ONLINE state:
mysql> select * from performance_schema.replication_group_members\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
   MEMBER_ID: 004c6463-1399-11e7-91f4-b86b23aa36b6
 MEMBER_HOST: ramanay
 MEMBER_PORT: 3306
MEMBER_STATE: ONLINE
1 row in set (0.00 sec)

execute shutdown command from mysql client
mysql> shutdown;
Query OK, 0 rows affected (0.00 sec)

restart the server with the same settings as before, but without the --binlog_checksum=NONE option
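
Before going further, the effective checksum setting can be confirmed from the mysql client. This is only a sanity check and not part of the original report; it assumes the server was restarted without any binlog_checksum option, in which case a 5.7 server should report its default, CRC32.

```sql
-- Show the checksum algorithm currently used for binary log events.
-- Assumption: no binlog_checksum option was given at startup.
SELECT @@GLOBAL.binlog_checksum;
```

If the query already returns NONE, the value was set at startup, which is the configuration that came ONLINE in the first run above.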

now try to start group replication again with the commands below:
mysql> SET GLOBAL group_replication_bootstrap_group=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> set @@global.binlog_checksum='NONE';
Query OK, 0 rows affected (0.04 sec)

mysql> START GROUP_REPLICATION;   <-- this command succeeds
Query OK, 0 rows affected (1.08 sec)
--> after the above statement is executed, the following error appears in the server log:
2017-03-28T09:30:53.583757Z 6 [ERROR] Slave I/O for channel 'group_replication_applier': Replication event checksum verification failed while reading from network. Error_code: 1743

mysql> SET GLOBAL group_replication_bootstrap_group=OFF;
Query OK, 0 rows affected (0.00 sec)

--> observe the node status in the performance_schema table:
mysql> select * from performance_schema.replication_group_members\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
   MEMBER_ID: 004c6463-1399-11e7-91f4-b86b23aa36b6
 MEMBER_HOST: ramanay
 MEMBER_PORT: 3306
MEMBER_STATE: RECOVERING   <-- stays in RECOVERING state forever; it should have come ONLINE
1 row in set (0.00 sec)
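
The checksum error in the server log is reported for the 'group_replication_applier' channel, so the channel status can also be inspected from the client. The query below is a sketch of that check and was not run as part of this report; whether the error is surfaced in this table for this particular failure is an assumption.

```sql
-- Inspect the connection status of the Group Replication applier channel.
-- Assumption: the failure seen in the error log is also recorded here.
SELECT CHANNEL_NAME,
       SERVICE_STATE,
       LAST_ERROR_NUMBER,
       LAST_ERROR_MESSAGE
  FROM performance_schema.replication_connection_status
 WHERE CHANNEL_NAME = 'group_replication_applier'\G
```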

now try to shut down the server from the mysql client: the shutdown command returns success, but mysqld does not shut down
call stack attached below
[28 Jun 2017 8:55] Margaret Fisher
Posted by developer:
 
Added to changelogs for MySQL 5.7 and 8.0:
Replication: When binlog_checksum=NONE was set on a MySQL server after startup, and then Group Replication was started, if an error occurred, the server remained in RECOVERING state and could not be shut down. (Bug #25793366)