Bug #87701: members are exited from GR when 1/3 node tried to rejoin after shutdown under heavy load
Submitted: 8 Sep 2017 6:51    Modified: 14 Dec 2017 15:13
Reporter: Ramana Yeruva       Status: Closed
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 5.7.20               OS: Any
Assigned to:                  CPU Architecture: Any

[8 Sep 2017 6:51] Ramana Yeruva
Description:
I am seeing all nodes go out of GR when one of the three nodes tries to rejoin the group after a shutdown. This happens while heavy transactions are running on node1.

How to repeat:
Initialize the node1 data directory and start the server:
rm -rf data mysql1.err
./mysqld --no-defaults -uroot --basedir=../ --datadir=./data --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=1 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24901" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --initialize-insecure --log-error=$PWD/mysql1.err

./mysqld --no-defaults -uroot --basedir=../ --datadir=./data --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=1 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24901" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --log-error=$PWD/mysql1.err &

Initialize the node2 data directory and start the server:
./mysqld --no-defaults -uroot --basedir=../ --datadir=./data1 --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=2 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24902" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --initialize-insecure --log-error=$PWD/mysql2.err

./mysqld --no-defaults -uroot --basedir=../ --datadir=./data1 --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=2 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24902" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --port=3307 --socket=/tmp/mysql1.sock --log-error=$PWD/mysql2.err &

Initialize the node3 data directory and start the server:
./mysqld --no-defaults -uroot --basedir=../ --datadir=./data2 --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=3 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24903" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --initialize-insecure --log-error=$PWD/mysql3.err

./mysqld --no-defaults -uroot --basedir=../ --datadir=./data2 --gdb --enforce_gtid_consistency=ON --gtid_mode=ON --log_bin=1 --log_slave_updates=ON --master_info_repository=TABLE --relay_log_info_repository=TABLE --transaction_write_set_extraction=XXHASH64 --plugin-load=authentication_pam.so --server-id=3 --binlog_checksum=NONE --binlog_format=ROW --transaction_write_set_extraction=XXHASH64 --loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" --loose-group_replication_start_on_boot=off --loose-group_replication_local_address="127.0.0.1:24903" --loose-group_replication_group_seeds="127.0.0.1:24901,127.0.0.1:24902,127.0.0.1:24903" --loose-group_replication_bootstrap_group=off --loose-group_replication_single_primary_mode=FALSE --gdb --port=3308 --socket=/tmp/mysql2.sock --log-error=$PWD/mysql3.err &

Now set up Group Replication across the three nodes as follows.

Connect to node1 and run the SQL statements below:
cat node1.sql
set password = 'root';
rename user root@localhost to root@'%';
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;
CREATE USER rpl_user@'%';
GRANT REPLICATION SLAVE ON *.* TO rpl_user@'%' IDENTIFIED BY 'rpl_pass';
FLUSH PRIVILEGES;
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';

Connect to node2 and node3 and run the SQL statements below:
cat node2_3.sql
INSTALL PLUGIN group_replication SONAME 'group_replication.so';
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
START GROUP_REPLICATION;

Wait for all nodes to come ONLINE:
mysql> select *from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 2a7c1c58-945e-11e7-93d0-0021f6f9cba6 | vigdis47    |        3306 | ONLINE       |
| group_replication_applier | 2ca7e832-945e-11e7-9629-0021f6f9cba6 | vigdis47    |        3307 | ONLINE       |
| group_replication_applier | 306c9161-945e-11e7-999a-0021f6f9cba6 | vigdis47    |        3308 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

Now run the sysbench read/write test against node1 with the command: source run_gr.sh > run_gr.out 2>&1 &
cat run_gr.sh 
export MYSQL_HOME=/export/home/tmp/qauser/Ramana/5.7.20/mysql-advanced
export TEST_BASEDIR=/export/home/tmp/qauser/Ramana/remote_connectivity_sysbench
export TESTDIR=$TEST_BASEDIR/test_results/5.7/HC
cd $TEST_BASEDIR
mkdir -p $TEST_BASEDIR/test_results/5.7/HC
export LIB=$MYSQL_HOME/lib
export INC=$MYSQL_HOME/include

cd $TEST_BASEDIR/sysbench-0.4.16/sysbench/
rm -rf CMakeCache.txt
cmake ./CMakeLists.txt -DWITH_MYSQL=ON
make
export LD_LIBRARY_PATH=$MYSQL_HOME/lib
cd $TEST_BASEDIR/sysb_tests
$MYSQL_HOME/bin/mysql -uroot -proot --protocol=tcp -e "set @@global.max_prepared_stmt_count=351100;set @@global.max_connections=2000;create database test;"

$TEST_BASEDIR/sysbench-0.4.16/sysbench/sysbench --num-threads=1200 --test=oltp --oltp-table-size=100000 --oltp-dist-type=uniform --oltp-num-tables=16 --mysql-host=localhost --mysql-user=root --mysql-password=root --mysql-port=3306 --mysql-db=test --myisam-max-rows=1000 --mysql-table-engine=innodb --mysql-protocol=tcp --mysql-socket=/tmp/mysql.sock --max-requests=0 --oltp-read-only=off prepare

$TEST_BASEDIR/sysbench-0.4.16/sysbench/sysbench --num-threads=1200 --test=oltp --oltp-table-size=100000 --oltp-dist-type=uniform --oltp-num-tables=16 --mysql-host=localhost --mysql-user=root --mysql-password=root --mysql-port=3306 --mysql-db=test --myisam-max-rows=1000 --mysql-table-engine=innodb --max-time=3600 --mysql-protocol=tcp  --mysql-socket=/tmp/mysql.sock --max-requests=0 --oltp-read-only=off --db-ps-mode=auto run

###setting default values of OPC instance to bring to the original state
$MYSQL_HOME/bin/mysql -uroot -proot --protocol=tcp -hlocalhost -e "set @@global.max_prepared_stmt_count=16382;set @@global.max_connections=214;drop table test.sbtest;"
$MYSQL_HOME/bin/mysql -uroot -proot --protocol=tcp -hlocalhost -e "set @@global.max_prepared_stmt_count=16382;set @@global.max_connections=214;"
cd $TEST_BASEDIR

--> sysbench is attached to the bug report

After around 10 minutes, shut down node3 with the command: ./mysql -uroot -proot --protocol=tcp --port=3308 -e "shutdown;"

Restart node3 with the same parameters as the initial start.

Connect to node3 and try to rejoin the group with the commands below; both commands complete without error:
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
START GROUP_REPLICATION;
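
While node3 attempts to rejoin, its state can be polled with the same membership query used earlier in this report; in this run it first reports RECOVERING and then ERROR, as captured below. A minimal check, nothing beyond the performance_schema table already used above:

-- run on node3 while START GROUP_REPLICATION is in progress
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;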

Observe the following errors on node1 and node2:
node1:
2017-09-08T06:32:20.983759Z 0 [Note] Plugin group_replication reported: 'Group membership changed to vigdis47:3306, vigdis47:3307, vigdis47:3308 on view 15048518166859418:5.'
2017-09-08T06:32:23.561950Z 15 [ERROR] Plugin group_replication reported: 'Unable to convert the event into a packet on the applier! Error: -7
'
2017-09-08T06:32:23.561993Z 15 [ERROR] Plugin group_replication reported: 'Failed to fetch transaction data containing required transaction info for applier'
2017-09-08T06:32:23.646836Z 15 [ERROR] Plugin group_replication reported: 'Error at event handling! Got error: 1'
2017-09-08T06:32:23.742792Z 15 [ERROR] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2017-09-08T06:32:23.743450Z 0 [Note] Plugin group_replication reported: 'getstart group_id 4317e324'
2017-09-08T06:32:23.743747Z 15 [ERROR] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2017-09-08T06:32:23.758365Z 0 [Note] Plugin group_replication reported: 'getstart group_id 4317e324'
2017-09-08T06:32:23.785538Z 15 [Warning] Plugin group_replication reported: 'Due to a plugin error, some transactions can't be certified and will now rollback.'
2017-09-08T06:32:23.958656Z 1079 [ERROR] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is on ERROR state. Check for errors and restart the plugin'
2017-09-08T06:32:23.958702Z 1079 [ERROR] Run function 'before_commit' in plugin 'group_replication' failed
2017-09-08T06:32:23.958841Z 258 [ERROR] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is on ERROR state. Check for errors and restart the plugin'
2017-09-08T06:32:23.958855Z 258 [ERROR] Run function 'before_commit' in plugin 'group_replication' failed
2017-09-08T06:32:23.992540Z 18 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2017-09-08T06:32:24.005914Z 15 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2017-09-08T06:32:27.088537Z 0 [Note] Plugin group_replication reported: 'state 4410 action xa_terminate'
2017-09-08T06:32:27.116637Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-09-08T06:32:27.116685Z 0 [Note] Plugin group_replication reported: 'state 4337 action xa_exit'
2017-09-08T06:32:27.144540Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-09-08T06:32:27.144582Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-09-08T06:32:27.358660Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

node2:
2017-09-08T06:32:23.565665Z 5 [ERROR] Plugin group_replication reported: 'Unable to convert the event into a packet on the applier! Error: -7
'
2017-09-08T06:32:23.565789Z 5 [ERROR] Plugin group_replication reported: 'Failed to fetch transaction data containing required transaction info for applier'
2017-09-08T06:32:23.654912Z 5 [ERROR] Plugin group_replication reported: 'Error at event handling! Got error: 1'
2017-09-08T06:32:23.743976Z 0 [Note] Plugin group_replication reported: 'getstart group_id 4317e324'
2017-09-08T06:32:23.757110Z 5 [ERROR] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2017-09-08T06:32:23.757792Z 0 [Note] Plugin group_replication reported: 'getstart group_id 4317e324'
2017-09-08T06:32:23.758130Z 5 [ERROR] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2017-09-08T06:32:23.838368Z 8 [Note] Slave SQL thread for channel 'group_replication_applier' exiting, replication stopped in log 'FIRST' at position 823
2017-09-08T06:32:23.840488Z 5 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
2017-09-08T06:32:27.089093Z 0 [Note] Plugin group_replication reported: 'state 4410 action xa_terminate'
2017-09-08T06:32:27.117093Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-09-08T06:32:27.117144Z 0 [Note] Plugin group_replication reported: 'state 4337 action xa_exit'
2017-09-08T06:32:27.144967Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-09-08T06:32:27.145013Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-09-08T06:32:27.352141Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

node1:
mysql> select *from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 2a7c1c58-945e-11e7-93d0-0021f6f9cba6 | vigdis47    |        3306 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)

node2:
mysql> select *from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 2ca7e832-945e-11e7-9629-0021f6f9cba6 | vigdis47    |        3307 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)

node3:
mysql> select *from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 306c9161-945e-11e7-999a-0021f6f9cba6 | vigdis47    |        3308 | RECOVERING   |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> select *from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 306c9161-945e-11e7-999a-0021f6f9cba6 | vigdis47    |        3308 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
[14 Dec 2017 15:13] David Moss
Posted by developer:
 
Thank you for your feedback. This has been fixed in upcoming versions, and the following was added to the 5.7.21 / 8.0.4 changelog:
In a group with heavy load, joining members could need to retrieve a large amount of data to gain synchrony with the group. If the amount of data retrieved exceeded the View_change packet size of 4MB, the members would fail to join the group and enter ERROR state. Now the packet size is taken from slave_max_allowed_packet, which defaults to 1GB. Depending on the load your group processes, you might want to increase the packet size further by configuring slave_max_allowed_packet.
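
As a usage note (not part of the original report): with the fix, the joining member's limit follows slave_max_allowed_packet, a dynamic global system variable, so it can be checked and adjusted on the joining member before it rejoins. A minimal sketch; the value shown is only illustrative:

-- on the joining member, before rejoining the group
SHOW GLOBAL VARIABLES LIKE 'slave_max_allowed_packet';
SET GLOBAL slave_max_allowed_packet = 1073741824;  -- 1GB; adjust to your workload
START GROUP_REPLICATION;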
[1 Feb 2018 14:02] David Moss
Now published.
[12 Apr 2018 5:52] Erlend Dahl
Bug#90046 "suspect error code in Pipeline_event::convert_log_event_to_packet" was marked as a duplicate.