Bug #86979 group replication master automatically quits
Submitted: 7 Jul 2017 7:11 Modified: 11 Apr 2018 16:07
Reporter: lu kong
Status: Closed
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: MySQL 5.7.18    OS: Ubuntu
Assigned to:    CPU Architecture: Any
Tags: group replication

[7 Jul 2017 7:11] lu kong
Description:
In a 3-node group replication environment, I killed two slave processes with kill -9 at the same time; about 20 seconds later I started one slave, and as a result the master quit automatically.

It happens intermittently.
The mysql log is as follows:
2017-07-07T06:41:17.649691Z 11 [Note] Aborted connection 11 to db: 'performance_schema' user: 'ebaseomm' host: 'localhost' (Got an error reading communication packets)
2017-07-07T06:46:16.117121Z 3459 [Note] Aborted connection 3459 to db: 'performance_schema' user: 'ebaseomm' host: 'localhost' (Got an error reading communication packets)
2017-07-07T06:47:44.188960Z 367 [Note] Stop asynchronous binlog_dump to slave (server_id: 11)
2017-07-07T06:47:44.189073Z 367 [Note] Aborted connection 367 to db: 'unconnected' user: 'rpl_user' host: 'redhat140' (failed on flush_net())
xpecd: /home/rdb/rdb/rapid/plugin/group_replication/libmysqlgcs/src/bindings/xcom/xcom/xcom_cache.c:154: lru_get: Assertion `!is_busy_machine(&retval->pax)' failed.
06:48:22 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=6
max_threads=151
thread_count=5
connection_count=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68190 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/rdb/bin/xpecd(my_print_stacktrace+0x2c)[0xebee9c]
/home/rdb/bin/xpecd(handle_fatal_signal+0x451)[0x7a5951]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f683076e340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7f682fbaef79]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f682fbb2388]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fe36)[0x7f682fba7e36]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fee2)[0x7f682fba7ee2]
/home/rdb/lib/plugin/group_replication.so(+0x86d3b)[0x7f67b9b42d3b]
/home/rdb/lib/plugin/group_replication.so(dispatch_op+0x161f)[0x7f67b9b3ed8f]
/home/rdb/lib/plugin/group_replication.so(acceptor_learner_task+0x440)[0x7f67b9b40200]
/home/rdb/lib/plugin/group_replication.so(task_loop+0x50)[0x7f67b9b4f8e0]
/home/rdb/lib/plugin/group_replication.so(xcom_taskmain2+0xd4)[0x7f67b9b38584]
/home/rdb/lib/plugin/group_replication.so(_ZN19Gcs_xcom_proxy_impl9xcom_initEt+0x21)[0x7f67b9b77911]
/home/rdb/lib/plugin/group_replication.so(+0xa5b0e)[0x7f67b9b61b0e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f6830766182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f682fc7330d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

How to repeat:
In a 3-node group replication environment, kill two slave processes with kill -9 at the same time; after 20 seconds, start one slave.
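The kill/restart sequence can be sketched in shell. This is a minimal sketch: passwordless ssh to the two secondary hosts and a process named mysqld are assumptions not stated in the report, so the destructive part is shown only as comments.

```shell
# Build the remote kill command for one host; kept as a function so the
# string can be inspected before anything destructive runs. The ssh access
# and the process name "mysqld" are assumptions, not from the report.
kill_cmd() {
  printf "ssh %s 'kill -9 \$(pgrep -x mysqld)'" "$1"
}

# The live sequence (not executed here) would be:
#   eval "$(kill_cmd redhat140)" &    # kill both secondaries
#   eval "$(kill_cmd centos177)" &    # as close to simultaneous as possible
#   wait
#   sleep 20                          # then restart one secondary
```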
[11 Jul 2017 7:33] Umesh Shastry
Hello lu kong,

Thank you for the report and feedback.
I'm not able to reproduce the issue reported here at my end with default settings (3-node cluster, 1 primary). Could you please provide the exact configuration used in your environment, along with any test case/steps (other than what was provided in the initial report) that triggered this issue at your end? Thanks...

Thanks,
Umesh
[11 Jul 2017 8:03] lu kong
1. Cluster environment
10.46.178.177  centos6.5   centos177
10.46.178.152  ubuntu 14.04  dhss-ProLiant-DL380-Gen9
10.46.178.140  redhat7.2  redhat140
10.46.178.17   redhat6.5
2. The cluster configuration
[general]
instance_num=1

# The xpecd server 
[mysqld]

key_buffer_size = 8M
max_allowed_packet = 4M
table_open_cache = 2000
sort_buffer_size = 256K
read_buffer_size = 128K
read_rnd_buffer_size = 256K
thread_cache_size = 0
query_cache_type=0
query_cache_size = 0
myisam_sort_buffer_size = 8M

innodb_buffer_pool_size = 600M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb-open-files = 300
innodb_log_file_size=100M
innodb_log_files_in_group=2
skip-external-locking
explicit_defaults_for_timestamp
innodb-file-per-table
skip-show-database = 1
max_connect_errors = 100000
innodb_data_file_path=ibdata1:500M:autoextend
innodb_flush_method = O_DIRECT
[mysqld1]
bind_address=10.46.178.152
port=5518
socket=/home/rdb/bin/mysql1.sock
datadir=/home/rdb/data/data
innodb_data_home_dir=/home/rdb/data/data
innodb_log_group_home_dir=/home/rdb/data/redo
log-error=/home/rdb/log/mysqld1.log
pid-file=/home/rdb/bin/mysqld1.pid

secure_file_priv=
show_compatibility_56 = ON

#replication options
server-id=1
log-bin=binlog
#relay-log=../relaylog/relay-bin
#max_binlog_size=10485760
binlog_format=ROW
log-slave-updates=ON
#slave-skip-errors=1007,1008,1050,1060,1061,1062,1068

#semi_sync options on master-host
#rpl_semi_sync_master_enabled = OFF
#rpl_semi_sync_master_timeout = 10000
#rpl_semi_sync_master_trace_level = 64

#semi_sync option on slave-host
#rpl_semi_sync_slave_enabled = OFF
#rpl_semi_sync_slave_trace_level = 32

gtid_mode=ON
enforce_gtid_consistency=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
binlog_checksum=NONE
#slave_parallel_type=LOGICAL_CLOCK
#slave_parallel_workers=16
#relay_log_recovery=ON

#instructs the server that for each transaction it has to collect the write set and encode it as a hash using the XXHASH64 hashing algorithm.
transaction_write_set_extraction=XXHASH64

#tells the plugin that the group that it is joining, or creating, is named "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa".
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"

#whitelist
#group_replication_ip_whitelist=127.0.0.1/16

#instructs the plugin to not start operations automatically when the server starts.
loose-group_replication_start_on_boot=off

#tells the plugin to use the IP address 10.46.178.152 and port 24903 for incoming connections from other members in the group.
loose-group_replication_local_address= "10.46.178.152:24903"

#tells the plugin that the following members on those hosts and ports should be contacted in case it needs to join the group.
loose-group_replication_group_seeds= "10.46.178.177:24901,10.46.178.140:24902,10.46.178.152:24903,10.46.178.17:24904"

#instructs the plugin whether to bootstrap the group or not.
loose-group_replication_bootstrap_group= off

# The following options will be passed to all ec clients 
[client]
#password       = your_password 
port=5518
socket=/home/rdb/bin/mysql1.sock

[mysql]
no-auto-rehash

3. Test case
(1)START GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | ONLINE       |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)
(2) Kill the redhat140 and centos177 nodes with kill -9 at the same time:
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | UNREACHABLE  |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | UNREACHABLE  |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(3) Start the redhat140 and centos177 nodes, and the primary node dhss-ProLiant-DL380-Gen9 (10.46.178.152) will crash. It may also crash at the moment the redhat140 and centos177 nodes are killed with kill -9.
[11 Jul 2017 11:03] Umesh Shastry
Thank you for providing requested details.
Even with the same configuration and steps, I'm still not seeing the issue at my end. Could you please confirm whether there was any load on the primary when slave1/slave2 were killed and restarted? Also, when you say "start the redhat140 and centos177 node", do you mean bringing up the MySQL instances and also running "START GROUP_REPLICATION;", or just bringing up the MySQL server instances?

Thank you!
[12 Jul 2017 1:34] lu kong
The detailed steps are as follows :
(1)START GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | ONLINE       |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)
(2) Create a database and table, and insert 40,000 rows:
create database per character set utf8;
create table per.TBL_TRANSACTION_TRACE(
TRACE_TYPE varchar(2) not null default '1', 
SERIAL_NO int primary key, 
REF_SERIAL_NO varchar(36) default 'abcdefg', 
CORRELATION_ID int ,  
IS_ONE_WAY int default 1, 
RESPONSE_FLAG char(2) default '1', 
REQUEST_ID varchar(10) not null, 
REQUEST_ALIAS varchar(30) not null default 'abcdefg', 
CHANNEL_ID varchar(10) default 'abcdefg', 
CHANNEL_SERIAL_NO varchar(36) default 'abcdefg', 
PROVIDER_ID varchar(10) not null , 
PROVIDER_ALIAS varchar(30) not null default 'abcdefg', 
TARGET_SERVICE varchar(130) not null default 'abcdefg', 
VERSION varchar(10) not null default '1.0', 
OPERATION_NAME varchar(60) default 'abcdefg', 
REQ_QUEUE_MANAGER varchar(48) not null default 'abcdefg', 
REQ_QUEUE_NAME varchar(48) not null default 'abcdefg', 
REQ_REPLYTOQ varchar(48) not null default 'abcdefg', 
REQ_BROKER_NAME varchar(10) not null default 'abcdefg', 
REQ_MSG_ID varchar(48) not null default 'abcdefg', 
REQ_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
TIMEOUT_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
REQ_SERVICE_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
REQ_DB_TIMESTAMP TIMESTAMP not null DEFAULT CURRENT_TIMESTAMP, 
RSP_OPERATION_NAME varchar(60) default 'abcdefg', 
RESPONSE_TYPE char(2) default '1', 
RESPONSE_CODE varchar(6) default 'abc', 
RSP_QUEUE_MANAGER varchar(48) default 'abcdefg', 
RSP_QUEUE_NAME varchar(48) default 'abcdefg', 
RSP_BROKER_NAME varchar(10) default 'abcdefg', 
RSP_MSG_ID varchar(48) default 'abcdefg', 
RSP_SERVICE_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
RSP_DB_TIMESTAMP DATETIME, 
TFR_DB_TIMESTAMP varchar(14) default 'abcdefg', 
REQ_MESSAGE_CONTENT BLOB(16384), 
RSP_MESSAGE_CONTENT BLOB(16384)) ;
create index TBL_TRANSACTION_TRACE_index1 on per.TBL_TRANSACTION_TRACE(CORRELATION_ID);
create index TBL_TRANSACTION_TRACE_index2 on per.TBL_TRANSACTION_TRACE(RSP_DB_TIMESTAMP);
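The report does not show the INSERT statements used for the 40,000-row load, so the following is a hypothetical generator. The column subset (the primary key plus the NOT NULL columns without defaults) follows the DDL above; the mysql client invocation and credentials are assumptions.

```shell
# Hypothetical load generator for step (2): prints N INSERT statements for
# per.TBL_TRANSACTION_TRACE, covering only SERIAL_NO (the primary key) and
# the NOT NULL columns that have no default (REQUEST_ID, PROVIDER_ID).
# Values are made up for illustration.
gen_inserts() {
  i=1
  while [ "$i" -le "$1" ]; do
    printf "INSERT INTO per.TBL_TRANSACTION_TRACE (SERIAL_NO, REQUEST_ID, PROVIDER_ID) VALUES (%d, 'req%d', 'prov%d');\n" "$i" "$i" "$i"
    i=$((i + 1))
  done
}

# Usage against the primary (socket path from the [mysqld1] section above;
# the user and lack of password are assumptions):
#   gen_inserts 40000 | mysql --socket=/home/rdb/bin/mysql1.sock -u root
```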
(3) After all rows have been inserted into the table, kill the redhat140 and centos177 nodes with kill -9 at the same time:
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | UNREACHABLE  |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | UNREACHABLE  |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(4) Start MySQL on the redhat140 and centos177 nodes and execute "SET GLOBAL group_replication_allow_local_disjoint_gtids_join=ON; START GROUP_REPLICATION;"; the primary node dhss-ProLiant-DL380-Gen9 (10.46.178.152) will then crash. I'm not sure whether the crash happens when the redhat140 and centos177 nodes are killed with kill -9 or when group replication is started on the two nodes.

Note: the crash has happened three times. The probability of it happening is about 20%.
[14 Jul 2017 10:36] Umesh Shastry
Thank you for the details.
Discussed internally with Alfranio, and concluded that it is a duplicate of the existing internal Bug 26236857 - LRU_GET: ASSERTION `!IS_BUSY_MACHINE(&RETVAL->PAX)' FAILED

Thanks,
Umesh
[11 Apr 2018 16:07] Erlend Dahl
Fixed in 5.7.22 under the heading of

Bug#26236857 LRU_GET: ASSERTION `!IS_BUSY_MACHINE(&RETVAL->PAX)' FAILED