Bug #86979 | group replication master automatically quits | | |
---|---|---|---|
Submitted: | 7 Jul 2017 7:11 | Modified: | 11 Apr 2018 16:07 |
Reporter: | lu kong | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical) |
Version: | mysql5.7.18 | OS: | Ubuntu |
Assigned to: | CPU Architecture: | Any | |
Tags: | group replication |
[7 Jul 2017 7:11]
lu kong
[11 Jul 2017 7:33]
MySQL Verification Team
Hello lu kong, Thank you for the report and feedback. I'm not able to reproduce the issue reported here at my end with default settings (3-node cluster, 1 primary). Could you please provide the exact configuration used in your environment, along with any test case/steps (other than what was provided in the initial report) that triggered this issue at your end? That would be greatly appreciated. Thanks, Umesh
[11 Jul 2017 8:03]
lu kong
1. Cluster environment

10.46.178.177  centos6.5    centos177
10.46.178.152  ubuntu14.04  dhss-ProLiant-DL380-Gen9
10.46.178.140  redhat7.2    redhat140
10.46.178.17   redhat6.5

2. The cluster configuration

[general]
instance_num=1

# The xpecd server
[mysqld]
key_buffer_size = 8M
max_allowed_packet = 4M
table_open_cache = 2000
sort_buffer_size = 256K
read_buffer_size = 128K
read_rnd_buffer_size = 256K
thread_cache_size = 0
query_cache_type=0
query_cache_size = 0
myisam_sort_buffer_size = 8M
innodb_buffer_pool_size = 600M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb-open-files = 300
innodb_log_file_size=100M
innodb_log_files_in_group=2
skip-external-locking
explicit_defaults_for_timestamp
innodb-file-per-table
skip-show-database = 1
max_connect_errors = 100000
innodb_data_file_path=ibdata1:500M:autoextend
innodb_flush_method = O_DIRECT

[mysqld1]
bind_address=10.46.178.152
port=5518
socket=/home/rdb/bin/mysql1.sock
datadir=/home/rdb/data/data
innodb_data_home_dir=/home/rdb/data/data
innodb_log_group_home_dir=/home/rdb/data/redo
log-error=/home/rdb/log/mysqld1.log
pid-file=/home/rdb/bin/mysqld1.pid
secure_file_priv=
show_compatibility_56 = ON

# replication options
server-id=1
log-bin=binlog
#relay-log=../relaylog/relay-bin
#max_binlog_size=10485760
binlog_format=ROW
log-slave-updates=ON
#slave-skip-errors=1007,1008,1050,1060,1061,1062,1068

# semi_sync options on master-host
#rpl_semi_sync_master_enabled = OFF
#rpl_semi_sync_master_timeout = 10000
#rpl_semi_sync_master_trace_level = 64

# semi_sync options on slave-host
#rpl_semi_sync_slave_enabled = OFF
#rpl_semi_sync_slave_trace_level = 32

gtid_mode=ON
enforce_gtid_consistency=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
binlog_checksum=NONE
#slave_parallel_type=LOGICAL_CLOCK
#slave_parallel_workers=16
#relay_log_recovery=ON

# instructs the server that for each transaction it has to collect the write set
# and encode it as a hash using the XXHASH64 hashing algorithm.
transaction_write_set_extraction=XXHASH64
# tells the plugin that the group that it is joining, or creating, is named
# "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa".
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
# whitelist
#group_replication_ip_whitelist=127.0.0.1/16
# instructs the plugin to not start operations automatically when the server starts.
loose-group_replication_start_on_boot=off
# tells the plugin to use the IP address 10.46.178.152 and port 24903 for incoming
# connections from other members in the group.
loose-group_replication_local_address= "10.46.178.152:24903"
# tells the plugin that the following members on those hosts and ports should be
# contacted in case it needs to join the group.
loose-group_replication_group_seeds= "10.46.178.177:24901,10.46.178.140:24902,10.46.178.152:24903,10.46.178.17:24904"
# instructs the plugin whether to bootstrap the group or not.
loose-group_replication_bootstrap_group= off

# The following options will be passed to all ec clients
[client]
#password = your_password
port=5518
socket=/home/rdb/bin/mysql1.sock

[mysql]
no-auto-rehash

3. Test case

(1) START GROUP_REPLICATION;

mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | ONLINE       |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(2) kill -9 the redhat140 and centos177 nodes at the same time:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | UNREACHABLE  |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | UNREACHABLE  |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(3) Start the redhat140 and centos177 nodes, and the primary node dhss-ProLiant-DL380-Gen9 (10.46.178.152) will crash. The crash may also occur at the moment the redhat140 and centos177 nodes are killed with kill -9.
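For context only (this query is not part of the original report): since the group runs in the default single-primary mode, the current primary can be confirmed from any member with the standard 5.7 status variable below. On this cluster it should return the MEMBER_ID of dhss-ProLiant-DL380-Gen9 (3cc750e9-5ae6-11e7-a510-e0071bf246f4), the member the reporter identifies as the primary.

mysql> SHOW STATUS LIKE 'group_replication_primary_member';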
[11 Jul 2017 11:03]
MySQL Verification Team
Thank you for providing the requested details. Even with the same configuration and steps, I'm still not seeing the issue at my end. Could you please confirm whether there was any load on the primary when slave1/slave2 were killed and restarted? Also, when you say "start the redhat140 and centos177 node", do you mean bringing up the MySQL instances and also running "START GROUP_REPLICATION;", or just bringing up the MySQL server instances? Thank you!
[12 Jul 2017 1:34]
lu kong
The detailed steps are as follows:

(1) START GROUP_REPLICATION;

mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.04 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | ONLINE       |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | ONLINE       |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(2) Create a database and table, and insert 40,000 rows:

create database per character set utf8;

create table per.TBL_TRANSACTION_TRACE(
    TRACE_TYPE varchar(2) not null default '1',
    SERIAL_NO int primary key,
    REF_SERIAL_NO varchar(36) default 'abcdefg',
    CORRELATION_ID int,
    IS_ONE_WAY int default 1,
    RESPONSE_FLAG char(2) default '1',
    REQUEST_ID varchar(10) not null,
    REQUEST_ALIAS varchar(30) not null default 'abcdefg',
    CHANNEL_ID varchar(10) default 'abcdefg',
    CHANNEL_SERIAL_NO varchar(36) default 'abcdefg',
    PROVIDER_ID varchar(10) not null,
    PROVIDER_ALIAS varchar(30) not null default 'abcdefg',
    TARGET_SERVICE varchar(130) not null default 'abcdefg',
    VERSION varchar(10) not null default '1.0',
    OPERATION_NAME varchar(60) default 'abcdefg',
    REQ_QUEUE_MANAGER varchar(48) not null default 'abcdefg',
    REQ_QUEUE_NAME varchar(48) not null default 'abcdefg',
    REQ_REPLYTOQ varchar(48) not null default 'abcdefg',
    REQ_BROKER_NAME varchar(10) not null default 'abcdefg',
    REQ_MSG_ID varchar(48) not null default 'abcdefg',
    REQ_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    TIMEOUT_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    REQ_SERVICE_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    REQ_DB_TIMESTAMP TIMESTAMP not null DEFAULT CURRENT_TIMESTAMP,
    RSP_OPERATION_NAME varchar(60) default 'abcdefg',
    RESPONSE_TYPE char(2) default '1',
    RESPONSE_CODE varchar(6) default 'abc',
    RSP_QUEUE_MANAGER varchar(48) default 'abcdefg',
    RSP_QUEUE_NAME varchar(48) default 'abcdefg',
    RSP_BROKER_NAME varchar(10) default 'abcdefg',
    RSP_MSG_ID varchar(48) default 'abcdefg',
    RSP_SERVICE_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    RSP_DB_TIMESTAMP DATETIME,
    TFR_DB_TIMESTAMP varchar(14) default 'abcdefg',
    REQ_MESSAGE_CONTENT BLOB(16384),
    RSP_MESSAGE_CONTENT BLOB(16384)
);

create index TBL_TRANSACTION_TRACE_index1 on per.TBL_TRANSACTION_TRACE(CORRELATION_ID);
create index TBL_TRANSACTION_TRACE_index2 on per.TBL_TRANSACTION_TRACE(RSP_DB_TIMESTAMP);

(3) After all rows have been inserted into the table, kill -9 the redhat140 and centos177 nodes at the same time:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST              | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
| group_replication_applier | 3cc750e9-5ae6-11e7-a510-e0071bf246f4 | dhss-ProLiant-DL380-Gen9 |        5518 | ONLINE       |
| group_replication_applier | 659742df-617b-11e7-a882-e0071bf256bc | redhat140                |        5518 | UNREACHABLE  |
| group_replication_applier | af5eaa0c-5aa3-11e7-8a7b-e0071bf38438 | centos177                |        5518 | UNREACHABLE  |
+---------------------------+--------------------------------------+--------------------------+-------------+--------------+
3 rows in set (0.00 sec)

(4) Start mysql on the redhat140 and centos177 nodes and execute "set global group_replication_allow_local_disjoint_gtids_join=ON; START GROUP_REPLICATION;". The primary node dhss-ProLiant-DL380-Gen9 (10.46.178.152) will crash. I'm not sure whether the crash happens when the redhat140 and centos177 nodes are killed with kill -9 or when GROUP_REPLICATION is started on the two nodes.

Notes: The crash has happened three times. The probability of it happening is about 20%.
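The report does not include the statements used to load the 40,000 rows. A minimal sketch that would produce an equivalent write load is below; the procedure name fill_trace is hypothetical (not from the report), and only the NOT NULL columns without defaults are supplied explicitly.

-- Hypothetical load generator; fill_trace and its values are illustrative only.
DELIMITER //
CREATE PROCEDURE per.fill_trace(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    -- SERIAL_NO, REQUEST_ID, and PROVIDER_ID are the only columns
    -- without defaults; everything else falls back to its DEFAULT.
    INSERT INTO per.TBL_TRANSACTION_TRACE (SERIAL_NO, REQUEST_ID, PROVIDER_ID)
    VALUES (i, 'req', 'prov');
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL per.fill_trace(40000);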
[14 Jul 2017 10:36]
MySQL Verification Team
Thank you for the details. Discussed internally with Alfranio, and concluded that this is a duplicate of the existing internal Bug 26236857 - LRU_GET: ASSERTION `!IS_BUSY_MACHINE(&RETVAL->PAX)' FAILED. Thanks, Umesh
[11 Apr 2018 16:07]
Erlend Dahl
Fixed in 5.7.22 under the heading of Bug#26236857 LRU_GET: ASSERTION `!IS_BUSY_MACHINE(&RETVAL->PAX)' FAILED
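To confirm that a given server carries the fix, check that it reports version 5.7.22 or later:

mysql> SELECT VERSION();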