Bug #92450 | InnoDB Cluster replication crash | |
---|---|---|---
Submitted: | 17 Sep 2018 6:06 | Modified: | 9 Oct 2018 3:00
Reporter: | wangkun wang | Email Updates: |
Status: | Closed | Impact on me: |
Category: | MySQL Server: Replication | Severity: | S5 (Performance)
Version: | 8.0.12 | OS: | CentOS
Assigned to: | | CPU Architecture: | x86
Tags: | innodb cluster, source command, import data |
[17 Sep 2018 6:06]
wangkun wang
[24 Sep 2018 16:38]
MySQL Verification Team
Hi,

I need more data to reproduce this. Your config? The data you are loading?

What I tried:
- created a 3-node cluster on CentOS 7
- imported 5G of data, which worked OK

thanks
Bogdan
[26 Sep 2018 2:02]
wangkun wang
Did you use the config arguments shown in the error log? When I use this config and check the Group Replication status, I can see the group go from three members to two: one MySQL instance is down and its error log is as described. My config is as follows.

```
[mysqld]
server_id=1001
port=33060
socket=/var/lib/mysql/mysql.sock
pid-file=/var/run/mysqld/mysqld.pid
datadir=/usr/local/mysql/data
tmpdir=/usr/local/mysql/tmp
lower_case_table_names=1
user=mysql
enforce_gtid_consistency=ON
gtid_mode=ON
binlog_checksum=none
default_authentication_plugin=mysql_native_password
skip-name-resolve=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
log-slave-updates=1
table_open_cache=2048
open_files_limit=1024
max_connections=500
max_connect_errors=10
wait_timeout=28800
interactive_timeout=28800
sql_mode="STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
log_timestamps=SYSTEM
default-time-zone='+8:00'
explicit_defaults_for_timestamp=on
character-set-server=utf8
default-storage-engine=InnoDB
transaction_isolation=READ-COMMITTED
skip-external-locking

#innodb
innodb_data_file_path=ibdata1:1024M:autoextend
innodb_buffer_pool_size=4096M
innodb_flush_log_at_trx_commit=1
innodb_io_capacity=600
innodb_lock_wait_timeout=120
innodb_log_buffer_size=16M
innodb_log_file_size=256M
innodb_log_files_in_group=3
innodb_max_dirty_pages_pct=85
innodb_read_io_threads=8
innodb_write_io_threads=8
innodb_thread_concurrency=32
innodb_file_per_table=1
innodb_rollback_on_timeout=on
innodb_flush_method=O_DIRECT

#session
group_concat_max_len=102400
max_allowed_packet=128M
sort_buffer_size=8M
join_buffer_size=8M
key_buffer_size=8M
read_buffer_size=8M
read_rnd_buffer_size=8M
bulk_insert_buffer_size=32M
thread_cache_size=16
tmp_table_size=32M
max_heap_table_size=32M
tmp_table_size=32M

#log
log_slave_updates=on
sync_binlog=1
binlog_format=row
log-bin-trust-function-creators=1
binlog_cache_size=64M
max_binlog_cache_size=512M
max_binlog_size=1024M
log-bin=/usr/local/mysql/binlog/mysqlbin
log_bin_index=/usr/local/mysql/binlog/mysqlbin.index
long_query_time=0.4
log_slow_admin_statements=ON
slow_query_log=1
slow_query_log_file=/usr/local/mysql/log/slow.log
log-error=/usr/local/mysql/log/error.log
log_queries_not_using_indexes=ON
log_throttle_queries_not_using_indexes=10
log_output=FILE,TABLE
master_info_file=/usr/local/mysql/binlog/master.info
binlog_expire_logs_seconds=604800
back_log=512

#mgrset
mysqlx_port=33101
mysqlx_socket=/var/lib/mysql/mysqlx.sock
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
loose-group_replication_start_on_boot=off
loose-group_replication_local_address="mysql1:33101"
loose-group_replication_group_seeds="mysql1:33101,mysql2:33101,mysql3:33101"
loose-group_replication_bootstrap_group=off
loose-group_replication_ip_whitelist="192.168.66.0/24"
loose-group_replication_auto_increment_increment=1
loose-group_replication_single_primary_mode=on

[mysqldump]
quick
```
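The drop from three members to two can be observed with the standard Group Replication membership table; this query is illustrative and not taken from the original report:

```
-- A healthy 3-node group shows three ONLINE rows; after the incident
-- one member disappears from the others' view (or shows ERROR locally).
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;
```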
[26 Sep 2018 2:05]
wangkun wang
I am sorry, I cannot give you the data. But I have a 5.8 GB MySQL dump file, and one table has 4,000,000 rows. When I load the dump file with the source command, the problem shows up again.
[26 Sep 2018 2:10]
MySQL Verification Team
Hi,

> I am sorry, I cannot give you the data. But I have a 5.8 GB MySQL dump file,
> and one table has 4,000,000 rows. When I load the dump file with the source
> command, the problem shows up again.

You start with blank servers in replication, you load that dump into the master, and you get this crash? Can you try loading only that table with 4M rows and nothing else?

kind regards
Bogdan
[26 Sep 2018 3:58]
wangkun wang
Thanks for your reply! Yes, I did this as described in my first report.

But I do not understand the second question. Do you mean that I should run the source command with a 4M dump file, or that I should change the config values to 4M? If you mean using a 4M dump file: that works fine, it does not produce the error log. If you mean changing the config file from 8M to 4M: in my opinion, changing 8M to 4M will not fix this problem.

The max_threads value follows the max_connections argument in the config file; in my test, max_threads changed as I changed max_connections. I do not think that is right, because max_connections is a different thing from max_threads. Of course, maybe I am wrong about this. But while pre-allocating caches for max_connections may be correct, letting that exhaust the memory is wrong.
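As a rough illustration of the per-connection buffer concern raised here (not part of the original report), the theoretical worst case can be estimated directly from the server variables:

```
-- Upper bound if every one of max_connections sessions allocated its
-- four main per-session buffers simultaneously; actual usage is normally
-- much lower, since these buffers are allocated on demand.
SELECT @@max_connections *
       (@@sort_buffer_size + @@join_buffer_size +
        @@read_buffer_size + @@read_rnd_buffer_size) / POW(1024, 3)
       AS worst_case_session_buffers_gb;
```

With the config above (500 connections and four 8M buffers) this bound is about 15.6 GB on top of the 4 GB buffer pool, which is consistent with the resource-exhaustion explanation given later in the thread.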
[26 Sep 2018 5:04]
MySQL Verification Team
Hi, > But,I do not understand the second question.Do you mean I use the source > command with a 4M dump file or change the config to 4M. > If you mean use the 4M dump file,it ok,this not show the error log. Yes, I meant you can try loading one table at a time, to see when the crash will happen. Will it happen if you load that one huge table only, or you need a whole dump to crash it. Does it always crash on the same place or not... So, no need to change anything in config, just loading different data to see what might be the reason for the crash, as I run a 3 node innodb cluster and load 100G database and everything runs without problem so there's maybe something in your data that's crashing it
[26 Sep 2018 10:16]
wangkun wang
I do not think you have given the right reason: the primary never crashes, but the secondary crashes frequently. So I think the reason you give is not right. But I will test it the way you say.
[1 Oct 2018 7:03]
Geert Vanderkelen
Hi guys,

We had a similar or probably the exact same stack trace with a crashing secondary member. Our issue might have been a network hiccup (in OpenStack). The MySQL member apparently saw this, handled it, recovered, but then crashed (pretty much in the same second). Maybe a race condition?

We are using:
* MySQL 8.0.12
* Official binaries for Debian 9 (installed from MySQL APT repositories)

As far as I can tell, this is now the 4th time this happened (also with 8.0.11).

Your pal,
Geert

```
2018-09-30T21:28:44.166190Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address ..2-fb5uwcb46fcq:3306 is reachable again.'
2018-09-30T21:28:44.166289Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address ..1-egtbu7j4kxnp:3306 is reachable again.'
2018-09-30T21:28:44.166308Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-09-30T21:28:44.206274Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-09-30T21:28:44.208220Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'
```

```
stack_bottom = 0 thread_stack 0x46000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char*, unsigned long)+0x2e) [0x5595c77c526e]
/usr/sbin/mysqld(handle_fatal_signal+0x4c1) [0x5595c6aa0e21]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7f9e8a74e0c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf) [0x7f9e889e3fff]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a) [0x7f9e889e542a]
/usr/lib/mysql/plugin/group_replication.so(abort_plugin_process(char const*)+0x18e) [0x7f9e556f988e]
/usr/lib/mysql/plugin/group_replication.so(Applier_module::kill_pending_transactions(bool, bool)+0x5d4) [0x7f9e556bcd64]
/usr/lib/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::was_member_expelled_from_group(Gcs_view const&) const+0x357) [0x7f9e556cf7a7]
/usr/lib/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::on_view_changed(Gcs_view const&, std::vector<std::pair<Gcs_member_identifier*, Gcs_message_data*>, std::allocator<std::pair<Gcs_member_identifier*, Gcs_message_data*> > > const&) const+0x9f) [0x7f9e556d760f]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::install_view(Gcs_xcom_view_identifier*, Gcs_group_identifier const&, std::map<Gcs_member_identifier, Xcom_member_state*, std::less<Gcs_member_identifier>, std::allocator<std::pair<Gcs_member_identifier const, Xcom_member_state*> > >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier*> >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::install_leave_view(Gcs_view::Gcs_view_error_code)+0x308) [0x7f9e5575fb38]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::xcom_receive_global_view(synode_no, Gcs_xcom_nodes*, bool)+0x60c) [0x7f9e55763d8c]
/usr/lib/mysql/plugin/group_replication.so(do_cb_xcom_receive_global_view(synode_no, synode_no, Gcs_xcom_nodes*)+0xc2) [0x7f9e55726142]
/usr/lib/mysql/plugin/group_replication.so(Global_view_notification::do_execute()+0x20) [0x7f9e5572fac0]
/usr/lib/mysql/plugin/group_replication.so(Parameterized_notification<false>::operator()()+0xa) [0x7f9e5572fb4a]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_engine::process()+0x97) [0x7f9e5572ff07]
/usr/lib/mysql/plugin/group_replication.so(process_notification_thread(void*)+0x9) [0x7f9e55730169]
/usr/sbin/mysqld(+0x1ee820f) [0x5595c7c5920f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f9e8a744494]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f9e88a99acf]
```

Maybe relevant, but a conservatively boring config:

```
[mysqld]
transaction_isolation = READ-COMMITTED
server_id = 1427
log_bin = binlog
binlog_expire_logs_seconds = 864000 # 10 days
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE
binlog_rows_query_log_events = ON
log_bin_trust_function_creators = ON

# Replication
relay_log = relaylog
relay_log_recovery = ON
log_slave_updates = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 4
slave_preserve_commit_order = 1
slave_type_conversions = ALL_NON_LOSSY

disabled_storage_engines = "MyISAM,FEDERATED,ARCHIVE,BLACKHOLE"

# Group Replication
plugin_load=group_replication.so
group_replication_start_on_boot = OFF
group_replication_bootstrap_group = OFF
group_replication_ip_whitelist = "localhost,***"
group_replication_group_name = 2b2fa659-22c7-4cfd-910b-c5ee96fc29e9
group_replication_local_address = ****
group_replication_group_seeds = "**3members*"
```
[1 Oct 2018 7:06]
Geert Vanderkelen
For the record, the failing group member successfully recovered. MySQL HA was never an issue here :)
[1 Oct 2018 10:35]
Geert Vanderkelen
I can easily reproduce this with virtual boxes:

1) Run 3 mysqld instances in VirtualBox, and configure MySQL Group Replication so it is nicely running.
2) Identify the primary; for example, host 'mysqld1' is the primary and 'mysqld2' a secondary.
3) "Unplug" the network for both a secondary and the primary:

```
$ VBoxManage controlvm mysqld1 setlinkstate2 off ; VBoxManage controlvm mysqld2 setlinkstate2 off
```

This pretty much consistently crashes MySQL with the above trace.
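The same experiment as a small script, assuming the VM names above and that the cluster NIC is adapter 2 (which is what setlinkstate2 targets); the wait time and restore step are assumptions, not part of Geert's report:

```
#!/bin/sh
# Simulate a network partition hitting the primary and one secondary.
VBoxManage controlvm mysqld1 setlinkstate2 off
VBoxManage controlvm mysqld2 setlinkstate2 off

# Give Group Replication time to detect the partition and expel members;
# the affected member should log the expel/abort sequence shown above.
sleep 60

# Restore the links before inspecting the error logs.
VBoxManage controlvm mysqld1 setlinkstate2 on
VBoxManage controlvm mysqld2 setlinkstate2 on
```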
[1 Oct 2018 16:12]
MySQL Verification Team
Hey Geert, long time!

Verified with VBox without a problem... I was concentrating on the "data import" part, but it looks like it's about the network.

thanks as always :)

all best
Bogdan
[9 Oct 2018 3:00]
wangkun wang
I think this problem is not really solved in MySQL, but when I use an 8 GB CentOS machine the problem disappears, so I have decided to close this bug. As a programmer, though, I do not think this problem is over. I will continue to pay attention to the InnoDB Cluster solution. This is a big project for the future.
[7 Nov 2018 10:26]
Nuno Carvalho
Hi,

The case you presented is not a bug; let me explain why.

Your system is not capable of handling all the load. This causes the member to be expelled from the group; that is, the load is so high that the system is not even able to perform the network operations needed to inform the group that the member is alive. You already replied that increasing the memory, which provides more computing resources, allows the system to cope.

Once a member is expelled, two things can happen:
1) the member can be put in read-only mode, in which writes are disallowed but reads are allowed;
2) the member kills itself, which breaks all connections to that member, disallowing any read or write.

This behaviour is controlled by the option group_replication_exit_state_action, whose default is ABORT_SERVER. This is exactly what you are seeing.
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

To overcome this scenario you have 3 alternatives:
1) increase the hardware resources (you already did that);
2) if the expulsion is due to network latency (not your case), configure the timeout after which a member is expelled using the option group_replication_member_expel_timeout:
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...
3) change the exit_state_action to READ_ONLY; it will not prevent the expulsion, but it will avoid the abort.

Best regards,
Nuno Carvalho
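A sketch of alternatives 2 and 3 as runtime settings; the values are illustrative, and group_replication_member_expel_timeout only exists from MySQL 8.0.13 onwards:

```
-- Alternative 3: keep an expelled member alive in super-read-only mode
-- instead of aborting the server.
SET GLOBAL group_replication_exit_state_action = 'READ_ONLY';

-- Alternative 2: give a suspected member extra time before expulsion
-- (requires MySQL 8.0.13 or later).
SET GLOBAL group_replication_member_expel_timeout = 30;
```

To make the first setting survive restarts, it can also go in the [mysqld] section, e.g. loose-group_replication_exit_state_action=READ_ONLY, in the style of the reporter's config.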
[11 Dec 2020 4:00]
v v
Please add a log message, for example: [ERROR] MySQL server will shut down because of group_replication_exit_state_action = ABORT_SERVER.