MySQL Bugs: #92450: innodb cluster replacation crash down

Bug #92450	innodb cluster replacation crash down
Submitted:	17 Sep 2018 6:06	Modified:	9 Oct 2018 3:00
Reporter:	wangkun wang	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S5 (Performance)
Version:	8.0.12	OS:	CentOS
Assigned to:		CPU Architecture:	x86
Tags:	innodb cluster source command import data

Description:
when I use innodb cluster 8.0.12 to source dump data,the secondary cluster crash down.

How to repeat:
You can use three machines with centos 7 64,make a group replacation ,with 4Gb memory,source command to import 4.8Gb datas,it crashed.
the error is as follows.
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=8388608
max_used_connections=8
max_threads=500
thread_count=7
connection_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8205723 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x46000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char*, unsigned long)+0x3d) [0x1aa273d]
/usr/sbin/mysqld(handle_fatal_signal+0x423) [0xcd7b83]
/lib64/libpthread.so.0(+0xf6d0) [0x7f98c6e616d0]
/lib64/libc.so.6(gsignal+0x37) [0x7f98c5178277]
/lib64/libc.so.6(abort+0x148) [0x7f98c5179968]
/usr/lib64/mysql/plugin/group_replication.so(abort_plugin_process(char const*)+0x1b1) [0x7f983ca100b1]
/usr/lib64/mysql/plugin/group_replication.so(Applier_module::kill_pending_transactions(bool, bool)+0x30a) [0x7f983c9d303a]
/usr/lib64/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::was_member_expelled_from_group(Gcs_view const&) const+0x367) [0x7f983c9e6087]
/usr/lib64/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::on_view_changed(Gcs_view const&, std::vector<std::pair<Gcs_member_identifier*, Gcs_message_data*>, std::allocator<std::pair<Gcs_member_identifier*, Gcs_message_data*> > > const&) const+0xa5) [0x7f983c9edbc5]
/usr/lib64/mysql/plugin/group_replication.so(Gcs_xcom_control::install_view(Gcs_xcom_view_identifier*, Gcs_group_identifier const&, std::map<Gcs_member_identifier, Xcom_member_state*, std::less<Gcs_member_identifier>, std::allocator<std::pair<Gcs_member_identifier const, Xcom_member_state*> > >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier*> >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifi/usr/lib64/mysql/plugin/group_replication.so(Gcs_xcom_control::install_leave_view(Gcs_view::Gcs_view_error_code)+0x2b8) [0x7f983ca715e8]
/usr/lib64/mysql/plugin/group_replication.so(Gcs_xcom_control::xcom_receive_global_view(synode_no, Gcs_xcom_nodes*, bool)+0x1939) [0x7f983ca7b649]
/usr/lib64/mysql/plugin/group_replication.so(do_cb_xcom_receive_global_view(synode_no, synode_no, Gcs_xcom_nodes*)+0xc2) [0x7f983ca3da52]
/usr/lib64/mysql/plugin/group_replication.so(Global_view_notification::do_execute()+0x20) [0x7f983ca478a0]
/usr/lib64/mysql/plugin/group_replication.so(Parameterized_notification<false>::operator()()+0xa) [0x7f983ca4792a]
/usr/lib64/mysql/plugin/group_replication.so(Gcs_xcom_engine::process()+0x97) [0x7f983ca47cc7]
/usr/lib64/mysql/plugin/group_replication.so(process_notification_thread(void*)+0x9) [0x7f983ca47f09]
/usr/sbin/mysqld() [0x1f42a3f]
/lib64/libpthread.so.0(+0x7e25) [0x7f98c6e59e25]
/lib64/libc.so.6(clone+0x6d) [0x7f98c5240bad]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Hi,
I need more data to reproduce this. Your config? Data you are loading?

- creating 3 node cluster on centos7
- import 5G data

worked ok

thanks
Bogdan

Do you use the config argument in the error log?I use the config argument and check the group replication status,I can find the mysql group replication from three to two,one mysql instance is down and the error log is as the discrib。My config is as follows.
[mysqld]
server_id=1001
port=33060
socket=/var/lib/mysql/mysql.sock
pid-file=/var/run/mysqld/mysqld.pid
datadir=/usr/local/mysql/data
tmpdir=/usr/local/mysql/tmp
lower_case_table_names=1
user=mysql
enforce_gtid_consistency=ON
gtid_mode=ON
binlog_checksum=none
default_authentication_plugin=mysql_native_password
skip-name-resolve=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
log-slave-updates=1
table_open_cache=2048
open_files_limit=1024
max_connections=500
max_connect_errors=10
wait_timeout=28800
interactive_timeout=28800
sql_mode="STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION"
log_timestamps=SYSTEM
default-time-zone='+8:00'
explicit_defaults_for_timestamp=on
character-set-server=utf8
default-storage-engine=InnoDB
transaction_isolation=READ-COMMITTED
skip-external-locking
#innodb
innodb_data_file_path=ibdata1:1024M:autoextend
innodb_buffer_pool_size=4096M
innodb_flush_log_at_trx_commit=1
innodb_io_capacity=600
innodb_lock_wait_timeout=120
innodb_log_buffer_size=16M
innodb_log_file_size=256M
innodb_log_files_in_group=3
innodb_max_dirty_pages_pct=85
innodb_read_io_threads=8
innodb_write_io_threads=8
innodb_thread_concurrency=32
innodb_file_per_table=1
innodb_rollback_on_timeout=on
innodb_flush_method=O_DIRECT
#session
group_concat_max_len=102400
max_allowed_packet=128M
sort_buffer_size=8M
join_buffer_size=8M
key_buffer_size=8M
read_buffer_size=8M
read_rnd_buffer_size=8M
bulk_insert_buffer_size=32M
thread_cache_size=16
tmp_table_size=32M
max_heap_table_size=32M
tmp_table_size=32M
#log
log_slave_updates=on
sync_binlog=1
binlog_format=row
log-bin-trust-function-creators=1
binlog_cache_size=64M
max_binlog_cache_size=512M
max_binlog_size=1024M
log-bin=/usr/local/mysql/binlog/mysqlbin
log_bin_index=/usr/local/mysql/binlog/mysqlbin.index
long_query_time=0.4
log_slow_admin_statements=ON
slow_query_log=1
slow_query_log_file=/usr/local/mysql/log/slow.log
log-error=/usr/local/mysql/log/error.log
log_queries_not_using_indexes=ON
log_throttle_queries_not_using_indexes=10
log_output=FILE,TABLE
master_info_file=/usr/local/mysql/binlog/master.info
binlog_expire_logs_seconds=604800
back_log=512
#mgrset
mysqlx_port=33101
mysqlx_socket=/var/lib/mysql/mysqlx.sock
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
loose-group_replication_start_on_boot=off
loose-group_replication_local_address="mysql1:33101"
loose-group_replication_group_seeds="mysql1:33101,mysql2:33101,mysql3:33101"
loose-group_replication_bootstrap_group=off
loose-group_replication_ip_whitelist="192.168.66.0/24"
loose-group_replication_auto_increment_increment=1
loose-group_replication_single_primary_mode=on
[mysqldump]
quick

I am sorry I can not give you the data.But I have 5.8GB mysql dump file,and one table have 4000000 data.I use the dump file to run source and the problem show again.

Hi,

> I am sorry I can not give you the data.But I have 5.8GB mysql dump file,
> and one table have 4000000 data.I use the dump file to run source and 
> the problem show again.

You start with blank servers in replication, you load that dump into master and you get this crash?

Can you try loading only that table with 4M rows and nothing else?

kind regards
Bogdan

Thanks for your regard!

Yes,I did this as the first question described.

But,I do not understand the second question.Do you mean I use the source command with a 4M dump file or change the config to 4M.If you mean use the 4M dump file，it ok,this not show the error log.
If you mean change the config file form 8M to 4M.In my opinion,I think it is not to change the 8M to 4M to change this problem.the max_threads is the argument in config file max_connection,in my test,the max_threads change as I change the max_connection.I do not think it is right.Because the max_connections is different with max_threads.Of course，may be what I think is wrong.But I think pre cache for the max_connection is right,but this make the memory not enough is something worng.

Hi,

> But,I do not understand the second question.Do you mean I use the source
> command with a 4M dump file or change the config to 4M.
> If you mean use the 4M dump file，it ok,this not show the error log.

Yes, I meant you can try loading one table at a time, to see when the crash will happen. Will it happen if you load that one huge table only, or you need a whole dump to crash it. Does it always crash on the same place or not... 

So, no need to change anything in config, just loading different data to see what might be the reason for the crash, as I run a 3 node innodb cluster and load 100G database and everything runs without problem so there's maybe something in your data that's crashing it

I do not think you give the write reason, the primary never crash down,but the slave crash down frequently.So I think the reason you give is not right.But I will test by you say.

Hi guys,

We had a similar or probably the exact same stack trace with a crashing secondary member.

Our issue might have been some network hick-up (in OpenStack). The MySQL member apparently saw this, handled it, recovered, but then crashed (pretty much in the same second). Maybe race-condition?

We are using:
* MySQL 8.0.12
* Official binaries for Debian 9 (installed from MySQL APT repositories)

As far as I can tell, this is now the 4th time this happened (also with 8.011).

Your pal,
Geert

```
2018-09-30T21:28:44.166190Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address ..2-fb5uwcb46fcq:3306 is reachable again.'
2018-09-30T21:28:44.166289Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address ..1-egtbu7j4kxnp:3306 is reachable again.'
2018-09-30T21:28:44.166308Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2018-09-30T21:28:44.206274Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2018-09-30T21:28:44.208220Z 0 [ERROR] [MY-013173] [Repl] Plugin group_replication reported: 'The plugin encountered a critical error and will abort: Fatal error during execution of Group Replication'

```

```
stack_bottom = 0 thread_stack 0x46000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char*, unsigned long)+0x2e) [0x5595c77c526e]
/usr/sbin/mysqld(handle_fatal_signal+0x4c1) [0x5595c6aa0e21]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7f9e8a74e0c0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf) [0x7f9e889e3fff]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a) [0x7f9e889e542a]
/usr/lib/mysql/plugin/group_replication.so(abort_plugin_process(char const*)+0x18e) [0x7f9e556f988e]
/usr/lib/mysql/plugin/group_replication.so(Applier_module::kill_pending_transactions(bool, bool)+0x5d4) [0x7f9e556bcd64]
/usr/lib/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::was_member_expelled_from_group(Gcs_view const&) const+0x357) [0x7f9e556cf7a7]
/usr/lib/mysql/plugin/group_replication.so(Plugin_gcs_events_handler::on_view_changed(Gcs_view const&, std::vector<std::pair<Gcs_member_identifier*, Gcs_message_data*>, std::allocator<std::pair<Gcs_member_identifier*, Gcs_message_data*> > > const&) const+0x9f) [0x7f9e556d760f]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::install_view(Gcs_xcom_view_identifier*, Gcs_group_identifier const&, std::map<Gcs_member_identifier, Xcom_member_state*, std::less<Gcs_member_identifier>, std::allocator<std::pair<Gcs_member_identifier const, Xcom_member_state*> > >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier*> >*, std::set<Gcs_member_identifier*, std::less<Gcs_member_identifier*>, std::allocator<Gcs_member_identifier/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::install_leave_view(Gcs_view::Gcs_view_error_code)+0x308) [0x7f9e5575fb38]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_control::xcom_receive_global_view(synode_no, Gcs_xcom_nodes*, bool)+0x60c) [0x7f9e55763d8c]
/usr/lib/mysql/plugin/group_replication.so(do_cb_xcom_receive_global_view(synode_no, synode_no, Gcs_xcom_nodes*)+0xc2) [0x7f9e55726142]
/usr/lib/mysql/plugin/group_replication.so(Global_view_notification::do_execute()+0x20) [0x7f9e5572fac0]
/usr/lib/mysql/plugin/group_replication.so(Parameterized_notification<false>::operator()()+0xa) [0x7f9e5572fb4a]
/usr/lib/mysql/plugin/group_replication.so(Gcs_xcom_engine::process()+0x97) [0x7f9e5572ff07]
/usr/lib/mysql/plugin/group_replication.so(process_notification_thread(void*)+0x9) [0x7f9e55730169]
/usr/sbin/mysqld(+0x1ee820f) [0x5595c7c5920f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f9e8a744494]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f9e88a99acf]

```

Maybe relevant, bug conservatively boring config:

```
[mysqld]
transaction_isolation = READ-COMMITTED

server_id = 1427
log_bin = binlog
binlog_expire_logs_seconds = 864000  # 10 days
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE
binlog_rows_query_log_events = ON
log_bin_trust_function_creators = ON

# Replication
relay_log = relaylog
relay_log_recovery = ON
log_slave_updates = ON
master_info_repository = TABLE
relay_log_info_repository = TABLE
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 4
slave_preserve_commit_order = 1
slave_type_conversions = ALL_NON_LOSSY
disabled_storage_engines = "MyISAM,FEDERATED,ARCHIVE,BLACKHOLE"

# Group Replication
plugin_load=group_replication.so
group_replication_start_on_boot = OFF
group_replication_bootstrap_group = OFF
group_replication_ip_whitelist = "localhost,***"
group_replication_group_name = 2b2fa659-22c7-4cfd-910b-c5ee96fc29e9
group_replication_local_address = ****
group_replication_group_seeds = "**3members*"
```

For the record, the failing group member successfully recovered. MySQL HA was never an issue here :)

I can easily reproduce this with virtual boxes:

1) Run 3 mysqld in VirtualBox, and configure MySQL Replication so it is nicely running.
2) Get the primary, for example host 'mysqld1' is primary, 'mysqld2' a secondary.
3) "Unplug" network for both a secondary, and the primary:

$ VBoxManage controlvm mysqld1 setlinkstate2 off ; VBoxManage controlvm mysqld2 setlinkstate2 off

Pretty much consistently crashes MySQL with the above trace.

Hey Geert, long time,

Verified with VBox without a problem ... I was concentrating on "data import" part but looks it's about network.. thanks as always :)

all best
Bogdan

I think this problem is not solved in mysql.But I use the 8GB centos this problem disappear.So I decide to close this bug.But as a programer,I do not think this problem is over.I will continue to pay attention to the innodb cluster solution.This is a big project in the future.

Hi,

The case you are presented is not a bug, let me explain why.

Your system is not capable of handling all the load, this makes the member to
be expelled from the group, that is, the load is so high that the system is
not even to perform the network operations need to inform the group that all
members are alive.
You already replied that increasing the memory, which provides more computing
resources, allows the system to perform.

Once a member is expelled two things can happen:
 1) a member can be put in read-only mode.
    On which writes are disallowed, but reads are allowed.
 2) a member does kill it self;
    This will break all connections to that member, disallowing any read or
    write.
    This behaviour is controlled by the option
    group_replication_exit_state_action, which default is ABORT_SERVER.
This is exactly what you are seeing.
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

To overcome this scenario you have 3 alternatives:
 1) increase the hardware resources, you already did that.
 2) if the expel is due to network latency (not your case), you can configure the timeout after which a member is expelled using the option group_replication_member_expel_timeout
     https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...
 3) change the exit_state_action to READ_ONLY, it will not solve the expel but will dodge the abort.

Best regards,
Nuno Carvalho

Please add a log.
[ERROR] MySQL server will crash shutdon becauase of group_replication_exit_state_action = ABORT_SERVER.