Bug #84731 Group Replication: mysql client connections hang during group replication start
Submitted: 31 Jan 2017 8:56 Modified: 18 Sep 2017 9:02
Reporter: Kenny Gryp Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S3 (Non-critical)
Version:5.7.17 OS:Any
Assigned to: CPU Architecture:Any

[31 Jan 2017 8:56] Kenny Gryp
Description:
When a node is starting group replication, mysql connection hangs. 

[vagrant@gr-3 ~]$ mysql
.....

- This will cause applications to get stuck trying to connect to that database. this can cause major problems.
- proxies such as proxysql will run a scheduler to do a healthcheck of the nodes and figure out the status of a node by connecting to mysql,  buti  will be stuck in timeout effectively blocking it's operation.

How to repeat:
- create a mysql config with group replication enabled and start on boot make sure no node is yet running
-start mysql, during that GR will get 30 seconds or it will time out
- in the meantime, try to connect with mysql client. it will not hang

2017-01-31T08:50:04.400882Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.17-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)
2017-01-31T08:50:04.461422Z 3 [Note] Plugin group_replication reported: 'Group communication SSL configuration: group_replication_ssl_mode: "DISABLED"'
2017-01-31T08:50:04.466216Z 3 [Note] Plugin group_replication reported: '[GCS] Added automatically IP ranges 10.0.2.15/24,127.0.0.1/8,192.168.56.4/24 to the whitelist'
2017-01-31T08:50:04.470824Z 3 [Note] Plugin group_replication reported: '[GCS] Translated 'gr-3' to 192.168.56.4'
2017-01-31T08:50:04.471021Z 3 [Note] Plugin group_replication reported: '[GCS] SSL was not enabled'
2017-01-31T08:50:04.471041Z 3 [Note] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: "da7aba5e-dead-da7a-ba55-da7aba5e57ab"; group_replication_local_address: "gr-3:24901"; group_replication_group_seeds: "gr-1:24901,gr-2:24901,gr-3:24901"; group_replication_bootstrap_group: false; group_replication_poll_spin_loops: 0; group_replication_compression_threshold: 1000000; group_replication_ip_whitelist: "AUTOMATIC"'
2017-01-31T08:50:04.495246Z 4 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2017-01-31T08:50:04.525500Z 7 [Note] Slave SQL thread for channel 'group_replication_applier' initialized, starting replication in log 'FIRST' at position 0, relay log './gr-3-relay-bin-group_replication_applier.000028' position: 4
2017-01-31T08:50:04.526349Z 3 [Note] Plugin group_replication reported: 'Group Replication applier module successfully initialized!'
2017-01-31T08:50:04.545283Z 3 [Note] Plugin group_replication reported: 'auto_increment_increment is set to 7'
2017-01-31T08:50:04.545320Z 3 [Note] Plugin group_replication reported: 'auto_increment_offset is set to 3'
2017-01-31T08:50:04.545624Z 0 [Note] Plugin group_replication reported: 'state 0 action xa_init'
2017-01-31T08:50:04.578010Z 0 [Note] Plugin group_replication reported: 'Successfully bound to 0.0.0.0:24901 (socket=48).'
2017-01-31T08:50:04.578079Z 0 [Note] Plugin group_replication reported: 'Successfully set listen backlog to 32 (socket=48)!'
2017-01-31T08:50:04.578095Z 0 [Note] Plugin group_replication reported: 'Successfully unblocked socket (socket=48)!'
2017-01-31T08:50:04.578146Z 0 [Note] Plugin group_replication reported: 'Ready to accept incoming connections on 0.0.0.0:24901 (socket=48)!'
2017-01-31T08:50:04.579697Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.580011Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 49'
2017-01-31T08:50:04.581846Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.582031Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 63'
2017-01-31T08:50:04.582501Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.582638Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 65'
2017-01-31T08:50:04.583002Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.583129Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 67'
2017-01-31T08:50:04.583450Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.583568Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 69'
2017-01-31T08:50:04.583916Z 0 [Note] Plugin group_replication reported: 'connecting to gr-3 24901'
2017-01-31T08:50:04.584033Z 0 [Note] Plugin group_replication reported: 'client connected to gr-3 24901 fd 71'
2017-01-31T08:50:04.584360Z 0 [Note] Plugin group_replication reported: 'connecting to gr-1 24901'
2017-01-31T08:50:04.585937Z 0 [Note] Plugin group_replication reported: 'client connected to gr-1 24901 fd 73'

[vagrant@gr-3 ~]$ mysql
....

2017-01-31T08:50:34.600240Z 0 [ERROR] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to be ready!'
2017-01-31T08:50:34.600292Z 0 [ERROR] Plugin group_replication reported: '[GCS] The group communication engine is not ready for the member to join. Local port: 24901'
2017-01-31T08:50:34.600417Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_terminate'
2017-01-31T08:50:34.600436Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-01-31T08:50:34.600439Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-01-31T08:50:34.600735Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-01-31T08:50:34.600746Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-01-31T08:50:34.608277Z 0 [Warning] Plugin group_replication reported: 'read failed'
2017-01-31T08:50:34.656636Z 0 [ERROR] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 24901'
2017-01-31T08:51:04.545781Z 3 [ERROR] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2017-01-31T08:51:04.545837Z 3 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2017-01-31T08:51:04.545862Z 3 [ERROR] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2017-01-31T08:51:04.553274Z 3 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2017-01-31T08:51:04.553589Z 3 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2017-01-31T08:51:04.694555Z 7 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2017-01-31T08:51:05.064547Z 4 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'

only after the group replication thread is killed mysql login will happen..... (which I don't want, see https://bugs.mysql.com/bug.php?id=84728) 

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.7.17-log MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

Suggested fix:
make sure mysql gives an error like "group replication not ready yet, not accepting reads/writes"
[31 Jan 2017 12:39] Umesh Shastry
Hello Kenny Gryp,

Thank you for the report.

Thanks,
Umesh
[31 Jan 2017 12:40] Umesh Shastry
--
rm -rf data
mkdir data
bin/mysqld --initialize-insecure --basedir=$PWD/mysql-5.7.17 --datadir=$PWD/data/s1
bin/mysqld --initialize-insecure --basedir=$PWD/mysql-5.7.17 --datadir=$PWD/data/s2
bin/mysqld --initialize-insecure --basedir=$PWD/mysql-5.7.17 --datadir=$PWD/data/s3

bin/mysqld --defaults-file=./s1.cnf --log-error=s1.err 2>&1 &
bin/mysqld --defaults-file=./s2.cnf --log-error=s2.err 2>&1 &
bin/mysqld --defaults-file=./s3.cnf --log-error=s3.err 2>&1 &

-- tried to connect all the 3 nodes

[umshastr@hod03]/export/umesh/server/binaries/GABuilds/mysql-5.7.17: bin/mysql -uroot -S /tmp/mysql.sock

[umshastr@hod03]/export/umesh/server/binaries/GABuilds/mysql-5.7.17: bin/mysql -uroot -S /tmp/mysql_s3.sock

[umshastr@hod03]/export/umesh/server/binaries/GABuilds/mysql-5.7.17: bin/mysql -uroot -S /tmp/mysql_s3.sock

^^ all 3 connections hanged
-- pmp

[umshastr@hod03]~/bugs: ./pmp.sh
      2 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=12),srv_worker_thread,start_thread,clone
      1 sigwait,signal_hand,pfs_spawn_thread,start_thread,clone
      1 sigwaitinfo,timer_notify_thread_func,pfs_spawn_thread,start_thread,clone
      1 select,poll_wait,task_loop,xcom_taskmain2,Gcs_xcom_proxy_impl::xcom_init,xcom_taskmain_startup,start_thread,clone
      1 pthread_join,Gcs_xcom_engine::finalize,Gcs_xcom_interface::finalize,Gcs_operations::finalize,leave_group,Delayed_initialization_thread::initialization_thread_handler,launch_handler_thread,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=5),srv_purge_coordinator_suspend,out>),start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=25),srv_worker_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=1),buf_resize_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait,reset_sig_count=1),buf_dump_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,native_cond_wait,out>,,src_file=0x15c1d60,out>),pfs_spawn_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,native_cond_wait,cond=0x7fd400314310),src_file=0x7fd409c120a8,out>,,at,launch_handler_thread,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,native_cond_wait,cond=0x7fd3f8055960),mutex=0x7fd3f8055408,,thd=0x7fd3f00008c0,,next_event,rli=0x7fd3f8054b80),handle_slave_sql,pfs_spawn_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,native_cond_wait,cond=0x3c90ae0),mutex=0x3c90ab0,,at,group_replication_before_handle_connection,Server_state_delegate::before_handle_connection,mysqld_main,__libc_start_main,_start
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,srv_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,srv_error_monitor_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,pc_sleep_if_needed,out>),start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,lock_wait_timeout_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,ib_wqueue_timedwait,fts_optimize_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,os_event::timed_wait,os_event::wait_time_low,dict_stats_thread,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,native_cond_timedwait,mp=0x7fd400045508,,src_file=0x7fd409c12dd8,at,launch_broadcast_thread,pfs_spawn_thread,start_thread,clone
      1 nanosleep,sleep,xcom_sleep,a=0x7fd409ad2b00,,xcom_send_cfg_wait,Gcs_xcom_control::do_join,Parameterized_notification<false>::operator(),Gcs_xcom_engine::process,process_notification_thread,start_thread,clone
      1 nanosleep,os_thread_sleep,srv_master_sleep,out>),start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd434338ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd433937ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd432f36ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd432535ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd431b34ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd431133ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd430732ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd42fd31ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd42f330ec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1 __io_getevents_0_4,LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_linux_handler,m1=0x7fd42e92fec8,,fil_aio_wait,io_handler_thread,start_thread,clone
      1
[31 Jan 2017 16:22] Nuno Carvalho
Posted by developer:
 
Hi Kenny,

Thank you for your suggestion, we will look into it.

Best regards,
Nuno Carvalho
[18 Sep 2017 10:30] David Moss
Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the
following was added to the 5.7.20 / 8.0.4 changelog:
The delayed initialization mechanism used for server starts has been
improved. Now, it only blocks connections until the server is in read mode.