Bug #73066 | Replication stall with multi-threaded replication | ||
---|---|---|---|
Submitted: | 20 Jun 2014 15:45 | Modified: | 9 Nov 2018 5:59 |
Reporter: | Ovais Tariq | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.6.17 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[20 Jun 2014 15:45]
Ovais Tariq
The MySQL configuration is as follows:

[mysqld]
core-file
user = mysql
port = 3306
datadir = /db
socket = /tmp/mysql.sock
default-storage-engine = innodb
skip-external-locking
log_warnings=1
skip_name_resolve
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
group_concat_max_len = 1000000
server-id = 666

# InnoDB settings
innodb_data_home_dir = /db
innodb_log_group_home_dir = /db
innodb_data_file_path = ibdata1:100M:autoextend
innodb_buffer_pool_size = 20G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 10G
innodb_buffer_pool_load_at_startup=OFF
innodb_buffer_pool_dump_at_shutdown=OFF
# allows for row_format=compressed
innodb_file_format=Barracuda
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 0
innodb_lock_wait_timeout = 50
innodb_file_per_table
innodb_doublewrite = 0
innodb_io_capacity = 1200
innodb_read_io_threads = 6
innodb_write_io_threads = 6
innodb_stats_on_metadata = OFF

# Slow query log settings
# The default logs all full table scans,tmp tables,filesorts on disk queries
#use_global_long_query_time = 1
#long_query_time = 0.5
slow_query_log_file = slowquery.log
slow_query_log = 1
long_query_time = 3
log_slow_filter = "full_scan,tmp_table_on_disk,filesort_on_disk"
log_slow_verbosity = "full"

# Other general MySQL settings
sync_binlog = 0
query_cache_type = 0
query_cache_size = 0
max_connections = 3000
thread_cache_size = 1000
back_log = 1024
thread_concurrency = 32
innodb_thread_concurrency = 64
tmpdir = /var/tmp
max_allowed_packet = 24M
max_join_size = 4294967295
net_buffer_length = 2K
thread_stack = 512K
tmp_table_size = 64M
max_heap_table_size = 64M
table_open_cache = 2000

# Replication settings (master to slave)
binlog_cache_size = 2M
binlog_format=mixed
log-bin = bin
log-error = error.log
expire_logs_days = 5
slave-parallel-workers = 10
master-info-repository = "table"
relay-log-info-repository = "table"
gtid_mode = ON
enforce_gtid_consistency = true
log-slave-updates
replicate-ignore-table = mysql.user
replicate-ignore-table = mysql.db
replicate-ignore-table = mysql.tables_priv
replicate-ignore-table = mysql.proxies_priv

# Started tuning slave catchup speed, can use more research
slave-checkpoint-period = 1000
slave-checkpoint-group = 2048
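For readers who want to check whether their own server runs with a comparable replication setup, here is a minimal Python sketch that reads an excerpt of the configuration above and reports whether the multi-threaded applier is enabled. The helper names (`normalize`, `load_mysqld_options`, `uses_multithreaded_applier`) are made up for this illustration, and the standard-library `configparser` is only an approximation of how mysqld reads option files.

```python
import configparser

# Excerpt of the replication-related lines from the configuration above.
CNF_EXCERPT = """
[mysqld]
server-id = 666
binlog_format=mixed
slave-parallel-workers = 10
master-info-repository = "table"
relay-log-info-repository = "table"
gtid_mode = ON
enforce_gtid_consistency = true
log-slave-updates
slave-checkpoint-period = 1000
slave-checkpoint-group = 2048
"""


def normalize(name):
    # In MySQL option names, '-' and '_' may be used interchangeably,
    # so fold both spellings to the underscore form.
    return name.strip().lower().replace("-", "_")


def load_mysqld_options(text):
    # allow_no_value: flags such as "log-slave-updates" carry no value.
    # strict=False: a real my.cnf may repeat keys (e.g. replicate-ignore-table);
    # note that configparser then keeps only the last occurrence.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.optionxform = normalize
    parser.read_string(text)
    options = {}
    for key, value in parser["mysqld"].items():
        options[key] = value.strip('"') if value is not None else None
    return options


def uses_multithreaded_applier(options):
    # A worker count above zero turns on the coordinator/worker applier.
    return int(options.get("slave_parallel_workers") or 0) > 0


opts = load_mysqld_options(CNF_EXCERPT)
print("multi-threaded applier enabled:", uses_multithreaded_applier(opts))
```

The sketch only inspects the option file; the values actually in effect on a running server can differ if they were changed at runtime.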
[20 Jun 2014 15:50]
Ovais Tariq
More details on the state of the coordinator thread
Attachment: state_of_coordinator_thread_from_gdb.txt (text/plain), 39.71 KiB.
[23 Jun 2014 10:24]
MySQL Verification Team
Thank you for the report. I'm able to reproduce this issue with below steps:

# Setup replication using gtid(conf files attached)
# Stop replication(slave>stop slave;)
# Emulate load on master
# bin/mysqlslap --auto-generate-sql --auto-generate-sql-add-autoincrement --auto-generate-sql-execute-number=100000 --auto-generate-sql-load-type=mixed --auto-generate-sql-secondary-indexes=2 -c 10 --create-schema='replication' -T -e InnoDB -i 10 --number-char-cols=10 -S /tmp/mysql_master.sock
# Stop replication(slave>start slave;)
# Kill the slave kill -9 when one of repl thread is in "Waiting for master to send event" State
# Wait for sometime
# Start Slave server again(no skip slave start etc used)

Wait for sometime and then try to stop slave "stop slave" or "show slave status\G", everything seems to hang

mysql> stop slave;
^^ Freezed

mysql> show slave status\G
^^ Freezed

slave> show processlist;
+----+-------------+-----------+-------------+---------+------+-----------------------------------------------+-------------------+
| Id | User        | Host      | db          | Command | Time | State                                         | Info              |
+----+-------------+-----------+-------------+---------+------+-----------------------------------------------+-------------------+
|  1 | system user |           | NULL        | Connect | 6656 | Waiting for master to send event              | NULL              |
|  2 | system user |           | NULL        | Connect | 6567 | Waiting for Slave Worker to release partition | NULL              |
|  3 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  4 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  5 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  6 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  7 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  8 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
|  9 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
| 10 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
| 11 | system user |           | NULL        | Connect | 6656 | Waiting for an event from Coordinator         | NULL              |
| 12 | system user |           | NULL        | Connect | 7025 | Waiting for an event from Coordinator         | NULL              |
| 13 | root        | localhost | NULL        | Query   | 5449 | Killing slave                                 | stop slave        |
| 14 | root        | localhost | NULL        | Query   | 5344 | init                                          | show slave status |
| 15 | root        | localhost | NULL        | Query   | 4908 | init                                          | show slave status |
| 16 | root        | localhost | replication | Sleep   | 4278 |                                               | NULL              |
| 21 | root        | localhost | NULL        | Query   |    0 | init                                          | show processlist  |
+----+-------------+-----------+-------------+---------+------+-----------------------------------------------+-------------------+
17 rows in set (0.00 sec)
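The thread states in the processlist above form a recognisable pattern: the coordinator waits for a worker to release a partition while every worker waits for the coordinator. A rough Python sketch of a check for that pattern follows. It assumes the rows have already been fetched as dictionaries keyed by the SHOW PROCESSLIST column names; the function name `looks_like_mts_stall` and the `min_seconds` threshold are illustrative choices, not part of MySQL or of this report.

```python
COORDINATOR_BLOCKED = "Waiting for Slave Worker to release partition"
WORKER_IDLE = "Waiting for an event from Coordinator"
IO_THREAD_IDLE = "Waiting for master to send event"


def looks_like_mts_stall(rows, min_seconds=600):
    """Return True if the applier threads match the pattern in this report.

    rows: iterable of dicts with 'User', 'State' and 'Time' keys.
    min_seconds: assumed threshold for "stuck long enough to be suspicious".
    """
    # Replication threads run as "system user"; ignore the idle IO thread.
    appliers = [
        r for r in rows
        if r["User"] == "system user" and r["State"] != IO_THREAD_IDLE
    ]
    coordinators = [r for r in appliers if r["State"] == COORDINATOR_BLOCKED]
    workers = [r for r in appliers if r["State"] != COORDINATOR_BLOCKED]
    if len(coordinators) != 1 or not workers:
        return False
    # Every worker must be idle, i.e. nobody is left to release the partition.
    if any(w["State"] != WORKER_IDLE for w in workers):
        return False
    return all(int(r["Time"]) >= min_seconds for r in appliers)


# Rows transcribed from the processlist output above (ids 1 to 13).
reported = (
    [
        {"Id": 1, "User": "system user", "State": IO_THREAD_IDLE, "Time": 6656},
        {"Id": 2, "User": "system user", "State": COORDINATOR_BLOCKED, "Time": 6567},
    ]
    + [
        {"Id": i, "User": "system user", "State": WORKER_IDLE,
         "Time": 7025 if i == 12 else 6656}
        for i in range(3, 13)
    ]
    + [{"Id": 13, "User": "root", "State": "Killing slave", "Time": 5449}]
)

print("stall pattern present:", looks_like_mts_stall(reported))
```

A match is only a hint. A coordinator can legitimately wait on a worker for a short time, which is why the sketch also requires the states to have persisted for a while.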
[23 Jun 2014 10:40]
MySQL Verification Team
Typo in earlier repro note: # Stop replication(slave>start slave;) Should be # Start replication(slave>start slave;)
[23 Jun 2014 12:02]
qinglin zhang
Same bug as http://bugs.mysql.com/bug.php?id=72794.
[26 Feb 2015 13:30]
Laurynas Biveinis
Failed to reproduce with 5.7.5.
[22 Dec 2017 18:40]
Rolf Martin-Hoster
5.7.18 output
Attachment: 5 (application/octet-stream, text), 17.76 KiB.
[22 Dec 2017 18:40]
Rolf Martin-Hoster
This appears to still be happening in 5.7.18
[9 Nov 2018 5:59]
MySQL Verification Team
Confirmed that the issue is no longer reproducible on the latest 5.6 build. Internal discussion confirmed that the bug seems to have been fixed by Joao in 5.6.21+, after Bug#72794. Closing the issue for now; I will attach the latest results shortly. regards, Umesh
[10 Nov 2018 6:29]
MySQL Verification Team
I'm unable to reproduce this on 5.7.24 with the exact same steps (the only difference being that in 2014 I had a smaller test box and now have a moderate one).
[10 Nov 2018 6:30]
MySQL Verification Team
5.7.24 - Test results
Attachment: 73066_5.7.24.results (application/octet-stream, text), 18.03 KiB.
[19 Nov 2018 4:34]
MySQL Verification Team
Hello Rolf Martin-Hoster,

Our internal discussions and re-verification efforts, from my end and from Development, confirmed that the original issue is no longer reproducible; it was most likely fixed when Bug #72794 was fixed. One more difference: in the original report both stop slave and show slave status hung, but in the details you provided only the former hangs. We concluded that the issue might have been caused by some corner case.

In order to proceed further, may I request you to provide the details below? Quoting Dev's requested details as is:

-- Relay log contents when applier thread has errored out, this will help us to identify if there were any partial transactions in relay log.
-- Slave's error log file.
-- Does slave_parallel_type=DATABASE/LOGICAL_CLOCK was in use?
-- There seems to be an active transaction but it is hard to say why the transaction is in active state for such a long time. Can you please provide more details regarding what type of transactions were they doing?
-- Just before the MTS hang scenario was the IO thread or Slave server was restarted and START SLAVE was initiated?
-- Are you doing any cross version replication?

Also, we don't fix bugs in old versions and don't backport bug fixes, so this needs to be checked with the latest version anyway. So please upgrade and inform us if the problem still exists in the latest GA builds, along with exact reproducible steps.

Thank you!
regards, Umesh
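Some of the details requested above can be collected with ordinary SQL statements on the affected slave. The Python sketch below just assembles such statements into a script that can be fed to the mysql client. The selection of statements is an assumption about what would answer the questions, not something Dev specified, and `slave_parallel_type` exists only in 5.7 and later. SHOW SLAVE STATUS is left out on purpose, since it is one of the statements reported to hang. The relay log contents and the error log are files on disk and have to be gathered separately.

```python
# (description, statement) pairs; the descriptions become SQL comments.
DIAGNOSTIC_QUERIES = [
    ("Parallel applier mode (DATABASE or LOGICAL_CLOCK)",
     "SHOW GLOBAL VARIABLES LIKE 'slave_parallel_type'"),
    ("Number of applier workers",
     "SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers'"),
    ("Server version, to compare against the master for cross-version setups",
     "SELECT VERSION()"),
    ("Transactions that have been active the longest",
     "SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id "
     "FROM information_schema.innodb_trx ORDER BY trx_started"),
    ("Thread states at the time of the hang",
     "SHOW FULL PROCESSLIST"),
]


def render_sql_script(queries=DIAGNOSTIC_QUERIES):
    """Render the pairs as a commented SQL script, one statement per entry."""
    lines = []
    for description, statement in queries:
        lines.append("-- " + description)
        lines.append(statement + ";")
    return "\n".join(lines) + "\n"


script = render_sql_script()
print(script)
```

Running the same script on the master as well gives the version comparison needed for the cross-version question.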
[21 Dec 2018 14:26]
Amit Wadkar
"Waiting for an event from Coordinator"

This issue is still there in 5.7.17. Could anyone confirm in which release it is fixed?
[12 Jan 2023 18:36]
Sisir Adhikari
Hello, Looks like I am having the identical problem with 8.0.31. You can see more details at https://dba.stackexchange.com/questions/322053/mysql-replication-follower-stuck-and-behind