Description:
use sysbench 1.0.13 (using bundled LuaJIT 2.1.0-beta2) to test MTS replication,when test with parameters below:
sysbench --test=/root/benchSQL/oltp_legacy/oltp.lua --oltp_tables_count=10 --oltp-table-size=200000 --mysql-user=root --mysql-socket=/data/test/data/mysql.sock --threads=128 --time=120 --events=0 --oltp-test-mode=complex run
the os7.0 quickly hang on with "Waiting for dependent transaction to commit"., and lag increase :
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 421
Master_UUID: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf:1892412-2959062
Executed_Gtid_Set: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf:1-1893240
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
the processlist look like this:
+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
| 100 | system user | | NULL | Connect | 75381 | Waiting for master to send event | NULL |
| 101 | system user | | NULL | Connect | 75375 | Waiting for dependent transaction to commit | NULL |
| 102 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 103 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 104 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 105 | system user | | NULL | Connect | 76614 | System lock | NULL |
| 106 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 107 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 108 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 109 | system user | | NULL | Connect | 76614 | Waiting for an event from Coordinator | NULL |
| 7641 | root | localhost | NULL | Query | 0 | starting | show processlist |
+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
11 rows in set (0.00 sec)
i can't stop replication with command "stop slave" even can't stop mysql unless use kill -9 (mysql will quickly restart itself quickly). when restart mysql, then start replication again , it replicate several trx then hang on again with the same reason .
if i restart mysql which hang on use "kill -9" , disable MTS before start replication, then it goes normal.
firstly i guess maybe it's a msyql version bug ,so i replace 5.7.21 official rpm to test, then it repeats again. then i try to change centos 7.0.1406(3.10.0-123.el7.x86_64) to centos 7.4.1708(3.10.0-693.el7.x86_64) , and both version on os7.4 run MTS replication normally.
even i copy the fault data, rpm , my.cnf on os7.0 to os7.4, then restart replication , "Waiting for dependent transaction to commit" never cause hang on under os7.4
finally, i run MTS on os7.0 to make it repeat the error , kill the mysql, upgrade 7.0 to 7.4 , retest it and never found hang on again on 7.4
so i confuse whether it's a mysql's bug or system's bug
my my.cnf:
[mysql]
[mysqld]
#Innodb
innodb_log_file_size = 2G
innodb_log_files_in_group = 3
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 2
innodb_read_io_threads = 32
innodb_write_io_threads = 32
innodb_thread_concurrency = 256
innodb_commit_concurrency = 256
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_max_dirty_pages_pct=20
innodb_flush_neighbors = 0
innodb_open_files = 65535
innodb_buffer_pool_size = 32G
innodb_flush_method=O_DIRECT
innodb_lock_wait_timeout=60
#server
autocommit=on
character-set-server = utf8
disable-partition-engine-check
explicit_defaults_for_timestamp=true
general-log=0
interactive_timeout = 28800
join_buffer_size = 1M
lock_wait_timeout=60
log-output = FILE
log_timestamps=SYSTEM
long_query_time = 0.2
lower_case_table_names = 1
max_allowed_packet=1G
max_connections = 8000
max_heap_table_size = 128k
max_prepared_stmt_count = 16382
net_buffer_length=32k
open_files_limit = 65535
query_cache_type = 0
read_rnd_buffer_size = 128k
skip_name_resolve
slow_launch_time = 1
slow-query-log
sort_buffer_size = 2M
table_definition_cache=4000
table_open_cache = 65535
thread_cache_size = 500
default_time_zone='+8:00'
tmp_table_size = 32M
transaction-isolation = READ-COMMITTED
wait_timeout=28800
#replication
skip-slave-start
expire_logs_days = 30
gtid-mode=ON
enforce-gtid-consistency
log-bin = mysql-bin
log-bin-index = mysql-bin.index
binlog-format = ROW
log_slave_updates
sync_binlog = 1000
relay-log = relay-bin
relay-log-index = relay-bin.index
binlog_cache_size = 32k
binlog_rows_query_log_events=on
binlog_row_image=FULL
port = 3306
server_id = 2
datadir = /data/test/data
socket = /data/test/data/mysql.sock
pid-file = /data/test/data/mysql.pid
tmpdir = /data/test/tmp
ssl-ca = /data/test/data/ca.pem
ssl-cert = /data/test/data/server-cert.pem
ssl-key = /data/test/data/server-key.pem
log-error = /data/test/logs/error.log
general_log_file = /data/test/logs/general.log
slow_query_log_file = /data/test/logs/slow.log
#####
slave-parallel-type=LOGICAL_CLOCK
slave-parallel-workers=8
master_info_repository=TABLE
relay_log_info_repository=TABLE
relay_log_recovery=ON
slave_preserve_commit_order = ON
How to repeat:
make a mts replicaiton mysql ,then run sysbench
sysbench --test=/root/benchSQL/oltp_legacy/oltp.lua --oltp_tables_count=10 --oltp-table-size=200000 --mysql-user=root --mysql-socket=/data/test/data/mysql.sock --threads=128 --time=120 --events=0 --oltp-test-mode=complex run