Bug #90361 MTS hang on Waiting for dependent transaction to commit
Submitted: 10 Apr 2018 3:48 Modified: 4 May 2018 2:13
Reporter: pc leon Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.17/5.7.21 OS:CentOS (7.0.1406(3.10.0-123.el7.x86_64)/7.4.1708(3.10.0-693.el7.x86_64))
Assigned to: CPU Architecture:Any

[10 Apr 2018 3:48] pc leon
Description:
use sysbench 1.0.13 (using bundled LuaJIT 2.1.0-beta2) to test MTS replication,when test with parameters below:
sysbench --test=/root/benchSQL/oltp_legacy/oltp.lua --oltp_tables_count=10 --oltp-table-size=200000 --mysql-user=root   --mysql-socket=/data/test/data/mysql.sock --threads=128 --time=120 --events=0 --oltp-test-mode=complex run

the os7.0 quickly hang on with "Waiting for dependent transaction to commit"., and lag increase :

               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 421
                  Master_UUID: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf:1892412-2959062
            Executed_Gtid_Set: 471d803d-3bb6-11e8-8f86-28a6dbbf5ebf:1-1893240
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 

the processlist look like this:

+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
| Id   | User        | Host      | db   | Command | Time  | State                                       | Info             |
+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
|  100 | system user |           | NULL | Connect | 75381 | Waiting for master to send event            | NULL             |
|  101 | system user |           | NULL | Connect | 75375 | Waiting for dependent transaction to commit | NULL             |
|  102 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  103 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  104 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  105 | system user |           | NULL | Connect | 76614 | System lock                                 | NULL             |
|  106 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  107 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  108 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
|  109 | system user |           | NULL | Connect | 76614 | Waiting for an event from Coordinator       | NULL             |
| 7641 | root        | localhost | NULL | Query   |     0 | starting                                    | show processlist |
+------+-------------+-----------+------+---------+-------+---------------------------------------------+------------------+
11 rows in set (0.00 sec)

i can't stop replication with command "stop slave" even can't stop mysql unless use kill -9 (mysql will quickly restart itself quickly). when restart mysql, then start replication again , it replicate several trx then hang on again with the same reason .
if i restart mysql which hang on use "kill -9" , disable MTS before start replication, then it goes normal.
firstly i guess maybe it's a msyql version bug ,so i replace 5.7.21 official rpm to test, then it repeats again. then i try to change centos 7.0.1406(3.10.0-123.el7.x86_64) to  centos 7.4.1708(3.10.0-693.el7.x86_64) , and both version on os7.4 run MTS replication normally.
even i copy the fault data, rpm , my.cnf on os7.0 to os7.4, then restart replication , "Waiting for dependent transaction to commit" never cause hang on under os7.4
finally, i run MTS on os7.0 to make it repeat the error , kill the mysql, upgrade 7.0 to 7.4 , retest it and never found hang on again on 7.4

so i confuse whether it's a mysql's bug or system's bug 

my my.cnf:

[mysql]

[mysqld]

#Innodb
innodb_log_file_size = 2G
innodb_log_files_in_group = 3
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 2
innodb_read_io_threads = 32
innodb_write_io_threads = 32
innodb_thread_concurrency = 256
innodb_commit_concurrency = 256
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_max_dirty_pages_pct=20
innodb_flush_neighbors = 0
innodb_open_files = 65535
innodb_buffer_pool_size = 32G
innodb_flush_method=O_DIRECT
innodb_lock_wait_timeout=60

#server
autocommit=on
character-set-server = utf8
disable-partition-engine-check
explicit_defaults_for_timestamp=true
general-log=0
interactive_timeout = 28800
join_buffer_size = 1M
lock_wait_timeout=60
log-output = FILE
log_timestamps=SYSTEM
long_query_time = 0.2
lower_case_table_names = 1
max_allowed_packet=1G
max_connections = 8000
max_heap_table_size = 128k
max_prepared_stmt_count = 16382
net_buffer_length=32k
open_files_limit = 65535
query_cache_type = 0
read_rnd_buffer_size = 128k
skip_name_resolve
slow_launch_time = 1
slow-query-log
sort_buffer_size = 2M
table_definition_cache=4000
table_open_cache = 65535 
thread_cache_size = 500
default_time_zone='+8:00'
tmp_table_size = 32M
transaction-isolation = READ-COMMITTED
wait_timeout=28800

#replication
skip-slave-start
expire_logs_days = 30
gtid-mode=ON
enforce-gtid-consistency
log-bin = mysql-bin
log-bin-index = mysql-bin.index
binlog-format = ROW
log_slave_updates
sync_binlog = 1000
relay-log = relay-bin
relay-log-index = relay-bin.index
binlog_cache_size = 32k
binlog_rows_query_log_events=on
binlog_row_image=FULL

port = 3306
server_id = 2
datadir = /data/test/data
socket = /data/test/data/mysql.sock
pid-file = /data/test/data/mysql.pid
tmpdir = /data/test/tmp
ssl-ca = /data/test/data/ca.pem
ssl-cert = /data/test/data/server-cert.pem
ssl-key = /data/test/data/server-key.pem
log-error = /data/test/logs/error.log
general_log_file = /data/test/logs/general.log
slow_query_log_file = /data/test/logs/slow.log

#####
slave-parallel-type=LOGICAL_CLOCK
slave-parallel-workers=8
master_info_repository=TABLE
relay_log_info_repository=TABLE
relay_log_recovery=ON
slave_preserve_commit_order = ON

How to repeat:
make a mts replicaiton mysql ,then run  sysbench
sysbench --test=/root/benchSQL/oltp_legacy/oltp.lua --oltp_tables_count=10 --oltp-table-size=200000 --mysql-user=root   --mysql-socket=/data/test/data/mysql.sock --threads=128 --time=120 --events=0 --oltp-test-mode=complex run
[10 Apr 2018 3:51] pc leon
during the whole test , journalctl has no any exceptional logs
[4 May 2018 2:13] MySQL Verification Team
Hello pc leon,

Thank you for the report.
I've tried to reproduce it on many times on CentOS7/OL7 but not seeing any issues i.e "Waiting for dependent transaction to commit" appears for a fraction of seconds and disappears. Tried to stop/start slave in between but nothing appears to be hanging during my tests.
If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

Thanks,
Umesh
[4 May 2018 2:13] MySQL Verification Team
Also, Bug #85560
[14 Nov 2018 9:55] kfpanda kf
It is found that this problem is related to setting up slave_preserve_commit_order=ON on the slave.
[30 Nov 2018 8:55] Khandaker Hasnat
I am using  Percona Server 5.7.23-25 and getting this error, any idea to quick seloving this problem.
@kfpanda kf what do you mean by, "It is found that there is a relationship between slave_preserve_commit_order and ON on the repository", should i change the variable to OFF or ON? In my case slave_preserve_commit_order=OFF, if you mean keep this variable ON is encounter this issues then its not works for me. do you have any more suggestion to resolve this?
any help will be appreciated :-)
[6 Aug 2021 13:44] John Nicholls
Might this be related to https://bugs.mysql.com/bug.php?id=87796 ?
(fixed in 5.7.33)