Bug #109133 | Parallel replication causes massive delays | ||
---|---|---|---|
Submitted: | 18 Nov 2022 3:45 | Modified: | 27 Dec 2022 9:46 |
Reporter: | chengqing hu | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | MySQL8.0.28 | OS: | CentOS (7.5) |
Assigned to: | CPU Architecture: | x86 |
[18 Nov 2022 3:45]
chengqing hu
[18 Nov 2022 15:22]
MySQL Verification Team
Hi, > without any index This is not a bug. Please add primary key to the table. Thank you for your interest in MySQL.
[21 Nov 2022 2:12]
chengqing hu
1. This is just a small table 2. Turn off parallel replication without any delay Seems to be related to parallel replication, not primary key
[22 Nov 2022 10:17]
Max Irgiznov
Hello. We have similar behavior on a database with intensive writing. Replication works fine with a single thread, with a multithreaded type, a gradual lag of the replica begins. All tables have PK. Additionally, I want to note that in the multi-threaded replication code, for each rpl_rca_wkr, the thread number is always zero, which makes debugging between threads somewhat difficult. I attach flamegraphs for analysis.
[22 Nov 2022 10:17]
Max Irgiznov
Single thread replica
Attachment: flamegraph_mth.svg (image/svg+xml, text), 954.56 KiB.
[22 Nov 2022 10:18]
Max Irgiznov
Single thread replica
Attachment: flamegraph_single.svg (image/svg+xml, text), 1.10 MiB.
[22 Nov 2022 10:20]
Max Irgiznov
pt-pmp profilig on mth replica
Attachment: pt-pmp.svg (image/svg+xml, text), 105.96 KiB.
[22 Nov 2022 10:25]
MySQL Verification Team
Hi Max, What version of MySQL Server are you using?
[22 Nov 2022 11:48]
MySQL Verification Team
Hi Max, Additionally, this "no index" situation I can reproduce but "all tables have PK" situation I cannot reproduce. I see your measurements but I need a way to reproduce the problem, assuming you can reproduce this with 8.0.31 Thanks
[22 Nov 2022 16:07]
Max Irgiznov
Now we use 8.0.30. Upgrade to 31 is not possible now
[22 Nov 2022 17:59]
MySQL Verification Team
Hi Max, I do not think there is anything between .30 and .31 wrt this so .30 is ok, but I cannot reproduce this. Can you provide a reproducible test case kind regards
[23 Nov 2022 11:00]
Max Irgiznov
I tried to reproduce using https://github.com/Percona-Lab/sysbench-tpcc hw: 28 vcpu, 167G ram, nvme raid10, source on host, replica in docker. 0. change mysql vars in contaner: set global sync_binlog=1024; set global innodb_buffer_pool_size=1342177280*4; set global innodb_flush_log_at_trx_commit=2; set global innodb_io_capacity_max = 400000; set global innodb_io_capacity = 200000; set global innodb_io_capacity_max = 400000; set global innodb_io_capacity = 200000; 1. ./tpcc.lua --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=test --mysql-password=еуые --mysql-db=test --time=300 --threads=4 --report-interval=1 --tables=1 --scale=10 --db-driver=mysql prepare 2. setup replica from host to container, do: stop replica; set global replica_parallel_workers=0; start replica; 3. the next step is to determine the number of threads with which the replica will not lag behind in single-threaded mode on your hardware, in my case it is 5 threads. ./tpcc.lua --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-user=test --mysql-password=tezt --mysql-db=test --time=600 --threads=5 --report-interval=1 --tables=1 --scale=10 --db-driver=mysql run run the script and wait for a minute and make sure that the replica is not more than 1 second behind the source. 4. change threads num: stop replica; set global replica_parallel_workers=8; start replica; now wait a minute and see that replication is starting to lag a little, there will be a picture in flamegraph as I showed above. If return the value replica_parallel_workers=0, then the replica will start catching up with a source
[24 Nov 2022 12:18]
MySQL Verification Team
Hi Max, Chengqing, I discussed this with replication team as I'm having issues reproducing this. Even the original report with key-less table should not happen. The flame graphs are interesting, but would be useful to get a bit more details, and do it slightly more controlled: - If possible, use just START REPLICA IO_THREAD while the workload is executing on the source. Once everything is received on the replica, run START REPLICA SQL_THREAD, and measure the time to catch up. - Run the same test in four cases: (1) multi-threaded with index, (2) multi-threaded without index, (3) single-threaded with index, (4) single-threaded without index. So the result of this would be four flame graphs and four measured time durations. That may not solve the problem, but give some idea about the area of the code that is to blame, which may direct future testing. Thanks
[25 Nov 2022 9:59]
Max Irgiznov
Hello! I ran the recommended tests.
Attachment: flamegraphs.zip (application/x-zip-compressed, text), 367.58 KiB.
[23 Dec 2022 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[27 Dec 2022 9:40]
Max Irgiznov
Hi Team. Do you have any updates?
[27 Dec 2022 9:46]
MySQL Verification Team
Hi, No updates yet, the bug is now with replication team. Thanks