Bug #106845 Parallel replication not working with slave_preserve_commit_order=1
Submitted: 26 Mar 13:23
Modified: 6 Apr 1:46
Reporter: Peter Parker
Status: Verified
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 8.0.25
OS: CentOS (7.6)
Assigned to:
CPU Architecture: ARM
Tags: arm, CentOS, MTS, replication

[26 Mar 13:23] Peter Parker
Description:
Parallel replication does not work with slave_preserve_commit_order=1: only one of the slave workers is active at any given time.

My environment:
[root@Malluma build]# uname -a
Linux Malluma 4.14.0-115.el7a.0.1.aarch64 #1 SMP Sun Nov 25 20:54:21 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
[root@Malluma build]# cat /etc/*-release
CentOS Linux release 7.6.1810 (AltArch)
NAME="CentOS Linux"
VERSION="7 (AltArch)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (AltArch)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.6.1810 (AltArch)
CentOS Linux release 7.6.1810 (AltArch)

How to repeat:
1. Take source code from https://github.com/mysql/mysql-server/tree/mysql-8.0.25

2. Run cmake:
cmake ${SOURCE_DIR} -DBUILD_CONFIG=mysql_release -DWITH_DEBUG=1

3. Compile:
make

4. Run the MySQL Test Suite:
./mtr --force --timestamp --max-test-fail=0 --suite-timeout=6000 --testcase-timeout=300 --big-test --debug-server --suite=rpl

A number of replication-related test cases fail in the debug build.
One of the failed cases:

220322 17:00:35 [  6%] rpl.rpl_mts_slave_preserve_commit_order_error_nobinlog 'mix' w3  [ fail ]
        Test ended at 2022-03-22 17:00:35

CURRENT_TEST: rpl.rpl_mts_slave_preserve_commit_order_error_nobinlog
mysqltest: At line 67: Timeout in wait_condition.inc for $wait_condition
In included file ./include/wait_condition.inc: 68
included from /data/mtr/pq_debug/mysql-test/suite/rpl/t/rpl_mts_slave_preserve_commit_order_error_nobinlog.test: 63

The result from queries just before the failure was:
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info

**** slave_relay_info on server_1 ****
SELECT * FROM mysql.slave_relay_log_info;
Number_of_lines	Relay_log_name	Relay_log_pos	Master_log_name	Master_log_pos	Sql_delay	Number_of_workers	Id	Channel_name	Privilege_checks_username	Privilege_checks_hostname	Require_row_format	Require_table_primary_key_check	Assign_gtids_to_anonymous_transactions_type	Assign_gtids_to_anonymous_transactions_value

Suggested fix:
The cause of the problem:

aligned_atomic.h:78
#elif defined(__linux__)
static inline size_t _cache_line_size() {
  return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
}

The sysconf() call may return 0, which prevents Commit_order_manager from pushing the worker id into the Commit_order_queue while another transaction is already being replayed.
rpl_slave_commit_order_manager.cc:67
this->m_workers.push(worker->id);
As a result, the SQL thread gets stuck and cannot assign the next job to another worker.

Suggested fix:

aligned_atomic.h:78
#elif defined(__linux__)
static inline size_t _cache_line_size() {
  long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (size == 0) return 64;  // added: fall back to 64 bytes when sysconf() reports 0
  return static_cast<size_t>(size);
}
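
A slightly more defensive variant of the same idea is sketched below. It is only a sketch, not the actual MySQL source: the helper names `sanitize_cache_line_size` and `cache_line_size` and the 64-byte default are assumptions made for illustration. Besides 0, it also covers the -1 that sysconf() returns on error, which would otherwise be converted to a very large size_t.

```cpp
#include <unistd.h>  // sysconf

#include <cstddef>

// Hypothetical helper (not part of MySQL): map a raw sysconf() result to a
// usable cache line size. sysconf() returns -1 on error, and this report
// shows it can also return 0 for this key on some systems.
static inline std::size_t sanitize_cache_line_size(long raw) {
  constexpr std::size_t kDefaultCacheLineSize = 64;  // assumed fallback value
  if (raw <= 0) return kDefaultCacheLineSize;
  return static_cast<std::size_t>(raw);
}

// Hypothetical wrapper: query the C library where the key exists, otherwise
// go straight to the fallback.
static inline std::size_t cache_line_size() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
  return sanitize_cache_line_size(sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
#else
  return sanitize_cache_line_size(0);
#endif
}
```

Keeping the clamping in a separate function that takes the raw value makes the 0 and -1 cases testable on hosts where sysconf() itself behaves correctly.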
[30 Mar 10:35] MySQL Verification Team
Hi Peter,

Is this reproducible for you only on ARM, or can you reproduce it on x86 too? So far I have not been able to reproduce it on x86. I am waiting for ARM hardware to become available for further testing, but wanted to check whether you have tried this on x86 as well.

thanks
[30 Mar 13:02] Peter Parker
Hello, 

I tried it on x86 and could not reproduce this issue. In addition, it appears to be related to the Linux version as well, since I could not reproduce it on a later version of Linux on ARM either.
[2 Apr 7:26] Peter Parker
Hello,
I found a similar bug report: https://bugs.mysql.com/bug.php?id=102926. The reporter also found that the sysconf() call may return 0, but his suggested fix did not handle a return value of 0. I think this is because he did not see any problems when it returned 0, although it does cause one.
[5 Apr 14:07] MySQL Verification Team
Hi,

Issues seen with the DEBUG version are not always real bugs, but in this case you might be on point. This also looks similar to Bug#106807, so a fix here might help there too.

Thanks for the report.
[6 Apr 1:46] Peter Parker
Hello,

This problem also exists in the release version, where it can be verified by setting up a parallel replication environment. I suggest first taking a look at the return value of sysconf(_SC_LEVEL1_DCACHE_LINESIZE).
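
One way to look at that value on a given host is sketched below. The helper names `raw_l1_dcache_line_size` and `describe_line_size` are hypothetical and not part of MySQL. The sketch only reports what the C library returns and does not touch the server.

```cpp
#include <unistd.h>  // sysconf

#include <string>

// Hypothetical diagnostic helpers, not part of MySQL.

// Returns the raw value reported by the C library, or -1 where the key is
// not available at compile time.
static inline long raw_l1_dcache_line_size() {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
  return sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#else
  return -1;
#endif
}

// Turns a raw value into a one-line, human-readable report.
static inline std::string describe_line_size(long raw) {
  if (raw > 0) return "L1 dcache line size: " + std::to_string(raw) + " bytes";
  if (raw == 0) return "sysconf() returned 0: line size unknown on this host";
  return "sysconf() returned -1: key unsupported or lookup failed";
}
```

Printing `describe_line_size(raw_l1_dcache_line_size())` from a small `main()` on the affected machine should show whether the zero case described in this report applies there.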