Bug #113236 conflicting replica_parallel_workers spec causes parallel_mode misconfig
Submitted: 27 Nov 2023 17:22 Modified: 28 Nov 2023 20:54
Reporter: Josh Wieder Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Row Based Replication ( RBR ) Severity:S3 (Non-critical)
Version:8.0.35-0ubuntu0.22.04.1 OS:Ubuntu (22.04.3 LTS 5.15.0-88-generic)
Assigned to: CPU Architecture:x86 (.)
Tags: parallel replication, replication

[27 Nov 2023 17:22] Josh Wieder
Description:
MySQL documentation related to replica_parallel_workers is conflicting. Doc https://dev.mysql.com/doc/refman/8.0/en/replication-threads.html states that a v8 MySQL multithreaded replica is "where replica_parallel_workers or slave_parallel_workers is greater than 0", with no clarification provided for minor version behavior changes.

However, https://dev.mysql.com/doc/refman/8.0/en/replication-options-replica.html#sysvar_replica_pa... tells a different story for those using the most recent version of MySQL: 

```
replica_parallel_workers enables multithreading on the replica and sets the number of applier threads for executing replication transactions in parallel. When the value is a number greater than 1, the replica is a multithreaded replica with the specified number of applier threads, plus a coordinator thread to manage them. If you are using multiple replication channels, each channel has this number of threads.

Before MySQL 8.0.27, the default for this system variable is 0, so replicas are single-threaded by default. Beginning with MySQL 8.0.27, the default is 4, so replicas are multithreaded by default.

As of MySQL 8.0.30, setting this variable to 0 is deprecated, raises a warning, and is subject to removal in a future MySQL release. For a single worker, set replica_parallel_workers to 1 instead.
```

In an environment where the replica is running v8.0.35 & the primary is running v8.0.26, and the replica is set to replica_parallel_workers=1, the replica will misidentify itself as running "parallel" replication and will generate replication errors for anonymous transactions, for example:

```
Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Anonymous_Gtid, relay-log name ./XXXXX-relay-bin.XXXXX, position XXXX which prevents execution of this event group in parallel mode. Reason: The source event is logically timestamped incorrectly
```
Here is status output from the replica:

```
show replica status\G
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: XXXXXXXXXX
                  Source_User: XXXXXXXXXX
                  Source_Port: XXXXXXXXXX
                Connect_Retry: 60
              Source_Log_File: mysql-bin-changelog.XXXXXXXXXX
          Read_Source_Log_Pos: 2548XXXXXXXXXX
               Relay_Log_File: XXXXXXXXXX-relay-bin.XXXXXXXXXX
                Relay_Log_Pos: 30282XXXXXXXXXX
        Relay_Source_Log_File: mysql-bin-changelog.XXXXXXXXXX
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1755
                   Last_Error: Cannot execute the current event group in the parallel mode. Encountered event Anonymous_Gtid, relay-log name ./XXXXXXXXXX-relay-bin.XXXXXXXXXX, position XXXXXXXXXX which prevents execution of this event group in parallel mode. Reason: The source event is logically timestamped incorrectly..
                 Skip_Counter: 0
          Exec_Source_Log_Pos: XXXXXXXXXX
              Relay_Log_Space: 138422641521
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1755
               Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Anonymous_Gtid, relay-log name ./XXXXXXXXXX-relay-bin.XXXXXXXXXX, position XXXXXXXXXX which prevents execution of this event group in parallel mode. Reason: The source event is logically timestamped incorrectly..
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 102118XXXXXXXXXX
                  Source_UUID: 649bb58c-XXXXXXXXXX-XXXXXXXXXX-XXXXXXXXXX-XXXXXXXXXX
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 231124 06:51:02
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 0
            Network_Namespace:
```
Here's the relay log:

```
# at 279
#231124  6:37:45 server id 1021180011  end_log_pos 3028211525 CRC32 0x671b2b36  Anonymous_GTID  last_committed=0        sequence_number=1       rbr_only=yes    original_committed_timestamp=1700807865819811      immediate_commit_timestamp=1700807865819811     transaction_length=72714
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1700807865819811 (2023-11-24 06:37:45.819811 UTC)
# immediate_commit_timestamp=1700807865819811 (2023-11-24 06:37:45.819811 UTC)
/*!80001 SET @@session.original_commit_timestamp=1700807865819811*//*!*/;
/*!80014 SET @@session.original_server_version=80026*//*!*/;
/*!80014 SET @@session.immediate_server_version=80026*//*!*/;
SET @@SESSION.GTID_NEXT= 'ANONYMOUS'/*!*/;
```

Theres a couple of things going on here. For one, note the sequence_number=1. This is not the first anonymous transaction received by the replica. I'm not entirely clear on the sequence behavior governance. However, this replica should be running with a single worker, not in parallel. The sequence number shouldn't matter, but perhaps I'm not understanding that correctly (?).

I am able to repair the situation using this sequence of commands:

```
SET @@GLOBAL.replica_parallel_workers = 0; 
START REPLICA UNTIL SQL_AFTER_MTS_GAPS; 
START REPLICA SQL_THREAD;
```
                  
However, I cannot simply edit mysqld.cnf & set replica_parallel_workers = 0, because the default behavior for v8.0.37 is to raise a WARNING for this and quietly change the value from 0 to 4. According to the relevant documentation, replica_parallel_workers = 0 isn't correct. There seems to be something within MySQL that is still identifying this replica as a single worker event when replica_parallel_workers = 1.

How to repeat:
step 1. create a mysql primary where gtid_mode is set to OFF_PERMISSIVE & binlog is enabled. mysql version v8.0.26.

step 2. create a mysql replica with the following settings; all other settings default for mysql v8.0.35. 
[mysqld]
ssl-ca                  		= /usr1/certs/server.cas
ssl-cert               		 	= /usr1/certs/server.crt
ssl-key                 		= /usr1/certs/server.key
user                    		= mysql
pid-file                		= /var/run/mysqld/mysqld.pid
socket                  		= /var/run/mysqld/mysqld.sock
datadir                 		= /usr1/mysql
gtid_mode               		= OFF
innodb_flush_log_at_trx_commit          = 1
sync_binlog                             = 1
read_only               		= ON
tmpdir                  		= /usr1/tmp
bind-address            		= 0.0.0.0
mysqlx-bind-address     		= 0.0.0.0
innodb_file_per_table    		= ON
key_buffer_size          		= 16M
innodb_buffer_pool_size  		= 48G
max_allowed_packet       		= 512M
innodb_log_file_size     		= 2G
innodb_log_buffer_size   		= 1G
innodb_stats_on_metadata 		= OFF
back_log                		= 512
innodb_io_capacity_max  		= 10000
wait_timeout            		= 172800
max_connections         		= 1000
max_connect_errors      		= 1000
connect_timeout         		= 180
replica_net_timeout     		= 180
net_read_timeout        		= 600
net_write_timeout       		= 600
thread_stack            		= 256K
sql_mode                		= ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
event_scheduler         		= OFF
log_error                       	= /var/log/mysql/error.log
log_error_verbosity             	= 3
server-id                       	= 5555
binlog_format                   	= ROW
binlog_expire_logs_seconds      	= 604800
performance_schema              	= OFF
replica_parallel_workers        	= 1
relay_log_recovery              	= ON

step 3. start replication:

CHANGE REPLICATION SOURCE TO \
  SOURCE_HOST='XXXX', \
  SOURCE_USER='XXXX', SOURCE_PASSWORD='XXXX', \
  SOURCE_LOG_FILE='mysql-bin-changelog.XXXXXX', SOURCE_LOG_POS=XXXXXXX;
START REPLICATION;

Step 4. generate an anonymous_gtid event with an out-of-order sequence_number. I'm not sure how to do this or why our local env does this on occasion. this will break replication.

Suggested fix:
https://dev.mysql.com/doc/refman/8.0/en/replication-threads.html should be updated ASAP to make it clear that there are possibly breaking changes in the minor 8.0.x versions for replica_parallel_workers.

Its possible this could be resolved with matching minor versions for both source & replica, but there are reasonable circumstances where minor versions might be at least temporarily different. Ideally, DBAs would have a clear course of action for avoiding replication failure in those circumstances.

But I'm not sure why the primary's version or parallelism should be relevant. mysql replicas should be able to properly recognize their own parallel state, right?
[28 Nov 2023 20:54] MySQL Verification Team
Hi,

Thank you for your report and all the deails.
[29 Nov 2023 14:46] Sven Sandberg
Posted by developer:
 
Thank you for a clear bug report.

I see two doc issues and two server bugs here. I have filed 2 new doc bugs and 1 new server bug (see below). So here we should focus on the root issue; for comments on the other issues please use to the other bugs.

Doc bug #1:

> "where replica_parallel_workers or slave_parallel_workers is greater than 0", with no clarification provided for minor version behavior changes.

By "minor version behavior changes", do you refer to the parentheses that suggest using different names prior to 8.0.26 vs in 8.0.26 and later? There is no behavior change at all, only a terminology change; we have replaced the word "slave" by "replica". Maybe this sentence can be rephrased so that it becomes clearer that this is only a change in name and not a behavior change.

There were several other issues with this page. Reported the following bug to track that:
BUG#113283/BUG#36058322: Clarify section "Replication threads" in the manual

Doc bug #2:

> However, https://dev.mysql.com/doc/refman/8.0/en/replication-options-replica.html#sysvar_replica_pa... tells a different story for those using the most recent version of MySQL

Just to be explicit, I assume you mean the inconsistency is the definition of "multithreaded replica":

- Section "Replication Threads" defines multithreaded replica as one having replica_parallel_workers > 0.
- Section "replica_parallel_workers" defines multithreaded replica as one having replica_parallel_workers > 1.

Yes, clearly confusing. I checked the page and there are also multiple other issues. Filed a new bug to track this:
BUG#113282/BUG#36058217: Clarify section about replica_parallel_workers in the manual

Server bug #1 (source side)

It seems that the transaction has a position > 3 GiB in the source binlog, but its sequence number is 1. The sequence_number is supposed to start at 1 in the beginning of each file and then increase by 1 for each transaction, so this seems wrong. To analyze this, we would need to know a bit more about what happened before the source server generated the wrong logical timestamps. Do you still have the binary logs on the source and the relay logs on the replica? If possible, it may help if you could provide the output from mysqlbinlog for the transaction for which the anonymous_gtid_log_event has end_log_pos=3028211525, plus 2 transactions before that and 2 transactions after that.

Server bug #2 (replica side)

That check seems unnecessary when replica_parallel_workers=1. I think we should relax the check accordingly, so that this issue (and any other issues with logical timestamps) can be worked around by setting replica_parallel_workers=1.

Reported BUG#113284/BUG#36058442: Allow inconsistent logical timestamps when replica_parallel_workers=1

So, let's use the current bug to track Server bug #1 only. This is the root problem and may require a bit of interaction to analyze.