Bug #97531 5.7 replication breakage with syntax error with GRANT management
Submitted: 7 Nov 2019 11:22 Modified: 7 Jan 2020 15:53
Reporter: Simon Mudd (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.7.28, 8.0.18 OS:Any
Assigned to: CPU Architecture:Any
Tags: grants, replication, writeset

[7 Nov 2019 11:22] Simon Mudd
Description:
This seems related to bug#95612 but seen on 5.7 not 8.0.

Environment a development system where there are multiple users (5k) and multiple databases (100) and grants need to be rebuilt on a daily basis from a clean setup. The environment has a master and slave. Replication breaks on the slave when replicating, though the grants are pushed and accepted perfectly on the master.

Replication configuration (/etc/my.cnf) on the slave uses the following settings:

log-bin          = ../log/binlog
binlog_format    = ROW
binlog_row_image = minimal
transaction_write_set_extraction = XXHASH64
binlog_transaction_dependency_tracking = WRITESET
sync_binlog      = 1
expire_logs_days = 1
log_slave_updates
slave_preserve_commit_order = 1
slave_parallel_workers = 10
slave_parallel_type = LOGICAL_CLOCK

Grants are pushed by running grouped CREATE USER statements, followed by GRANT ... ON ... TO ....; for each database.

Errors seen:

root@host-01 [(none)]> select * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
         CHANNEL_NAME:
            WORKER_ID: 1
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:267233
    LAST_ERROR_NUMBER: 1396
   LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:267233' at master log binlog.000001, end_log_pos 85477918; Error 'Operation CREATE USER failed for 'user1'@'%','user2'@'%','user3'@'%','user4'@'%','user5'@'%','user6'@'%','user7'@'%','user8' on query. Default database: 'mysql'. Query: 'CREATE USER 'user1'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*0000000000000000000000000000000000000000','user2'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*0000000000000000000000000000000000000000','user3'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*0000000000000000000000000000000000000000','user4'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*0000000000000000000000000000000000000000','user5'@'%' I
 LAST_ERROR_TIMESTAMP: 2019-11-05 02:55:25

This was resolved by simply running: stop slave; start slave;

Another error seen (different host):

root@host-02 [(none)]> select * from performance_schema.replication_applier_status_by_worker order by worker_id desc limit 1\G
*************************** 1. row ***************************
         CHANNEL_NAME:
            WORKER_ID: 10
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1673845
    LAST_ERROR_NUMBER: 1064
   LAST_ERROR_MESSAGE: Worker 10 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1673845' at master log binlog.000066, end_log_pos 18247429; Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: ''. Query: 'GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, CREATE VIEW, SHOW VIEW ON `xxxxxxxxxx`'
 LAST_ERROR_TIMESTAMP: 2019-11-05 12:41:11
1 row in set (0.00 sec)

This is on the slave. The master runs the same version so it's extremely unlikely there's a syntax error. Perhaps the statement has got corrupted as several threads are accessing the grant memory structures at the same time?

How to repeat:
Symptoms are shown above. This is repeatable as we rebuild these environments frequently. Clearly the usernames / database names have been anonymised.

However, it looks to me as if the issue with WRITESET replication is causing a similar issue to that seen in 8.0 concurrent access to "grant information" perhaps with some missing locking etc?

Suggested fix:
A couple of things come to mind:
* make replication not fail as the statements worked on the master
* move the replication_applier_status_by_worker error directly to the show slave status error
[7 Nov 2019 16:27] MySQL Verification Team
Thanks for report Simon, let me try to make a test case.
[7 Nov 2019 17:34] Simon Mudd
Further investigation showed the reported syntax error on the slave was due to a corrupt statement in the relay logs. (GRANT statements are sent as SBR) and the statement in the relaylog was corrupt.  The upstream master's binlog had the same event (e.g. a corrupt statement) yet the statement had been written correctly on the master.

So I think for now the problem on the slave should be attributed to an issue on the master which I'm following up on other channels.
[7 Nov 2019 18:20] MySQL Verification Team
Hi Simon,

This was easy to reproduce on 8.0 but I'm having issues reproducing on 5.7

master [localhost:19428] {msandbox} ((none)) > show variables like 'binlog%';
+--------------------------------------------+--------------+
| Variable_name                              | Value        |
+--------------------------------------------+--------------+
| binlog_cache_size                          | 32768        |
| binlog_checksum                            | CRC32        |
| binlog_direct_non_transactional_updates    | OFF          |
| binlog_error_action                        | ABORT_SERVER |
| binlog_format                              | ROW          |
| binlog_group_commit_sync_delay             | 0            |
| binlog_group_commit_sync_no_delay_count    | 0            |
| binlog_gtid_simple_recovery                | ON           |
| binlog_max_flush_queue_time                | 0            |
| binlog_order_commits                       | ON           |
| binlog_row_image                           | MINIMAL      |
| binlog_rows_query_log_events               | OFF          |
| binlog_stmt_cache_size                     | 32768        |
| binlog_transaction_dependency_history_size | 25000        |
| binlog_transaction_dependency_tracking     | WRITESET     |
+--------------------------------------------+--------------+
15 rows in set (0.02 sec)

master [localhost:19428] {msandbox} ((none)) > master [localhost:19428] {msandbox} ((none)) > show variables like 'gtid%';
+----------------------------------+-----------+
| Variable_name                    | Value     |
+----------------------------------+-----------+
| gtid_executed_compression_period | 1000      |
| gtid_mode                        | ON        |
| gtid_next                        | AUTOMATIC |
| gtid_owned                       |           |
| gtid_purged                      |           |
+----------------------------------+-----------+
5 rows in set (0.00 sec)

$cat bug.sh
#!/bin/bash

for (( c=1; c<=5000; c++ ))
do
    /home/arhimed/sandboxes/rsandbox_5_7_27/m -uroot -e "DROP USER IF EXISTS 'bug$c'@'localhost';CREATE USER IF NOT EXISTS 'bug$c'@'localhost';ALTER USER 'bug$c'@'localhost' IDENTIFIED BY 'Pa$$123';GRANT ALL ON bug$c.* TO 'bug$c'@'localhost';\
grant all on db1.* to 'bug$c'@'localhost';\
grant all on db2.* to 'bug$c'@'localhost';\
grant all on db3.* to 'bug$c'@'localhost';\
grant all on db4.* to 'bug$c'@'localhost';\
grant all on db5.* to 'bug$c'@'localhost';\
grant all on db6.* to 'bug$c'@'localhost';\
grant all on db7.* to 'bug$c'@'localhost';\
grant all on db8.* to 'bug$c'@'localhost';\
grant all on db9.* to 'bug$c'@'localhost';\
grant all on db10.* to 'bug$c'@'localhost';\
"
done

$

I run the script a few dozen times and the replication is not failing.
Anything you might add to the test case to make it look more like what you are seeing?

thanks
bogdan
[7 Nov 2019 18:28] Simon Mudd
Easy to reproduce in 8.0? Oh. I thought 8.0 was free from this.
[7 Nov 2019 18:32] Simon Mudd
Note: this bug report actually reports what appear to be 2 different issues. At the time I thought it was a single one.
* the SQL syntax error with GRANT statements [ due to relaylog corruption ]
* an error with a large CREATE USER statement which created multiple users in one go and was giving an error.

So not sure which issue you have been able to repeat.
[7 Nov 2019 19:13] MySQL Verification Team
Hi Simon,

The latest 8 is free from it but the 8.0.16 is not (fixed in 8.0.18).

As for 5.7 running the said script does not stop replication, will check the SR too for more data/config and discuss with Shane. Not sure if the config is different or the issue is with the script itself (if there needs to be more grants or bigger grants or..)

all best
Bogdan
[20 Nov 2019 17:21] MySQL Verification Team
8.0.18 is also affected
[21 Nov 2019 7:32] MySQL Verification Team
Here's proof of the binlog corruption from my test system:
-------------------------

--
Version: '5.7.28-log'  socket: '/home/anon/rpl/slave1/sock'  port: 3307  MySQL Community Server (GPL)
2019-11-20T14:43:14.907597Z 2 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=fc30-relay-bin' to avoid this problem.
2019-11-20T14:43:14.925701Z 2 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='127.0.0.1', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2019-11-20T14:43:14.965970Z 4 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-11-20T14:43:14.966570Z 5 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'FIRST' at position 0, relay log './fc30-relay-bin.000001' position: 4
2019-11-20T14:43:14.966819Z 4 [Note] Slave I/O thread for channel '': connected to master 'root@127.0.0.1:3306',replication started in log 'FIRST' at position 4
2019-11-20T14:55:38.631289Z 5 [ERROR] Slave SQL for channel '': Query caused different errors on master and slave. Error on master: message (format)='Operation %s failed for %.256s' error code=1396; Error on slave:actual message='You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1', error code=1064. Default database:'test'. Query:'CREATE USER 'user0005'@'%'', Error_code: 3002
2019-11-20T14:55:38.631309Z 5 [ERROR] Slave (additional info): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 Error_code: 1064
2019-11-20T14:55:38.631314Z 5 [Warning] Slave: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 Error_code: 1064
2019-11-20T14:55:38.631318Z 5 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 370941511.
--
             Master_Log_File: binlog.000001
          Read_Master_Log_Pos: 391045951
               Relay_Log_File: fc30-relay-bin.000002
                Relay_Log_Pos: 370941718
        Relay_Master_Log_File: binlog.000001

               Last_SQL_Errno: 3002
               Last_SQL_Error: Query caused different errors on master and slave. Error on master: message (format)='Operation %s failed for %.256s' error code=1396; Error on slave:actual message='You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1', error code=1064. Default database:'test'. Query:'CREATE USER 'user0005'@'%''
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
          Exec_Master_Log_Pos: 370941511
              Relay_Log_Space: 391046364
 
          Retrieved_Gtid_Set: 092a1971-0ba4-11ea-b3f4-000d6179f72b:1-1850823
            Executed_Gtid_Set: 092a1971-0ba4-11ea-b3f4-000d6179f72b:1-1754855
 

mysqlbinlog  --base64-output=decode-rows ~/rpl/master/data/binlog.000001 --include-gtids=092a1971-0ba4-11ea-b3f4-000d6179f72b:1754854-1754856 --hexdump -vvv |less
Looking at the binary log, notice the NULL byte here: "27 00 49"
   

#       GTID    last_committed=1754855  sequence_number=1754856 rbr_only=no
SET @@SESSION.GTID_NEXT= '092a1971-0ba4-11ea-b3f4-000d6179f72b:1754856'/*!*/;
# at 370941576
#191120 16:55:38 server id 1  end_log_pos 370941805 CRC32 0xd0c6483f 
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags 
# 161c1e88 ea 53 d5 5d   02   01 00 00 00   e5 00 00 00   6d 1f 1c 16   00 00
# 161c1e9b 0f 00 00 00 00 00 00 00  04 74 05 22 00 00 00 00 |.........t......|
# 161c1eab 00 00 01 20 00 a0 55 00  00 00 00 06 03 73 74 64 |......U......std|
# 161c1ebb 04 08 00 08 00 08 00 0c  01 6d 79 73 71 6c 00 74 |.........mysql.t|
# 161c1ecb 65 73 74 00 43 52 45 41  54 45 20 55 53 45 52 20 |est.CREATE.USER.|
# 161c1edb 27 75 73 65 72 30 30 30  35 27 40 27 25 27 00 49 |.user0005......I|                                                  <<<<<<<<<<<<<<<<<<<<<<
# 161c1eeb 44 45 4e 54 49 46 49 45  44 20 57 49 54 48 20 27 |DENTIFIED.WITH..|
# 161c1efb 6d 79 73 71 6c 5f 6e 61  74 69 76 65 5f 70 61 73 |mysql.native.pas|
# 161c1f0b 73 77 6f 72 64 27 2c 27  75 73 65 72 30 30 30 35 |sword...user0005|
# 161c1f1b 27 40 27 25 27 20 49 44  45 4e 54 49 46 49 45 44 |......IDENTIFIED|
# 161c1f2b 20 57 49 54 48 20 27 6d  79 73 71 6c 5f 6e 61 74 |.WITH..mysql.nat|
# 161c1f3b 69 76 65 5f 70 61 73 73  77 6f 72 64 27 20 52 45 |ive.password..RE|
# 161c1f4b 51 55 49 52 45 20 53 55  42 4a 45 43 54 20 27 3b |QUIRE.SUBJECT...|
# 161c1f5b 27 20 41 43 43 4f 55 4e  54 20 4c 4f 43 4b 3f 48 |..ACCOUNT.LOCK.H|
# 161c1f6b c6 d0                                            |..|
#       Query   thread_id=15    exec_time=0     error_code=1396
SET TIMESTAMP=1574261738/*!*/;
CREATE USER 'user0005'@'%'^@IDENTIFIED WITH 'mysql_native_password','user0005'@'%' IDENTIFIED WITH 'mysql_native_password' REQUIRE SUBJECT ';' ACCOUNT LOCK
/*!*/;

--------------------------------
[2 Dec 2019 9:13] Tatjana Nuernberg
Posted by developer:
 
As previously indicated, SHOW PROCESSLIST works differently
from performance schema (threads / events_statements tables),
so the S/P is where we might see the issue.

There is a mechanism in the server that aims to prevent plaintext
passwords from being displayed in logs, monitoring, and so on.

Some changes in the handling of CREATE/ALTER USER and GRANTs
have led to an unexpected change in the life-cycle of these
rewritten queries. As these strings could be inspected from
another thread (using SHOW PROCESSLIST), this could result in
SHOW PROCESSLIST obtaining information from a thread, and this
information subsequently being changed during the execution phase
(rather than just after the parsing phase), resulting in SHOW
PROCESSLIST using stale information.

The above patch therefore modifies the handling of users and
grants, and lays some groundwork for the server to be more
robust with regard to unexpected use of the rewrite framework.
[7 Jan 2020 15:53] Margaret Fisher
Posted by developer:
 
Changelog entry added for MySQL 8.0.20 and 5.7.30:

While an SQL statement was in the process of being rewritten for the binary log so that sensitive information did not appear in plain text, if a SHOW PROCESSLIST statement was used to inspect the query, the query could become corrupted when it was written to the binary log, causing replication to stop. The process of rewriting the query is now kept private, and the query thread is updated only when rewriting is complete.
[29 Jan 2020 21:27] Margaret Fisher
Posted by developer:
 
Already changelogged, reclosing.
[29 Feb 2020 17:35] Michael Finch
Hi Margaret. Has the fix for this bug been published? It has been a big pain for my team so we're wondering if we can cherry pick the fix into our build. Thanks.