MySQL Bugs: #92252: Mysql generates gtid gaps because of option slave-skip-errors

Bug #92252	Mysql generates gtid gaps because of option slave-skip-errors
Submitted:	31 Aug 2018 1:14	Modified:	3 Sep 2018 12:31
Reporter:	Pin Lin	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.23	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
My databases consist of a master and a slave instance.Slave runs with option slave-skip-errors=1007,1008.

When slave applies relaylog transactions and generates errors specified by slave-skip-errors, gtid_executed produce gaps such as 1-3,5-10,12-20.

If binlog files on master including transactions which were applied on slave and hit errors in slave-skip-errors were purged, slave can't reconnect to master and reports error 1236 as bellow.

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

How to repeat:
my.cnf in slave includes option slave-skip-errors as bellow.
[mysqld]
slave-skip-errors = 1007,1008

then,change to master with MASTER_AUTO_POSITION = 1. Ensure that master and slave run well.

####################################################
step1,
login on master and execute command as follower:
####################################################
create database db1;
use db1;
create table t1(id int);
insert into t1 values(1);

now master's executed_gtid: uuid1:1-3

####################################################
step2,
login on slave and execute command as follower:
####################################################
create database db2;

now slave's executed_gtid: uuid1:1-3,uuid2:1

####################################################
step3,
login on master and execute command as follower:
####################################################
create database db2;   -- applied on salve, hit error 1007
insert into db1.t1 values(2);
insert into db1.t1 values(3);

now master's executed_gtid: uuid1:1-6

####################################################
step4,
login on slave and execute command as follower:
####################################################
show slave status\G

display as follower:
Retrieved_Gtid_Set:uuid1:1-5
Executed_Gtid_Set: uuid1:1-3:5-6,uuid1:1

now, executed_gtid on slave generates a gap.
next step used to show that slave fail to connect to master. 

####################################################
step4,
login on slave and execute command as follower:
####################################################
stop slave;
reset slave;
start slave;

Error log of slave displayed error message as bellow and failed to connect to master.
2018-08-30T09:41:01.562742Z 34 [ERROR] Error reading packet from server for channel '': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)

Suggested fix:
write the gtid of transaction which was applied on slave and hit errors specified by slave-skip-errors to executed_gtid.

Hello Pin Lin,

Thank you for the report and steps.
I followed your exact steps to reproduce the issue at my end on 5.7.23 but not seeing any issues. Could you please provide conf details from both master/slave to reproduce this at our end? Also, this looks like duplicate of Bug #86641, please see Bug #86641. 

I'll be shortly joining the 5.7.23 results from my environment.

regards,
Umesh

5.7.23 - test results

Attachment: 92252_5.7.23.results (application/octet-stream, text), 14.63 KiB.

hello, Umesh,
You have repeated the bug I described above.

this is your results as bellow, Executed_Gtid_Set has a gap.

Retrieved_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:4
Executed_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-3:5-6,
				   f2c66f8f-ad1d-11e8-80cf-0010e05f3e06:1

if you execute the last step, error will be produced.
the step used to restart slave IO/SQL thread, not to restart entire slave mysqld process。

####################################################
step5,
login on slave and execute command as follower:
####################################################
stop slave;
reset slave;
start slave;

I'm sorry. I made a mistake for the last step id. It is step5,not step4.

Hello Pin Lin,

Please note that I've followed exactly as you listed in step 5 and no errors seen afterwards.
You mean STOP SLAVE IO_THREAD instead of STOp SLAVE? Please confirm.

Extract from the earlier results file:

Slave>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3333
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 1379
               Relay_Log_File: hod03-relay-bin.000002
                Relay_Log_Pos: 1594
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]
           Retrieved_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-6
            Executed_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-3:5-6,
f2c66f8f-ad1d-11e8-80cf-0010e05f3e06:1
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

Slave>stop slave;
Query OK, 0 rows affected (0.00 sec)

Slave>reset slave;
Query OK, 0 rows affected (0.00 sec)

Slave>start slave;
Query OK, 0 rows affected (0.01 sec)

Slave>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3333
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 1379
               Relay_Log_File: hod03-relay-bin.000003
                Relay_Log_Pos: 621
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:4
            Executed_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-3:5-6,
f2c66f8f-ad1d-11e8-80cf-0010e05f3e06:1
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

Thanks,
Umesh

Hi Pin Lin,

Could you please confirm on this? Reported issue is observed only when (i) gtid gaps on slave i.e after Error_code: 1007 AND (ii) bin log(s) are purged on master i.e including those transactions which are already applied on slave.  Below issue is not seen if there is no gtid gap on slave.

-- slave

Slave>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3333
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 1379
               Relay_Log_File: hod03-relay-bin.000002
                Relay_Log_Pos: 1594
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]
           Retrieved_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-6
            Executed_Gtid_Set: ed43e92f-ad1d-11e8-bd01-0010e05f3e06:1-3:5-6,
f2c66f8f-ad1d-11e8-80cf-0010e05f3e06:1
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

-- on master

mysql> flush logs;
Query OK, 0 rows affected (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000001 |      1427 |
| master-bin.000002 |       194 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> purge binary logs to 'master-bin.000002';
Query OK, 0 rows affected (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000002 |       194 |
+-------------------+-----------+
1 row in set (0.00 sec)

-- on slave

Slave>stop slave;
Query OK, 0 rows affected (0.00 sec)

Slave>reset slave;
Query OK, 0 rows affected (0.00 sec)

Slave>start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3333
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: hod03-relay-bin.000002
[..]
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
[...]
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 1b66e02d-af5e-11e8-a0b5-0010e05f3e06:1-2,
606f2da2-af5e-11e8-a9ed-0010e05f3e06:1-3:5-6

regards,
Umesh

oh!! yes, it is.
You are so clever and professional.
sorry for missing step "flush logs" and wasting you so much time to repeat it.

Thank you for confirming.

regards,
Umesh

Hello there,

I've been trying to reproduce this problem with the errors 1032 and 1062 and could not do it...

I don't see gaps in the gtid execution set.