Bug #77187 START SLAVE no error if sql thread running + GLOBAL SQL_SLAVE_SKIP_COUNTER > 1
Submitted: 29 May 2015 6:37 Modified: 29 Jun 2015 14:53
Reporter: Simon Mudd (OCA) Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.7rc, 5.7.8 OS:Any
Assigned to: CPU Architecture:Any
Tags: 5.7, replication, sql_slave_skip_counter, start slave

[29 May 2015 6:37] Simon Mudd
Description:
I use this idiom frequently to skip broken statements until I get an error because the SQL thread is running:

SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;

On MySQL 5.6 when it's running I get this error:

root@mysql-5-6-24 [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first
Query OK, 0 rows affected, 1 warning (0.00 sec)

In 5.7.7 I notice that the error has been changed into a warning and MySQL warnings are somewhat harder to pick up than straight errors.

How to repeat:
On 5.6 you see:

root@mysql-5-6-24 [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                  Master_User: xxxxxxxx
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.012144
          Read_Master_Log_Pos: 15285143
               Relay_Log_File: relaylog.019194
                Relay_Log_Pos: 15285303
        Relay_Master_Log_File: binlog.012144
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 15285143
              Relay_Log_Space: 15510785
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 999999999
                  Master_UUID: 02956150-acc7-11e4-aed7-aaaaaaaaaaaa
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)

root@mysql-5-6-24 [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first
Query OK, 0 rows affected, 1 warning (0.00 sec)

root@mysql-5-6-24 [(none)]>

On MySQL 5.7.7 you see:

root@mysql-5-7-7rc [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                  Master_User: xxxxxxxx
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.006977
          Read_Master_Log_Pos: 52088503
               Relay_Log_File: relaylog.017605
                Relay_Log_Pos: 1350939
        Relay_Master_Log_File: binlog.006829
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1350745
              Relay_Log_Space: 15351264460
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 197201005
                  Master_UUID: d15191af-8a3f-11e4-8db1-xxxxxxxxxxxx
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
1 row in set (0.00 sec)

root@mysql-5-7-7rc [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.00 sec)

root@mysql-5-7-7rc [(none)]>

+-------+------+------------------------------------------------------------+
| Level | Code | Message                                                    |
+-------+------+------------------------------------------------------------+
| Note  | 3083 | Replication thread(s) for channel '' are already runnning. |
+-------+------+------------------------------------------------------------+
1 row in set (0.00 sec)

root@mysql-5-7-7rc [(none)]>

Note: no error generated , just a warning.

Suggested fix:
Please revert the behaviour so that an error is generated rather than a warning under such circumstances.  This will break many scripts that expect to stop when a warning is encountered.
[29 May 2015 6:56] Simon Mudd
To rephrase what I said:

This may break scripts which expect to get an error if the SQL thread is running if you try to start it.  Now you get a warning which is not the same so you need to change the script's behaviour and 5.6 and 5.7 behaviour has changed.
[29 May 2015 7:26] Simon Mudd
So to add some thoughts after talking with colleagues about this.

If you have 1 channel: then current 5.6 behaviour makes sense. Try to maintain it.
If you have > 1 replication channel then running start slave with GLOBAL SQL_SLAVE_SKIP_COUNTER > 1 is ambiguous, so probably should be disallowed.  Probably you should be forced to specify the channel you wish to skip statements on.

[ Even if you skip a statement on channels which are stopped, if you have more than 1 stopped channel, unless the behaviour is very clearly defined, this is like jumping into a minefield as it may not be clear which channel will be used. Basically don't allow that. ]

I had a brief look at the 5.7 documentation regarding slave start and didn't notice any special comments regarding this type of scenario.  So one on "error handling when multiple channels are in use" ... might be handy.

Normally you don't want to skip statements, but sometimes this is necessary if for example replication delay is more important than slave consistency.  So a section on what to do if replication breaks would be helpful for people who don't see this much and then think: "what am I going to do now?"

Given the above I still think the new 5.7 behaviour is wrong, at least if a single channel is involved and documentation should be as clear as possible about this type of situation.
[29 May 2015 7:36] MySQL Verification Team
Hello Simon,

Thank you for the report.
I'm not sure whether this is intended behavior, but 5.6 generate error where as 5.7 just warns and which contradicts manual quote - "This statement is valid only when the slave threads are not running. Otherwise, it produces an error." https://dev.mysql.com/doc/refman/5.7/en/set-global-sql-slave-skip-counter.html

Thanks,
Umesh
[29 May 2015 7:37] MySQL Verification Team
/ 5.6.26 (Setup simple master-slave replication)

// On running Slave

slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first
Query OK, 0 rows affected, 1 warning (0.00 sec)

slave> show warnings;
+-------+------+--------------------------+
| Level | Code | Message                  |
+-------+------+--------------------------+
| Note  | 1254 | Slave is already running |
+-------+------+--------------------------+
1 row in set (0.00 sec)

// 5.7.8 (Setup simple master-slave replication)

// On running slave 

slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.00 sec)

slave> show warnings;
+-------+------+------------------------------------------------------------+
| Level | Code | Message                                                    |
+-------+------+------------------------------------------------------------+
| Note  | 3083 | Replication thread(s) for channel '' are already runnning. |
+-------+------+------------------------------------------------------------+
1 row in set (0.01 sec)
[29 Jun 2015 14:12] Sven Sandberg
Posted by developer:
 
Thank you for highlighting this issue. However, this is expected behavior. We changed it in 5.7 so that:
- sql_slave_skip_counter behaves like most other replication slave variables.
- sql_slave_skip_counter works with multi-source replication.

The logic is now described in https://dev.mysql.com/doc/refman/5.7/en/replication-options-slave.html#sysvar_sql_slave_sk... :
"Setting the option has no immediate effect. The variable applies to the next START SLAVE command; the next START SLAVE command will also change the value back to 0. When this variable is set to a non-zero value and there are multiple replication channels configured, the START SLAVE command can only be used with the FOR CHANNEL channel clause."

We are sorry this broke your script. However, you can fix it as follows: instead of checking if SET GLOBAL SQL_SLAVE_SKIP_COUNTER returns an error, check if START SLAVE returns a warning.
[29 Jun 2015 14:53] Simon Mudd
Don't worry about breaking my script. That wasn't really the point.  It was the change in behaviour in a major version release without clear documentation indicating the behavioural change and what the DBA could expect from the MySQL 5.7+ server.

You have addressed both those issues so I'll go and look at adjusting my "convenient 1-liner script" to be a bit smarter and both 5.6 and 5.7+ aware.