Bug #89395 Slave unable to start with err message - could not locate rotate events
Submitted: 25 Jan 2018 8:37 Modified: 28 May 2018 5:17
Reporter: Prasad N Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.18 OS:CentOS (Linux version 2.6.32-696.18.7.el6.x86_64 (mockbuild@c1bl.rdu2.centos.org))
Assigned to: CPU Architecture:Any

[25 Jan 2018 8:37] Prasad N
Description:
I have a master and 2 slave replica set. It is functioning in the semi-synchronous mode of replication. I have also enabled multi-threaded slave option for my slaves and there are 10 worker threads.
Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql_b). Now I bring up mysql-A again and it starts as a slave.
Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-b in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors:

[ERROR] Error during --relay-log-recovery: Could not locate rotate event from the master.
2018-01-24T19:26:47.996950Z 0 [ERROR] Failed to initialize the master info structure
2018-01-24T19:26:47.996968Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-01-24T19:26:47.996975Z 0 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

Other status info:

show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 172.19.0.20
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-log-slave.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
                  Master_UUID: d156fa9b-ff46-11e7-99bd-000d3afd6d09
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712,
d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:

mysql> show master status \G;
*************************** 1. row ***************************
             File: mysql-bin.000006
         Position: 234
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712,
d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029
1 row in set (0.00 sec)

mysql> select * from slave_master_info\G;
*************************** 1. row ***************************
       Number_of_lines: 25
       Master_log_name:
        Master_log_pos: 4
                  Host: 172.19.0.20
             User_name: repl_user
         User_password: 3cAe9NuTZl3HJjPr
                  Port: 3306
         Connect_retry: 60
           Enabled_ssl: 1
                Ssl_ca:
            Ssl_capath:
              Ssl_cert:
            Ssl_cipher:
               Ssl_key:
Ssl_verify_server_cert: 0
             Heartbeat: 30
                  Bind:
    Ignored_server_ids: 0
                  Uuid: d156fa9b-ff46-11e7-99bd-000d3afd6d09
           Retry_count: 86400
               Ssl_crl:
           Ssl_crlpath:
 Enabled_auto_position: 1
          Channel_name:
           Tls_version:
1 row in set (0.00 sec)

mysql> select * from slave_relay_log_info\G;
*************************** 1. row ***************************
  Number_of_lines: 7
   Relay_log_name: ./relay-log-slave.000001
    Relay_log_pos: 4
  Master_log_name:
   Master_log_pos: 0
        Sql_delay: 0
Number_of_workers: 10
               Id: 1
     Channel_name:
1 row in set (0.00 sec)

How to repeat:
Am using Innodb and replication is GTID based and autoposition enabled.
Replication is semi sync and MTS is enabled. Other relevant config settings on my server are:

sync_binlog=1
innodb_flush_log_at_trx_commit=1
gtid_mode=ON
enforce_gtid_consistency=ON
binlog_format=MIXED
log_slave_updates=true
slave_net_timeout=60
master_info_repository=TABLE
relay_log_info_repository=TABLE
sync_master_info=10000
sync_relay_log=10000
relay_log_recovery=1
slave_parallel_workers=10
slave_preserve_commit_order=1
slave_parallel_type=LOGICAL_CLOCK
rpl_semi_sync_master_timeout=300000

     
I have a master (A)and 2 slave(B and C) replica set running semi-sync replication.
My master is being loaded with simple integer INSERT operations by 100 threads doing inserts every 10 ms.

* I soft kill my master mysql (A)
* I promote slave -B as the new master.
* I start mysql(B) again as a slave. [ i use skip-slave-start option and once mysql starts, i use START SLAVE]
* in 30s, I do a restart of my server thats running mysql-A (shutdown -r)
* I start mysql A again [ i use skip-slave-start option and once mysql starts, i use START SLAVE]
* Now START SLAVE fails with above stated erros

Suggested fix:
Since I have crash safety settings enabled on slave, I expect slave to start gracefully.
[25 Jan 2018 8:42] Prasad N
Correcting and updating the description field: I could not edit directly in the original description field.

<<<Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql-B). Now I bring up mysql-A again and it starts as a slave.
Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-A in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors:>>>
[2 Feb 2018 15:27] MySQL Verification Team
Hi,

I made a similar (hopefully same) configuration and had no issues with it. I believe it is a configuration problem but can't be 100% sure. Can you please share configuration from all nodes and full error log so we can investigate further.

all best
Bogdan
[5 Feb 2018 5:07] Prasad N
mysqld logs

Attachment: mysqld.log (application/octet-stream, text), 53.73 KiB.

[5 Feb 2018 5:11] Prasad N
mysql configuration file

Attachment: my.cnf (application/octet-stream, text), 1.57 KiB.

[5 Feb 2018 5:11] Prasad N
Hi

I have added the complete error log fine and also the my.cnf file that I am using on all nodes.

Thanks!
[27 Feb 2018 13:15] MySQL Verification Team
Hi,

Looks like it has to have some traffic running while killing and switching master/slave in multiple databases for this to be reproduced but I managed to reproduce it so bug is verified.

all best
Bogdan
[28 Feb 2018 6:09] Prasad N
Hi Bogdan. Thank you very much. Please let me know if you have any guidance on the fix for this.

Thanks
Prasad
[5 Mar 2018 9:23] Prasad N
Hello - Is there a timeframe for the fix ?
Also - could you provide steps or workaround to recover from this condition?

Thanks
Prasad
[28 May 2018 5:17] Prasad N
We are regularly hitting this issue. Could you share a plan to address this issue ?
At least a verified workaround should be given. Otherwise, this should be treated as S1 bug.
[8 Sep 2018 15:38] J-F Logica Gagné
This bug was brought to my attention on Twitter:
https://twitter.com/terrorobe/status/1037984288661270528

I am trying to reproduce, but so far, my efforts failed.  I am using the script below on a GCP vm with 4 vCPU, if someone is able to reproduce with this or with a variation of this, please leave a comment or contribute on Twitter so I can try to provide a work-around.

$ dbdeployer deploy replication 5.7.23 --gtid -c skip-slave-start -c "relay_log_recovery=on" \
    -c "slave_parallel_type=LOGICAL_CLOCK" -c "slave_parallel_workers=10" -c "slave_preserve_commit_order=1"

$ cd rsandbox_5_7_23

$ echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./m; \
echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | tee >(./s1) | ./s2; \
echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./m; \
echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | tee >(./s1) | ./s2;

$ echo "stop slave; change master to master_auto_position = 1; start slave;" | tee >(./s1) | ./s2; \
./m <<< "create database test_jfg; create table test_jfg.t (id int primary key AUTO_INCREMENT, x int);"; \
for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./m test_jfg & done; \
sleep 5; kill -9 $(cat master/data/mysql_sandbox*.pid); \
echo "stop slave io_thread;" | tee >(./s1) | ./s2;

$ ./m -N <<< "select @@global.gtid_executed"; \
./s1 -N <<< "select @@global.gtid_executed"; \
./s2 -N <<< "select @@global.gtid_executed"

# Here, there is a risk that the master has more transactions than the slaves.
# But ^^ will not cause any problem with re-slaving below.  There is data drift, but not a problem for a test.
# Also, the slave might be lagging, so run again so see if they are still making progress.

$ ln s1 nm; ln s2 ns; \
./ns <<< "stop slave; change master to master_port=$(./nm -N <<< "select @@port"); start slave;"; \
./m <<< "change master to master_host='127.0.0.1', master_port=$(./nm -N <<< "select @@port");"; \
./m <<< "change master to master_user='rsandbox', master_password='rsandbox', master_auto_position=1;"; \
./m <<< "start slave;"; \
echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./nm; \
echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | ./m; \
echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./nm; \
echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | ./m; \
./nm test_jfg <<< "truncate t;"; \
for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./nm test_jfg & done; \
sleep 5; master/restart & tail -f master/data/msandbox.err
[8 Sep 2018 20:17] MySQL Verification Team
Hi,

> I am trying to reproduce, but so far, my efforts failed.

I spent some days on this one trying to reproduce it. Using sandbox it was impossible for me. I used vm's and also I had to run some write traffic on the master + read traffic on the slaves in order to reproduce it. Anyhow, it's not something I can reproduce on demand, reproduced it twice in a row and verified it, but I still can't say for sure I know what triggers it.