MySQL Bugs: #89395: Slave unable to start with err message

Bug #89395	Slave unable to start with err message - could not locate rotate events
Submitted:	25 Jan 2018 8:37	Modified:	28 May 2018 5:17
Reporter:	Prasad N	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.18	OS:	CentOS (Linux version 2.6.32-696.18.7.el6.x86_64 (mockbuild@c1bl.rdu2.centos.org))
Assigned to:		CPU Architecture:	Any

Description:
I have a master and 2 slave replica set. It is functioning in the semi-synchronous mode of replication. I have also enabled multi-threaded slave option for my slaves and there are 10 worker threads.
Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql_b). Now I bring up mysql-A again and it starts as a slave.
Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-b in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors:

[ERROR] Error during --relay-log-recovery: Could not locate rotate event from the master.
2018-01-24T19:26:47.996950Z 0 [ERROR] Failed to initialize the master info structure
2018-01-24T19:26:47.996968Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-01-24T19:26:47.996975Z 0 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.

Other status info:

show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 172.19.0.20
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-log-slave.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
                  Master_UUID: d156fa9b-ff46-11e7-99bd-000d3afd6d09
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712,
d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:

mysql> show master status \G;
*************************** 1. row ***************************
             File: mysql-bin.000006
         Position: 234
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712,
d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029
1 row in set (0.00 sec)

mysql> select * from slave_master_info\G;
*************************** 1. row ***************************
       Number_of_lines: 25
       Master_log_name:
        Master_log_pos: 4
                  Host: 172.19.0.20
             User_name: repl_user
         User_password: 3cAe9NuTZl3HJjPr
                  Port: 3306
         Connect_retry: 60
           Enabled_ssl: 1
                Ssl_ca:
            Ssl_capath:
              Ssl_cert:
            Ssl_cipher:
               Ssl_key:
Ssl_verify_server_cert: 0
             Heartbeat: 30
                  Bind:
    Ignored_server_ids: 0
                  Uuid: d156fa9b-ff46-11e7-99bd-000d3afd6d09
           Retry_count: 86400
               Ssl_crl:
           Ssl_crlpath:
 Enabled_auto_position: 1
          Channel_name:
           Tls_version:
1 row in set (0.00 sec)

mysql> select * from slave_relay_log_info\G;
*************************** 1. row ***************************
  Number_of_lines: 7
   Relay_log_name: ./relay-log-slave.000001
    Relay_log_pos: 4
  Master_log_name:
   Master_log_pos: 0
        Sql_delay: 0
Number_of_workers: 10
               Id: 1
     Channel_name:
1 row in set (0.00 sec)

How to repeat:
Am using Innodb and replication is GTID based and autoposition enabled.
Replication is semi sync and MTS is enabled. Other relevant config settings on my server are:

sync_binlog=1
innodb_flush_log_at_trx_commit=1
gtid_mode=ON
enforce_gtid_consistency=ON
binlog_format=MIXED
log_slave_updates=true
slave_net_timeout=60
master_info_repository=TABLE
relay_log_info_repository=TABLE
sync_master_info=10000
sync_relay_log=10000
relay_log_recovery=1
slave_parallel_workers=10
slave_preserve_commit_order=1
slave_parallel_type=LOGICAL_CLOCK
rpl_semi_sync_master_timeout=300000

     
I have a master (A)and 2 slave(B and C) replica set running semi-sync replication.
My master is being loaded with simple integer INSERT operations by 100 threads doing inserts every 10 ms.

* I soft kill my master mysql (A)
* I promote slave -B as the new master.
* I start mysql(B) again as a slave. [ i use skip-slave-start option and once mysql starts, i use START SLAVE]
* in 30s, I do a restart of my server thats running mysql-A (shutdown -r)
* I start mysql A again [ i use skip-slave-start option and once mysql starts, i use START SLAVE]
* Now START SLAVE fails with above stated erros

Suggested fix:
Since I have crash safety settings enabled on slave, I expect slave to start gracefully.

Correcting and updating the description field: I could not edit directly in the original description field.

<<<Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql-B). Now I bring up mysql-A again and it starts as a slave.
Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-A in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors:>>>

Hi,

I made a similar (hopefully same) configuration and had no issues with it. I believe it is a configuration problem but can't be 100% sure. Can you please share configuration from all nodes and full error log so we can investigate further.

all best
Bogdan

mysqld logs

Attachment: mysqld.log (application/octet-stream, text), 53.73 KiB.

mysql configuration file

Attachment: my.cnf (application/octet-stream, text), 1.57 KiB.

Hi

I have added the complete error log fine and also the my.cnf file that I am using on all nodes.

Thanks!

Hi,

Looks like it has to have some traffic running while killing and switching master/slave in multiple databases for this to be reproduced but I managed to reproduce it so bug is verified.

all best
Bogdan

Hi Bogdan. Thank you very much. Please let me know if you have any guidance on the fix for this.

Thanks
Prasad

Hello - Is there a timeframe for the fix ?
Also - could you provide steps or workaround to recover from this condition?

Thanks
Prasad

We are regularly hitting this issue. Could you share a plan to address this issue ?
At least a verified workaround should be given. Otherwise, this should be treated as S1 bug.

This bug was brought to my attention on Twitter:
https://twitter.com/terrorobe/status/1037984288661270528

I am trying to reproduce, but so far, my efforts failed.  I am using the script below on a GCP vm with 4 vCPU, if someone is able to reproduce with this or with a variation of this, please leave a comment or contribute on Twitter so I can try to provide a work-around.

$ dbdeployer deploy replication 5.7.23 --gtid -c skip-slave-start -c "relay_log_recovery=on" \
    -c "slave_parallel_type=LOGICAL_CLOCK" -c "slave_parallel_workers=10" -c "slave_preserve_commit_order=1"

$ cd rsandbox_5_7_23

$ echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./m; \
echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | tee >(./s1) | ./s2; \
echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./m; \
echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | tee >(./s1) | ./s2;

$ echo "stop slave; change master to master_auto_position = 1; start slave;" | tee >(./s1) | ./s2; \
./m <<< "create database test_jfg; create table test_jfg.t (id int primary key AUTO_INCREMENT, x int);"; \
for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./m test_jfg & done; \
sleep 5; kill -9 $(cat master/data/mysql_sandbox*.pid); \
echo "stop slave io_thread;" | tee >(./s1) | ./s2;

$ ./m -N <<< "select @@global.gtid_executed"; \
./s1 -N <<< "select @@global.gtid_executed"; \
./s2 -N <<< "select @@global.gtid_executed"

# Here, there is a risk that the master has more transactions than the slaves.
# But ^^ will not cause any problem with re-slaving below.  There is data drift, but not a problem for a test.
# Also, the slave might be lagging, so run again so see if they are still making progress.

$ ln s1 nm; ln s2 ns; \
./ns <<< "stop slave; change master to master_port=$(./nm -N <<< "select @@port"); start slave;"; \
./m <<< "change master to master_host='127.0.0.1', master_port=$(./nm -N <<< "select @@port");"; \
./m <<< "change master to master_user='rsandbox', master_password='rsandbox', master_auto_position=1;"; \
./m <<< "start slave;"; \
echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./nm; \
echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | ./m; \
echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./nm; \
echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | ./m; \
./nm test_jfg <<< "truncate t;"; \
for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./nm test_jfg & done; \
sleep 5; master/restart & tail -f master/data/msandbox.err

Hi,

> I am trying to reproduce, but so far, my efforts failed.

I spent some days on this one trying to reproduce it. Using sandbox it was impossible for me. I used vm's and also I had to run some write traffic on the master + read traffic on the slaves in order to reproduce it. Anyhow, it's not something I can reproduce on demand, reproduced it twice in a row and verified it, but I still can't say for sure I know what triggers it.