| Bug #89395 | Slave unable to start with err message - could not locate rotate events | ||
|---|---|---|---|
| Submitted: | 25 Jan 2018 8:37 | Modified: | 28 May 2018 5:17 |
| Reporter: | Prasad N | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
| Version: | 5.7.18 | OS: | CentOS (Linux version 2.6.32-696.18.7.el6.x86_64 (mockbuild@c1bl.rdu2.centos.org)) |
| Assigned to: | CPU Architecture: | Any | |
[25 Jan 2018 8:42]
Prasad N
Correcting and updating the description field: I could not edit directly in the original description field. <<<Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql-B). Now I bring up mysql-A again and it starts as a slave. Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-A in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors:>>>
[2 Feb 2018 15:27]
MySQL Verification Team
Hi, I made a similar (hopefully same) configuration and had no issues with it. I believe it is a configuration problem but can't be 100% sure. Can you please share configuration from all nodes and full error log so we can investigate further. all best Bogdan
[5 Feb 2018 5:07]
Prasad N
mysqld logs
Attachment: mysqld.log (application/octet-stream, text), 53.73 KiB.
[5 Feb 2018 5:11]
Prasad N
mysql configuration file
Attachment: my.cnf (application/octet-stream, text), 1.57 KiB.
[5 Feb 2018 5:11]
Prasad N
Hi I have added the complete error log fine and also the my.cnf file that I am using on all nodes. Thanks!
[27 Feb 2018 13:15]
MySQL Verification Team
Hi, Looks like it has to have some traffic running while killing and switching master/slave in multiple databases for this to be reproduced but I managed to reproduce it so bug is verified. all best Bogdan
[28 Feb 2018 6:09]
Prasad N
Hi Bogdan. Thank you very much. Please let me know if you have any guidance on the fix for this. Thanks Prasad
[5 Mar 2018 9:23]
Prasad N
Hello - Is there a timeframe for the fix ? Also - could you provide steps or workaround to recover from this condition? Thanks Prasad
[28 May 2018 5:17]
Prasad N
We are regularly hitting this issue. Could you share a plan to address this issue ? At least a verified workaround should be given. Otherwise, this should be treated as S1 bug.
[8 Sep 2018 15:38]
J-F Legacy Gagné
This bug was brought to my attention on Twitter: https://twitter.com/terrorobe/status/1037984288661270528 I am trying to reproduce, but so far, my efforts failed. I am using the script below on a GCP vm with 4 vCPU, if someone is able to reproduce with this or with a variation of this, please leave a comment or contribute on Twitter so I can try to provide a work-around. $ dbdeployer deploy replication 5.7.23 --gtid -c skip-slave-start -c "relay_log_recovery=on" \ -c "slave_parallel_type=LOGICAL_CLOCK" -c "slave_parallel_workers=10" -c "slave_preserve_commit_order=1" $ cd rsandbox_5_7_23 $ echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./m; \ echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | tee >(./s1) | ./s2; \ echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./m; \ echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | tee >(./s1) | ./s2; $ echo "stop slave; change master to master_auto_position = 1; start slave;" | tee >(./s1) | ./s2; \ ./m <<< "create database test_jfg; create table test_jfg.t (id int primary key AUTO_INCREMENT, x int);"; \ for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./m test_jfg & done; \ sleep 5; kill -9 $(cat master/data/mysql_sandbox*.pid); \ echo "stop slave io_thread;" | tee >(./s1) | ./s2; $ ./m -N <<< "select @@global.gtid_executed"; \ ./s1 -N <<< "select @@global.gtid_executed"; \ ./s2 -N <<< "select @@global.gtid_executed" # Here, there is a risk that the master has more transactions than the slaves. # But ^^ will not cause any problem with re-slaving below. There is data drift, but not a problem for a test. # Also, the slave might be lagging, so run again so see if they are still making progress. $ ln s1 nm; ln s2 ns; \ ./ns <<< "stop slave; change master to master_port=$(./nm -N <<< "select @@port"); start slave;"; \ ./m <<< "change master to master_host='127.0.0.1', master_port=$(./nm -N <<< "select @@port");"; \ ./m <<< "change master to master_user='rsandbox', master_password='rsandbox', master_auto_position=1;"; \ ./m <<< "start slave;"; \ echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./nm; \ echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | ./m; \ echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./nm; \ echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | ./m; \ ./nm test_jfg <<< "truncate t;"; \ for j in $(seq 1 100); do seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./nm test_jfg & done; \ sleep 5; master/restart & tail -f master/data/msandbox.err
[8 Sep 2018 20:17]
MySQL Verification Team
Hi, > I am trying to reproduce, but so far, my efforts failed. I spent some days on this one trying to reproduce it. Using sandbox it was impossible for me. I used vm's and also I had to run some write traffic on the master + read traffic on the slaves in order to reproduce it. Anyhow, it's not something I can reproduce on demand, reproduced it twice in a row and verified it, but I still can't say for sure I know what triggers it.

Description: I have a master and 2 slave replica set. It is functioning in the semi-synchronous mode of replication. I have also enabled multi-threaded slave option for my slaves and there are 10 worker threads. Now I softkill my master(lets call it mysql-A) and failover to one of the slaves (mysql_b). Now I bring up mysql-A again and it starts as a slave. Now I restart my system using shutdown -r command. When the system comes back, I try to start my mysql-b in skip-slave-start mode and mysql server comes up. Now I run the SLAVE START command but it fails with following errors: [ERROR] Error during --relay-log-recovery: Could not locate rotate event from the master. 2018-01-24T19:26:47.996950Z 0 [ERROR] Failed to initialize the master info structure 2018-01-24T19:26:47.996968Z 0 [ERROR] Failed to create or recover replication info repositories. 2018-01-24T19:26:47.996975Z 0 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted. Other status info: show slave status \G; *************************** 1. row *************************** Slave_IO_State: Master_Host: 172.19.0.20 Master_User: repl_user Master_Port: 3306 Connect_Retry: 60 Master_Log_File: Read_Master_Log_Pos: 4 Relay_Log_File: relay-log-slave.000001 Relay_Log_Pos: 4 Relay_Master_Log_File: Slave_IO_Running: No Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 0 Relay_Log_Space: 0 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 0 Master_UUID: d156fa9b-ff46-11e7-99bd-000d3afd6d09 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712, d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: mysql> show master status \G; *************************** 1. row *************************** File: mysql-bin.000006 Position: 234 Binlog_Do_DB: Binlog_Ignore_DB: Executed_Gtid_Set: ccc33176-ff46-11e7-9a46-000d3af94857:1-670712, d156fa9b-ff46-11e7-99bd-000d3afd6d09:1-93029 1 row in set (0.00 sec) mysql> select * from slave_master_info\G; *************************** 1. row *************************** Number_of_lines: 25 Master_log_name: Master_log_pos: 4 Host: 172.19.0.20 User_name: repl_user User_password: 3cAe9NuTZl3HJjPr Port: 3306 Connect_retry: 60 Enabled_ssl: 1 Ssl_ca: Ssl_capath: Ssl_cert: Ssl_cipher: Ssl_key: Ssl_verify_server_cert: 0 Heartbeat: 30 Bind: Ignored_server_ids: 0 Uuid: d156fa9b-ff46-11e7-99bd-000d3afd6d09 Retry_count: 86400 Ssl_crl: Ssl_crlpath: Enabled_auto_position: 1 Channel_name: Tls_version: 1 row in set (0.00 sec) mysql> select * from slave_relay_log_info\G; *************************** 1. row *************************** Number_of_lines: 7 Relay_log_name: ./relay-log-slave.000001 Relay_log_pos: 4 Master_log_name: Master_log_pos: 0 Sql_delay: 0 Number_of_workers: 10 Id: 1 Channel_name: 1 row in set (0.00 sec) How to repeat: Am using Innodb and replication is GTID based and autoposition enabled. Replication is semi sync and MTS is enabled. Other relevant config settings on my server are: sync_binlog=1 innodb_flush_log_at_trx_commit=1 gtid_mode=ON enforce_gtid_consistency=ON binlog_format=MIXED log_slave_updates=true slave_net_timeout=60 master_info_repository=TABLE relay_log_info_repository=TABLE sync_master_info=10000 sync_relay_log=10000 relay_log_recovery=1 slave_parallel_workers=10 slave_preserve_commit_order=1 slave_parallel_type=LOGICAL_CLOCK rpl_semi_sync_master_timeout=300000 I have a master (A)and 2 slave(B and C) replica set running semi-sync replication. My master is being loaded with simple integer INSERT operations by 100 threads doing inserts every 10 ms. * I soft kill my master mysql (A) * I promote slave -B as the new master. * I start mysql(B) again as a slave. [ i use skip-slave-start option and once mysql starts, i use START SLAVE] * in 30s, I do a restart of my server thats running mysql-A (shutdown -r) * I start mysql A again [ i use skip-slave-start option and once mysql starts, i use START SLAVE] * Now START SLAVE fails with above stated erros Suggested fix: Since I have crash safety settings enabled on slave, I expect slave to start gracefully.