Bug #89395 | Slave unable to start with err message - could not locate rotate events | | |
---|---|---|---|
Submitted: | 25 Jan 2018 8:37 | Modified: | 28 May 2018 5:17 |
Reporter: | Prasad N | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.7.18 | OS: | CentOS (Linux version 2.6.32-696.18.7.el6.x86_64 (mockbuild@c1bl.rdu2.centos.org)) |
Assigned to: | | CPU Architecture: | Any |
[25 Jan 2018 8:37]
Prasad N
[25 Jan 2018 8:42]
Prasad N
Correcting and updating the description field (I could not edit the original description directly): <<<Now I soft-kill my master (let's call it mysql-A) and fail over to one of the slaves (mysql-B). I then bring mysql-A back up and it starts as a slave. Next I restart the system using the shutdown -r command. When the system comes back, I start mysql-A in skip-slave-start mode and the MySQL server comes up. But when I run the START SLAVE command, it fails with the following errors:>>>
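The failover-and-reboot sequence described above can be sketched as follows. This is an illustrative reconstruction, not a verbatim transcript from the report; the host names mysql-A and mysql-B come from the description, and auto-positioning is assumed because the later repro script uses master_auto_position = 1:

```sql
-- On mysql-B (the slave being promoted): stop replicating and become the new master.
STOP SLAVE;
RESET SLAVE ALL;

-- On mysql-A (the old master, restarted as a slave of mysql-B):
CHANGE MASTER TO MASTER_HOST = 'mysql-B', MASTER_AUTO_POSITION = 1;
START SLAVE;

-- mysql-A is then rebooted (shutdown -r) and brought back up with
-- skip-slave-start, so replication is not resumed automatically.
-- Running this manually after the reboot is what fails:
START SLAVE;
-- Fails with the "could not locate rotate event" error this bug reports,
-- reportedly during relay-log recovery (relay_log_recovery = on).
```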
[2 Feb 2018 15:27]
MySQL Verification Team
Hi, I set up a similar (hopefully identical) configuration and had no issues with it. I believe it is a configuration problem, but I can't be 100% sure. Could you please share the configuration from all nodes and the full error log so we can investigate further? All the best, Bogdan
[5 Feb 2018 5:07]
Prasad N
mysqld logs
Attachment: mysqld.log (application/octet-stream, text), 53.73 KiB.
[5 Feb 2018 5:11]
Prasad N
mysql configuration file
Attachment: my.cnf (application/octet-stream, text), 1.57 KiB.
[5 Feb 2018 5:11]
Prasad N
Hi, I have attached the complete error log file and also the my.cnf file that I am using on all nodes. Thanks!
[27 Feb 2018 13:15]
MySQL Verification Team
Hi, it looks like there has to be some traffic running while killing and switching master/slave across multiple databases for this to be reproduced, but I managed to reproduce it, so the bug is verified. All the best, Bogdan
[28 Feb 2018 6:09]
Prasad N
Hi Bogdan, thank you very much. Please let me know if you have any guidance on a fix for this. Thanks, Prasad
[5 Mar 2018 9:23]
Prasad N
Hello - is there a timeframe for the fix? Also, could you provide steps or a workaround to recover from this condition? Thanks, Prasad
[28 May 2018 5:17]
Prasad N
We are regularly hitting this issue. Could you share a plan to address it? At the very least, a verified workaround should be provided; otherwise this should be treated as an S1 bug.
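No verified workaround appears in this report. For context, a recovery path that is commonly used for unreadable relay logs on a GTID-based slave (which this setup is, given master_auto_position = 1 in the repro script) is to discard the relay logs and let auto-positioning re-fetch the missing transactions. This is an assumption on the editor's part, not a fix confirmed against this bug, and it should be tested on a non-production slave first:

```sql
-- Hedged sketch, not verified against bug #89395: discard the relay logs
-- that relay-log recovery cannot read, then re-point via GTID auto-positioning.
STOP SLAVE;
RESET SLAVE;   -- in 5.7 this purges the relay logs but keeps the connection settings
CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
START SLAVE;
```

With MASTER_AUTO_POSITION = 1 the slave requests everything after its gtid_executed set from the master, so discarding relay logs loses no committed transactions as long as the master still has the needed binary logs.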
[8 Sep 2018 15:38]
J-F Logica Gagné
This bug was brought to my attention on Twitter: https://twitter.com/terrorobe/status/1037984288661270528 I am trying to reproduce, but so far my efforts have failed. I am using the script below on a GCP VM with 4 vCPUs; if someone is able to reproduce with this or with a variation of it, please leave a comment or contribute on Twitter so I can try to provide a workaround.

$ dbdeployer deploy replication 5.7.23 --gtid -c skip-slave-start -c "relay_log_recovery=on" \
    -c "slave_parallel_type=LOGICAL_CLOCK" -c "slave_parallel_workers=10" -c "slave_preserve_commit_order=1"
$ cd rsandbox_5_7_23
$ echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./m; \
  echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | tee >(./s1) | ./s2; \
  echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./m; \
  echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | tee >(./s1) | ./s2
$ echo "stop slave; change master to master_auto_position = 1; start slave;" | tee >(./s1) | ./s2; \
  ./m <<< "create database test_jfg; create table test_jfg.t (id int primary key AUTO_INCREMENT, x int);"; \
  for j in $(seq 1 100); do
    seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./m test_jfg &
  done; \
  sleep 5; kill -9 $(cat master/data/mysql_sandbox*.pid); \
  echo "stop slave io_thread;" | tee >(./s1) | ./s2
$ ./m -N <<< "select @@global.gtid_executed"; \
  ./s1 -N <<< "select @@global.gtid_executed"; \
  ./s2 -N <<< "select @@global.gtid_executed"
# Here, there is a risk that the master has more transactions than the slaves.
# But ^^ will not cause any problem with re-slaving below. There is data drift, but that is not a problem for a test.
# Also, the slaves might be lagging, so run the above again to see if they are still making progress.
$ ln s1 nm; ln s2 ns; \
  ./ns <<< "stop slave; change master to master_port=$(./nm -N <<< "select @@port"); start slave;"; \
  ./m <<< "change master to master_host='127.0.0.1', master_port=$(./nm -N <<< "select @@port");"; \
  ./m <<< "change master to master_user='rsandbox', master_password='rsandbox', master_auto_position=1;"; \
  ./m <<< "start slave;"; \
  echo "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';" | ./nm; \
  echo "INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';" | ./m; \
  echo "SET GLOBAL rpl_semi_sync_master_enabled = 1;" | ./nm; \
  echo "SET GLOBAL rpl_semi_sync_slave_enabled = 1; stop slave; start slave;" | ./m; \
  ./nm test_jfg <<< "truncate t;"; \
  for j in $(seq 1 100); do
    seq 1 1000000000 | while read i; do echo "INSERT INTO t (x) VALUE ($i);"; done | ./nm test_jfg &
  done; \
  sleep 5; master/restart & tail -f master/data/msandbox.err
[8 Sep 2018 20:17]
MySQL Verification Team
Hi, > I am trying to reproduce, but so far, my efforts failed. I spent several days trying to reproduce this one. Using a sandbox it was impossible for me; I used VMs, and I also had to run some write traffic on the master plus read traffic on the slaves in order to reproduce it. Anyhow, it's not something I can reproduce on demand: I reproduced it twice in a row and verified it, but I still can't say for sure what triggers it.