Bug #83713 | Slave failed to initialize relay log info after OS crash when using MTS and GTID
---|---|---|---
Submitted: | 7 Nov 2016 7:48 | Modified: | 30 May 2019 4:48 |
Reporter: | huajun chen | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.6.31 | OS: | Any |
Assigned to: | | CPU Architecture: | Any
Tags: | crash safe, GTID, MTS |
[7 Nov 2016 7:48]
huajun chen
[21 Mar 2017 8:53]
MySQL Verification Team
Hello huajun chen, thank you for the report. I could not repeat the issue with 5.6.35. Could you please confirm with 5.6.35, and if you are able to reproduce, report back with the exact conf files from both master and slave, the load script used on the master, and the unaltered error log from the slave for our investigation. If you can provide more information, feel free to add it to this bug and change the status back to 'Open'. Thank you for your interest in MySQL. Thanks, Umesh
[10 May 2018 12:21]
Alexander Ufimtsev
We've hit exactly the same issue with 5.7.22 - replication just wouldn't work after the server crash. The fix by Huajun Chen saved the day, thanks! Would it be possible to re-triage? Thanks in advance!
[21 Sep 2018 12:24]
tolgahan zeybek
If you are getting the following error:

[ERROR] Slave SQL for channel 'group_replication_applier': Slave failed to initialize relay log info structure from the repository, Error_code: 1872

here is the solution. From the mysql console:

USE mysql;
TRUNCATE TABLE slave_relay_log_info;

Then restart mysql: service mysql restart. Hope this works for you guys too!
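The truncate-and-restart workaround above can be written out as a short, commented SQL sequence. This is a sketch only, based on the steps in the comment: truncating mysql.slave_relay_log_info throws away the replica's recorded relay log position, so it is only reasonable when the replica can re-locate its position some other way (e.g. GTID auto-positioning, as in the scenarios in this thread), and it assumes table-based repositories (relay_log_info_repository=TABLE), where this state lives.

```sql
-- Run from the mysql client on the affected replica.
-- Assumes relay_log_info_repository=TABLE, so the relay log
-- applier state is stored in mysql.slave_relay_log_info.
USE mysql;
TRUNCATE TABLE slave_relay_log_info;
-- Then restart the server from the shell, e.g.:
--   service mysql restart
-- On restart, the replica rebuilds its relay log info; with GTID
-- auto-positioning it can resume from the correct transaction.
```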
[30 Jan 2019 4:37]
Derek Perkins
I ran into this same issue on 8.0.13. I can also confirm that this workaround solved the problem for me:

reset slave;
start slave IO_THREAD;
stop slave IO_THREAD;
reset slave;
start slave;
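The sequence above can be annotated step by step. This is a sketch of the commenter's workaround, not a documented recovery procedure: RESET SLAVE deletes the replica's relay log files and resets its relay log state, so it assumes GTID auto-positioning (MASTER_AUTO_POSITION=1) is in use so the replica can re-fetch the right transactions from the master.

```sql
RESET SLAVE;            -- delete relay log files and reset relay log state
START SLAVE IO_THREAD;  -- let the receiver thread create fresh relay logs
STOP SLAVE IO_THREAD;   -- stop it again before touching state further
RESET SLAVE;            -- clear state once more, now with consistent files
START SLAVE;            -- start both receiver and applier threads normally
```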
[30 Jan 2019 8:41]
Simon Mudd
Note: I have seen this issue a few times recently too. I think it would be helpful to add extra logging around the slave initialisation process, as I would imagine there are quite a number of possible problems that may arise while setting this up. The current error message encompasses all of them and provides no context on the actual error being seen. This extra logging would probably clarify the circumstances better.
[11 Feb 2019 5:08]
Shlomi Noach
We suspect we have been affected by this, too. In our scenario, a master crashed, and on all replicas, all using GTID (AUTO_POSITION=1) and MTS, we were unable to run `CHANGE MASTER TO` or `RESET SLAVE ALL`. This, in turn, failed what would have been an automated master failover.

On all replicas, we would see:

<host.name> mysqld[18373]: Error looking for file after ./mysqld-relay-bin.004157.

and our clients would see:

Error 1201: Could not initialize master info structure; more error messages can be found in the MySQL error log

We kicked off some automated response and then resorted to manual operations. At this time we do not know the shortest path out of this situation, and we have not yet tried any of the suggested workarounds. Our colleague Gillian Gunson triaged our scenario against this issue. We will potentially try to reliably reproduce the error and automate the workaround until this bug is resolved. We'll report back with any conclusions.
[23 May 2019 12:09]
Simon Mudd
Going to request that this bug report be opened again. Seen recently on 5.7.25-1.el7 rpm (CentOS 7):

root@myhost [(none)]> start slave io_thread;
ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

And we saw this in the log:

2019-05-23T10:00:19.306164Z 288564 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872

Given that there is no reporting of the actual error that happened here (only that an error occurred), can we at least improve the logging to provide more information? Then we can at least determine what the problem is and work out how to avoid it.

I understand Oracle cannot reproduce this and the failure pattern may be specific, but that does not mean the issue is not there. Derek Perkins has reported this on 8.0, so I'm tempted to expect the issue is present in the latest version of MySQL. The original report is on 5.6, and I'm confirming this causes issues on 5.7, so it seems to be a general problem that would be good to fix once we can identify the cause. If there's a better way of identifying the cause, please let us know what that would be.

I cannot change the affected version as I did not open this bug report, but it would be good to add the 5.7 and 8.0 versions mentioned here as being known to be affected. Thanks.
[23 May 2019 12:24]
Simon Mudd
I'll upload my /etc/my.cnf file in case one of the configuration settings may be causing this behaviour.
[23 May 2019 14:09]
Daniël van Eeden
I don't think this should be re-opened. I think a new bug to improve the logging is what's needed. It should more clearly indicate why the initialization failed.
[24 May 2019 12:34]
MySQL Verification Team
Hello Simon, my apologies for the delayed response. I agree with Daniel; moreover, I found a close match, Bug #93397, which is already verified. If you prefer, you can add your suggestions for improving the logging there. Thank you! regards, Umesh
[29 May 2019 12:07]
Simon Mudd
OK, let's do it that way, though the error message reported here is still the same, and JF's bug report came two years after this one. I'll add suggestions to https://bugs.mysql.com/93397.
[29 May 2019 12:18]
Jean-François Gagné
Maybe related: Bug#92882.
[30 May 2019 4:48]
MySQL Verification Team
Thank you, Simon. regards, Umesh
[30 May 2019 6:53]
Jean-François Gagné
Hi Umesh, I am not sure this bug is a duplicate of Bug#93397. I think it is more closely related to Bug#92882: Bug#93397 involves neither a crash nor MTS, whereas Bug#92882 involves both. I agree the error messages are very similar, though, so I guess all of this is related in some way that I do not understand yet. Cheers, JFG