Bug #118059 sockets_acceptors stopped after failover check timeout even if a MySQL recovery occurs in the meantime
Submitted: 25 Apr 10:07    Modified: 30 Jul 12:09
Reporter: Garrido Mickael
Status: Verified
Category: MySQL Router     Severity: S1 (Critical)
Version: 8.0.37+           OS: Any
Assigned to:               CPU Architecture: Any

[25 Apr 10:07] Garrido Mickael
Description:
We faced an issue when using MySQL Router with the InnoDB Cluster metadata cache.

Sometimes DNS requests fail and a MySQL instance is switched to quarantine until the next successful resolution; that is the expected behavior.
But when we have a single primary instance, a failover wait is triggered, and once that wait expires the sockets_acceptors are shut down even if a MySQL instance has recovered in the meantime.

1. DNS outage.
2. Resolution fails for each MySQL instance's FQDN.
3. The MySQL instances are switched to quarantine.
4. The sockets_acceptors are stopped.
5. A failover wait of 10s starts.
6. DNS recovers.
7. Resolution succeeds and the MySQL instances are removed from quarantine.
8. The sockets_acceptors are started again.
After some time:
9. The failover timeout (10s) is reached.
10. The sockets_acceptors are stopped.
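
To illustrate, here is a minimal standalone C++ sketch of the suspected race. All names in it (stop_sockets_acceptors, has_healthy_destination, ...) are hypothetical, not MySQL Router's actual internals: the failover timer armed in step 5 fires unconditionally in step 9 and closes the acceptors that step 8 reopened.

    // Minimal sketch of the suspected race; names are illustrative,
    // not the Router's real code. Build with: g++ -pthread race.cc
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> has_healthy_destination{false};

    void stop_sockets_acceptors()  { std::puts("sockets_acceptors stopped"); }
    void start_sockets_acceptors() { std::puts("sockets_acceptors started"); }

    int main() {
      using namespace std::chrono_literals;

      // Steps 2-5: all destinations quarantined, acceptors stopped,
      // failover timer armed (10s in the real Router, shortened here).
      stop_sockets_acceptors();
      std::thread failover_timer([] {
        std::this_thread::sleep_for(1s);
        // Steps 9-10: the timer fires and stops the acceptors without
        // re-checking whether a destination recovered in the meantime.
        stop_sockets_acceptors();
      });

      // Steps 6-8: DNS recovers before the timer fires; a destination
      // is healthy again and the acceptors are reopened.
      std::this_thread::sleep_for(500ms);
      has_healthy_destination = true;
      start_sockets_acceptors();

      failover_timer.join();  // acceptors end up closed despite recovery
      return 0;
    }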

How to repeat:
1. Create an InnoDB Cluster.
2. Bootstrap MySQL Router.
3. Simulate a DNS outage with an iptables rule, e.g. iptables -A OUTPUT -p udp --dport 53 -j DROP.
4. Wait until the first "resolve" fails with this error message: Temporary failure in name resolution (a standalone check for this is sketched below).
5. Immediately recover DNS by removing the iptables rule. This must be done before the 10s failover timeout expires.
6. The sockets_acceptors should now be unreachable.

The only way to recover is to restart MySQL Router.
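
For step 4, a small standalone resolver check (plain POSIX getaddrinfo, independent of the Router; the FQDN is a placeholder) can confirm the simulated outage produces the same resolver errors the Router logs. With the 53/udp DROP rule active, getaddrinfo() fails with EAI_AGAIN, which glibc renders as "Temporary failure in name resolution"; an unknown host instead gives "Name or service not known" (EAI_NONAME), the message quoted later in this report.

    // Standalone DNS check, independent of MySQL Router. With the
    // iptables DROP rule active, expect EAI_AGAIN, printed by glibc
    // as "Temporary failure in name resolution".
    #include <netdb.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      // Placeholder FQDN; pass one of your MySQL instance names instead.
      const char *host = (argc > 1) ? argv[1] : "mysql-1.example.com";

      addrinfo *res = nullptr;
      int rc = getaddrinfo(host, nullptr, nullptr, &res);
      if (rc != 0) {
        std::printf("resolve(%s) failed: %s\n", host, gai_strerror(rc));
        return 1;
      }
      freeaddrinfo(res);
      std::printf("resolve(%s) succeeded\n", host);
      return 0;
    }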

Suggested fix:
Check whether a MySQL instance is healthy before stopping the sockets_acceptors after the failover timeout.
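
In terms of the sketch posted above (still hypothetical names, not a patch against the actual Router sources), the suggested guard would amount to:

    // Guarded variant of the failover timer callback from the sketch
    // above (hypothetical; the real fix belongs in the Router's
    // destination/acceptor handling).
    std::thread failover_timer([] {
      std::this_thread::sleep_for(std::chrono::seconds(10));
      if (has_healthy_destination.load()) {
        return;  // an instance recovered during the wait: keep accepting
      }
      stop_sockets_acceptors();
    });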
[7 May 13:02] Garrido Mickael
Could we have a small update on this issue, please?
Do you need more information?
[7 May 14:12] MySQL Verification Team
Hi,

Apologies for the wait, I have just verified the report.
[22 May 14:38] Paulo Machado
Hi Garrido Mickael, 

I am trying to reproduce a similar situation, but I am uncertain whether I have.
What other logs do you see?
I would expect to see `Waiting for failover to happen..` - https://github.com/mysql/mysql-server/blob/6ba1fef58b043ac5e9657ded777d20619b9b2f4e/router... 

But I cannot see it in my case.
[22 May 20:29] Paulo Machado
To correct myself, I can see it. The log level needs to be set to DEBUG.
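For anyone else reproducing this, the level is set in the [logger] section of mysqlrouter.conf:

    [logger]
    level = DEBUG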
[2 Jun 9:03] Garrido Mickael
Thanks for the update. Glad to see you were able to trigger the log.
FYI, we've encountered this issue again in the production environment.
[17 Jun 7:12] MySQL Verification Team
Bug #118303 is marked as a duplicate of this one.
[18 Jun 16:17] Brandon WELSCH
Hi, I am seeing the same issue with multiple MySQL clusters.

Each time we need to restart the MySQL Router instances to resolve it.

This has repeatedly impacted the availability of our MySQL databases.

Could a fix be prioritized and provided soon?
[21 Jul 10:27] Garrido Mickael
We would appreciate a prompt update on this issue.
[23 Jul 12:59] Jonathan Hurter
This has caused incidents in dockerized deployments: if the Docker daemon restarts, DNS is unavailable for a short time, which often triggers this bug. This makes it quite impactful.
[30 Jul 12:09] Garrido Mickael
This issue appears to have been addressed in the MySQL Router 8.0.43 release on July 22nd. Could you please confirm whether the fix mentioned in the release notes pertains to this specific issue?

https://dev.mysql.com/doc/relnotes/mysql-router/8.0/en/news-8-0-43.html

> After a DNS failure, the destination was not added to quarantine and could not
> be checked for availability after the connection was restored. Errors were
> returned similar to the following:
> 
>         resolve(host) failed: Name or service not known