Bug #118059 sockets_acceptors stopped after failover check timeout even if a MySQL recovery occurs in the meantime
Submitted: 25 Apr 10:07    Modified: 30 Jul 12:09
Reporter: Garrido Mickael
Status: Verified
Category: MySQL Router     Severity: S1 (Critical)
Version: 8.0.37+           OS: Any
Assigned to:               CPU Architecture: Any

[25 Apr 10:07] Garrido Mickael
Description:
We faced an issue when using MySQL Router with the InnoDB Cluster metadata cache.

Sometimes DNS requests fail and a MySQL instance is switched to quarantine until the next successful resolution; that is the expected behavior.
But when we have a single primary instance, a failover wait is triggered, and once that wait expires the sockets_acceptors are shut down even if a MySQL instance has recovered in the meantime.

1. DNS outage.
2. Resolution fails for each MySQL instance's FQDN.
3. The MySQL instances are switched to quarantine.
4. The sockets_acceptors are stopped.
5. A failover wait of 10s starts.
6. DNS recovers.
7. Resolution succeeds and the MySQL instances are removed from quarantine.
8. The sockets_acceptors are started again.
After some time:
9. The failover timeout (10s) is reached.
10. The sockets_acceptors are stopped.
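
To illustrate, here is a minimal standalone C++ sketch of the suspected race. All names in it (stop_sockets_acceptors, has_healthy_destination, ...) are hypothetical, not MySQL Router's actual internals: the failover timer armed in step 5 fires unconditionally in step 9 and closes the acceptors that step 8 reopened.

    // Minimal sketch of the suspected race; names are illustrative,
    // not the Router's real code. Build with: g++ -pthread race.cc
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> has_healthy_destination{false};

    void stop_sockets_acceptors()  { std::puts("sockets_acceptors stopped"); }
    void start_sockets_acceptors() { std::puts("sockets_acceptors started"); }

    int main() {
      using namespace std::chrono_literals;

      // Steps 2-5: all destinations quarantined, acceptors stopped,
      // failover timer armed (10s in the real Router, shortened here).
      stop_sockets_acceptors();
      std::thread failover_timer([] {
        std::this_thread::sleep_for(1s);
        // Steps 9-10: the timer fires and stops the acceptors without
        // re-checking whether a destination recovered in the meantime.
        stop_sockets_acceptors();
      });

      // Steps 6-8: DNS recovers before the timer fires; a destination
      // is healthy again and the acceptors are reopened.
      std::this_thread::sleep_for(500ms);
      has_healthy_destination = true;
      start_sockets_acceptors();

      failover_timer.join();  // acceptors end up closed despite recovery
      return 0;
    }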

How to repeat:
1. Create an InnoDB Cluster.
2. Bootstrap MySQL Router.
3. Simulate a DNS outage with an iptables rule, e.g. iptables -A OUTPUT -p udp --dport 53 -j DROP.
4. Wait until the first "resolve" fails with this error message: Temporary failure in name resolution (a standalone check for this is sketched below).
5. Immediately recover DNS by removing the iptables rule. This must be done before the 10s failover timeout expires.
6. The sockets_acceptors should now be unreachable.

The only way to recover is to restart MySQL Router.
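
For step 4, a small standalone resolver check (plain POSIX getaddrinfo, independent of the Router; the FQDN is a placeholder) can confirm the simulated outage produces the same resolver errors the Router logs. With the 53/udp DROP rule active, getaddrinfo() fails with EAI_AGAIN, which glibc renders as "Temporary failure in name resolution"; an unknown host instead gives "Name or service not known" (EAI_NONAME), the message quoted later in this report.

    // Standalone DNS check, independent of MySQL Router. With the
    // iptables DROP rule active, expect EAI_AGAIN, printed by glibc
    // as "Temporary failure in name resolution".
    #include <netdb.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      // Placeholder FQDN; pass one of your MySQL instance names instead.
      const char *host = (argc > 1) ? argv[1] : "mysql-1.example.com";

      addrinfo *res = nullptr;
      int rc = getaddrinfo(host, nullptr, nullptr, &res);
      if (rc != 0) {
        std::printf("resolve(%s) failed: %s\n", host, gai_strerror(rc));
        return 1;
      }
      freeaddrinfo(res);
      std::printf("resolve(%s) succeeded\n", host);
      return 0;
    }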

Suggested fix:
Check whether a MySQL instance is healthy before stopping the sockets_acceptors after the failover timeout.
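
In terms of the sketch posted above (still hypothetical names, not a patch against the actual Router sources), the suggested guard would amount to:

    // Guarded variant of the failover timer callback from the sketch
    // above (hypothetical; the real fix belongs in the Router's
    // destination/acceptor handling).
    std::thread failover_timer([] {
      std::this_thread::sleep_for(std::chrono::seconds(10));
      if (has_healthy_destination.load()) {
        return;  // an instance recovered during the wait: keep accepting
      }
      stop_sockets_acceptors();
    });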
[7 May 13:02] Garrido Mickael
Could we have a small update on this issue, please?
Do you need more information?
[7 May 14:12] MySQL Verification Team
Hi,

Apologies for the wait, I have just verified the report.
[22 May 14:38] Paulo Machado
Hi Garrido Mickael, 

I am trying to reproduce a similar situation, but I am uncertain whether I have.
What other logs do you see?
I would expect to see `Waiting for failover to happen..` - https://github.com/mysql/mysql-server/blob/6ba1fef58b043ac5e9657ded777d20619b9b2f4e/router... 

But I cannot see it in my case.
[22 May 20:29] Paulo Machado
To correct myself, I can see it. The log level needs to be set to DEBUG.
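For anyone else reproducing this, the level is set in the [logger] section of mysqlrouter.conf:

    [logger]
    level = DEBUG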
[2 Jun 9:03] Garrido Mickael
Thanks for the update. Glad to see you were able to trigger the log.
FYI, we've encountered this issue again in the production environment.
[17 Jun 7:12] MySQL Verification Team
Bug #118303 is marked as a duplicate of this one.
[18 Jun 16:17] Brandon WELSCH
Hi, I am seeing the same issue with multiple MySQL clusters.

Each time we need to restart the MySQL Router instances to resolve it.

This has repeatedly impacted the availability of our MySQL databases.

Could a fix be prioritized and provided soon?
[21 Jul 10:27] Garrido Mickael
We would appreciate a prompt update on this issue.
[23 Jul 12:59] Jonathan Hurter
This has caused incidents in dockerized deployments: if the Docker daemon restarts, DNS is unavailable for a short time, which often triggers this bug. This makes it quite impactful.
[30 Jul 12:09] Garrido Mickael
This issue appears to have been addressed in the MySQL Router 8.0.43 release on July 22nd. Could you please confirm whether the fix mentioned in the release notes pertains to this specific issue?

https://dev.mysql.com/doc/relnotes/mysql-router/8.0/en/news-8-0-43.html

> After a DNS failure, the destination was not added to quarantine and could not
> be checked for availability after the connection was restored. Errors were
> returned similar to the following:
> 
>         resolve(host) failed: Name or service not known