Bug #77581 Collect DNS timing information into Performance_Schema
Submitted: 1 Jul 2015 12:29 Modified: 20 Nov 2015 10:04
Reporter: Simon Mudd (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Performance Schema Severity:S4 (Feature request)
Version:5.6.25 OS:Any
Assigned to: Marc ALFF CPU Architecture:Any

[1 Jul 2015 12:29] Simon Mudd
Description:
A recent connectivity issue we have connecting slaves to a MySQL server make it hard to diagnose the cause which may be related to the default skip_name_resolve = 0 setting (which is ok).

With this setting MySQL does a reverse name lookup on a client's ip address to get the hostname if the hostname is not already in the host_cache.

However, given the workings of DNS the timing that these requests may take may not be consistent and may depend on the ip concerned. This may affect the connect time as seen by the MySQL client.

Additionally a tcp connect may be established but the dns lookup may take a long time and if you configure "mysql connects" to be time limited then given the final connect is not confirmed to the client it won't be clear if the cause is at the tcp or mysql level.

How to repeat:
look and see that there are no metrics on this information.

Suggested fix:
Add
I'd like to see some sort of metrics / counters which indicate
* the number of dns requests made
* the latency of these requests
* any different "error/good states" that might make sense
* some dns requests may return more than one type of information. If this is visible to the libraries used by MySQL then some information on what's returned (if you use the system library you may not see anything, I've not checked the code)

Additionally aggregating this information per host, or per mysql username may make a lot of sense, providing total times, counts  and also maximum times.

DNS timeouts are generally of the order of 5 seconds, and the are 2 further requests made by most standard libraries, so latency > 5s usually indicates a retry which implies a failed request (did not reach server) or failed response (did not get back to client) and this is likely to be very much larger than the normal LAN rates of response of probably 1ms or less.