Description:
In the following scenario we hit an infinite loop of 'flush privileges' events between the master and the slave (which is itself a master, but read-only). This caused application connections to fail periodically with the error "host is not allowed to connect to this MySQL server":
The 'slave' was restarted with a changed server-id, and while it was behind the master there was a 'flush privileges' event in the relay log still waiting to be executed.
The root cause of the connection errors is a logic defect, explained below:
For 'flush privileges', acl_reload() is called, which in turn calls acl_load(). The global variable allow_all_hosts is set to 0 under the lock, and acl_check_hosts is modified under the same lock.
But when a client connects to the server, acl_check_host() is called. Its logic is listed below:
bool acl_check_host(const char *host, const char *ip)
1496 {
1497 if (allow_all_hosts)
1498 return 0;
1499 VOID(pthread_mutex_lock(&acl_cache->lock));
1500
1501 if ((host && hash_search(&acl_check_hosts,(uchar*) host,strlen(host))) ||
1502 (ip && hash_search(&acl_check_hosts,(uchar*) ip, strlen(ip))))
1503 {
At line 1497, allow_all_hosts is read without holding any lock, so the value a connecting thread sees can be inconsistent with the contents of acl_check_hosts.
So there is a race, and in special cases such as the one we hit, it becomes a real problem.
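To make the interleaving concrete, here is a minimal single-threaded replay of it. This is only a sketch, not server code: the table, the two reload phases and the replay_* functions are hypothetical stand-ins. It assumes the reload ends with allow_all_hosts back at 1 and no explicit host entries in the table, and that 1 means "rejected".

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins for the server's ACL state (hypothetical model). */
#define MAX_HOSTS 4
static int allow_all_hosts = 1;              /* 1 = a wildcard account exists */
static const char *check_hosts[MAX_HOSTS];   /* stands in for acl_check_hosts */
static int n_check_hosts = 0;

static int host_listed(const char *host)
{
  assert(host != NULL);
  for (int i = 0; i < n_check_hosts; i++)
    if (strcmp(check_hosts[i], host) == 0)
      return 1;
  return 0;
}

/* Reload split in two phases so an interleaving can be replayed by hand. */
static void reload_begin(void)
{
  allow_all_hosts = 0;   /* reset while the reloader holds the lock */
  n_check_hosts = 0;     /* host table is cleared before being rebuilt */
}

static void reload_finish(void)
{
  /* Assumption: a wildcard account is found, so the flag goes back to 1
     and no explicit host entries are added to the table. */
  allow_all_hosts = 1;
}

/* Unlocked read of the flag: it can land in the middle of a reload.
   Returns 0 when the host is allowed, 1 when it is rejected. */
int replay_unlocked_check(const char *host)
{
  reload_begin();
  int seen = allow_all_hosts;          /* client reads 0 without the lock */
  reload_finish();                     /* reloader finishes, releases lock */
  if (seen)
    return 0;
  return host_listed(host) ? 0 : 1;    /* client searches an empty table */
}

/* Locked read of the flag: it can only happen once the reload is done. */
int replay_locked_check(const char *host)
{
  reload_begin();
  reload_finish();
  int seen = allow_all_hosts;          /* client reads 1 under the lock */
  if (seen)
    return 0;
  return host_listed(host) ? 0 : 1;
}
```

In this model the unlocked replay rejects the host even though the final state allows every host, while the locked replay accepts it.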
How to repeat:
None.
Suggested fix:
Check allow_all_hosts under the lock, like this:
VOID(pthread_mutex_lock(&acl_cache->lock));
if (allow_all_hosts) {
  VOID(pthread_mutex_unlock(&acl_cache->lock));
  return 0;
}
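For reference, a self-contained sketch of what the whole check could look like with the flag read under the lock. It is a simplified model, not a patch against the server: acl_lock stands in for acl_cache->lock, the fixed host list stands in for the acl_check_hosts hash, and acl_check_host_locked() and set_allow_all_hosts() are hypothetical names. Returning 1 for a rejected host is an assumption. In the real function, the existing lock call at line 1499 would presumably move up rather than be duplicated.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Stand-in for acl_cache->lock. */
static pthread_mutex_t acl_lock = PTHREAD_MUTEX_INITIALIZER;
static int allow_all_hosts = 0;

/* Hypothetical fixed list standing in for the acl_check_hosts hash. */
static const char *allowed_hosts[] = { "app01", "10.0.0.5" };

static int host_listed(const char *name)
{
  size_t n = sizeof(allowed_hosts) / sizeof(allowed_hosts[0]);
  for (size_t i = 0; i < n; i++)
    if (strcmp(allowed_hosts[i], name) == 0)
      return 1;
  return 0;
}

/* Writer side: the flag only changes while the lock is held. */
void set_allow_all_hosts(int value)
{
  int rc = pthread_mutex_lock(&acl_lock);
  assert(rc == 0);
  (void) rc;
  allow_all_hosts = value;
  pthread_mutex_unlock(&acl_lock);
}

/* Reader side: returns 0 if the client may connect, 1 otherwise.
   The flag and the host list are read under the same lock, so the
   reader never combines a stale flag with a newer host list. */
int acl_check_host_locked(const char *host, const char *ip)
{
  int rejected = 1;
  int rc = pthread_mutex_lock(&acl_lock);
  assert(rc == 0);
  (void) rc;
  if (allow_all_hosts ||
      (host && host_listed(host)) ||
      (ip && host_listed(ip)))
    rejected = 0;
  pthread_mutex_unlock(&acl_lock);
  return rejected;
}
```

The cost is one extra mutex acquisition per connection when allow_all_hosts is set, which is the trade-off for dropping the unlocked fast path.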