| Bug #13659 | Slave DNS name reported incorrectly in master's show processlist | ||
|---|---|---|---|
| Submitted: | 30 Sep 2005 14:36 | Modified: | 24 Jan 2006 9:01 |
| Reporter: | Michael DePhilliips | Email Updates: | |
| Status: | Duplicate | Impact on me: | |
| Category: | MySQL Server | Severity: | S1 (Critical) |
| Version: | 4.1.14 | OS: | Linux (RHEL WS r3 2.4.21-20) |
| Assigned to: | Assigned Account | CPU Architecture: | Any |
[30 Sep 2005 14:48]
Valeriy Kravchuk
Thank you for a bug report. Please, send the my.cnf content and SHOW VARIABLES results from your master server. Please, look at the bug reports http://bugs.mysql.com/bug.php?id=13477 and http://bugs.mysql.com/bug.php?id=11958 also. Do you have anything similar in your configuration (like --skip-name-resolve on the server)? Do you have similar problems trying to connect from your slave hosts with mysql client?
[30 Sep 2005 15:20]
Michael DePhilliips
>>Do you have similar problems trying to connect from your slave hosts with mysql client? No
[4 Oct 2005 10:28]
Borja GarcĂa
We are experiencing the same problem under RH 9 2.4.20-8smp #1 SMP using version MySQL-server-4.1.14-0.glibc23.i386.rpm
mysql> show processlist\G
*************************** 1. row ***************************
Id: 184
User: dbrep
Host: dbrepdb3:37061
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 2. row ***************************
Id: 185
User: dbrep
Host: dbrepdb9:47113
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 3. row ***************************
Id: 186
User: dbrep
Host: dbrepdb6:36330
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 4. row ***************************
Id: 187
User: dbrep
Host: dbrepdb3:32803
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 5. row ***************************
Id: 188
User: dbrep
Host: dbrepdb5:35351
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 6. row ***************************
Id: 189
User: dbrep
Host: dbrepdb0:33472
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 7. row ***************************
Id: 190
User: dbrep
Host: dbrepdb13:32775
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 8. row ***************************
Id: 191
User: dbrep
Host: dbrepdb11:36777
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 9. row ***************************
Id: 192
User: dbrep
Host: dbrepdb4:37242
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 10. row ***************************
Id: 193
User: dbrep
Host: dbrepdb12:37276
db: NULL
Command: Binlog Dump
Time: 14364
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 11. row ***************************
Id: 220
User: dbrep
Host: dbrepdb10:36790
db: NULL
Command: Binlog Dump
Time: 14362
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 12. row ***************************
Id: 221
User: dbrep
Host: dbrepdb8:45216
db: NULL
Command: Binlog Dump
Time: 14362
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 13. row ***************************
Id: 222
User: dbrep
Host: dbrepdb3:37013
db: NULL
Command: Binlog Dump
Time: 14362
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 14. row ***************************
Id: 223
User: dbrep
Host: dbrepdb7:50876
db: NULL
Command: Binlog Dump
Time: 14362
State: Has sent all binlog to slave; waiting for binlog to be updated
Info: NULL
*************************** 15. row ***************************
Id: 203427
User: root
Host: localhost
db: NULL
Command: Query
Time: 0
State: NULL
Info: show processlist
17 rows in set (0.00 sec)
dbrepdb3 appears 3 times, but netstat says:
netstat |grep ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb5:35358 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb9:47131 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb12:37282 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb10:36796 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb2:37015 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb6:36337 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb3:37067 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb1:32805 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb13:32781 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb7:50894 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb8:45234 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb4:37248 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb11:36798 ESTABLISHED
tcp 0 0 dbw:mysql dbrepdb0:33474 ESTABLISHED
cat /etc/hosts
172.20.254.10 dbrepdb0
172.20.254.234 dbrepdb1
172.20.254.235 dbrepdb2
172.20.254.233 dbrepdb3
172.20.254.20 dbrepdb4
172.20.254.23 dbrepdb5
172.20.254.237 dbrepdb6
172.20.254.134 dbrepdb7
172.20.254.242 dbrepdb8
172.20.254.137 dbrepdb9
172.20.254.242 dbrepdb8
172.20.254.247 dbrepdb10
172.20.254.180 dbrepdb11
172.20.254.21 dbrepdb12
172.20.254.22 dbrepdb13
The clue is the same, consecutive ips..but there are others consecutive ips that works fine, although we have restart the server, it seems the problem is with the same machines.
Regards.
[7 Oct 2005 14:22]
Michael DePhilliips
agreed - after a server restart temporarily fixes the problem, it returns to the same machine. Thanks
[20 Oct 2005 8:30]
Andrew Stribblehill
I'm worried that this may be a bug in MySQL's authentication routines.
It's more than just the process list: the master grants privileges on its erroneous view of the slave hostname, at least in my case. Client and software here are mysql-{client,server}-4.1.14-1.FC4.1, on AMD64.
[root@celeste ~]# mysql -h babar -u repl -p<mumble>
ERROR 1045 (28000): Access denied for user 'repl'@'bianca.dur.ac.uk' (using password: YES)
[root@babar etc]# host bianca.dur.ac.uk
bianca.dur.ac.uk has address 129.234.4.218
[root@babar etc]# host 129.234.4.218
218.4.234.129.in-addr.arpa domain name pointer bianca.dur.ac.uk.
[root@babar etc]# host celeste
celeste.dur.ac.uk has address 129.234.4.250
[root@babar etc]# host 129.234.4.250
250.4.234.129.in-addr.arpa domain name pointer celeste.dur.ac.uk.
[root@celeste log]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:00:1A:1A:3E:3C
inet addr:129.234.4.250 Bcast:129.234.255.255 Mask:255.255.0.0
inet6 addr: fe80::200:1aff:fe1a:3e3c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3648047 errors:0 dropped:0 overruns:0 frame:0
TX packets:1965491 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:793207614 (756.4 MiB) TX bytes:150580945 (143.6 MiB)
Interrupt:169
eth0:0 Link encap:Ethernet HWaddr 00:00:1A:1A:3E:3C
inet addr:129.234.4.181 Bcast:129.234.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:169
eth1 Link encap:Ethernet HWaddr 00:00:1A:1A:3E:3B
inet addr:10.0.0.2 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::200:1aff:fe1a:3e3b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:46948 errors:0 dropped:0 overruns:0 frame:0
TX packets:47096 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9151453 (8.7 MiB) TX bytes:9319626 (8.8 MiB)
Interrupt:177
[1 Nov 2005 12:01]
Andrew Stribblehill
Keyword: SECURITY Maybe you should escalate this bug report. Having a host misidentified as a different one could easily have serious security concerns, especially if a systematic way to exploit it were found.
[1 Nov 2005 13:17]
Valeriy Kravchuk
I had increased the severity of this report. Please, inform about the versions of glibc you use. Do you have nscd (Name Service Cache Daemon) installed? What version, if any?
[1 Nov 2005 15:07]
Andrew Stribblehill
# uname -a Linux babar 2.6.13-1.1532_FC4smp #1 SMP Thu Oct 20 01:42:06 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux # rpm -q glibc glibc-2.3.5-10.3 glibc-2.3.5-10.3 (2*glibc -- one for 64-bit and the other for 32-bit legacy stuff) # nscd -V nscd (GNU libc) 2.3.5 Copyright (C) 2005 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Written by Thorsten Kukuk and Ulrich Drepper. I turned off nscd for a while but the problem still exhibited itself. Thanks.
[1 Nov 2005 15:36]
Michael DePhilliips
Thanks for upgrading this. nscd (GNU libc) 2.3.2 glibc-2.3.2-95.37
[4 Nov 2005 9:25]
Valeriy Kravchuk
All reporters: Please, try to start your servers with --skip-host-cache and run them for a while. Send a note here immediately if you will see the same problem after that.
[4 Nov 2005 20:43]
Michael DePhilliips
Hi, I restarted with --skip-host-cache, after about 4 hours the problem seems to be resolved. The problem has been temporarily fixed with a restart, but usualy shows up prior to this amount of time. I will add a post tomorrow if it shows up again over night. Thanks, Michael
[5 Nov 2005 0:04]
Andrew Stribblehill
Restarted with --skip-host-cache; problem went away. Then restarted without --skip-host-cache; problem returned immediately. Then restarted with --skip-host-cache: no noticed problem for 2 hours.
[5 Nov 2005 9:05]
Valeriy Kravchuk
Please, monitor your server for more time. Reopen this report as soon as you get this problem. If this workaround helps you, than it is not a MySQL bug. I'll try to provide you with the explanation why...
[9 Nov 2005 17:02]
Andrew Stribblehill
Workaround still seems to hold. Can you explain why you think it's not a bug in MySQL? To me, it would seem that if it works okay when we turn off the host cache, the host cache is most likely to blame.
[9 Nov 2005 17:53]
Valeriy Kravchuk
Sorry for misinforming you. It really looks like a bug in MySQL itself. So, while using this workaround, describe your dns setup, send the content of the /etc/host.conf, /etc/hosts , nscd.conf, nsswitch.conf, resolv.conf (if any). As the problem occured at various sites, I ask you all for this information. It can help me to create a repeatable test case.
[17 Nov 2005 18:21]
Michael DePhilliips
Sorry for the delayed response.
+++++++++++++++++++++++++++++++++++++++++++++
/etc/host.conf
order hosts,bind
+++++++++++++++++++++++++++++++++++++++
/etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
130.199.88.103 robinson.star.bnl.gov
++++++++++++++++++++++++++++++++++++++++++++
/etc/nscd.conf
#
# An example Name Service Cache config file. This file is needed by nscd.
#
# Legal entries are:
#
# logfile <file>
# debug-level <level>
# threads <#threads to use>
# server-user <user to run server as instead of root>
# server-user is ignored if nscd is started with -S parameters
# stat-user <user who is allowed to request statistics>
#
# enable-cache <service> <yes|no>
# positive-time-to-live <service> <time in seconds>
# negative-time-to-live <service> <time in seconds>
# suggested-size <service> <prime number>
# check-files <service> <yes|no>
#
# Currently supported cache names (services): passwd, group, hosts
#
# logfile /var/log/nscd.log
# threads 6
server-user nscd
# stat-user nocpulse
debug-level 0
enable-cache passwd yes
positive-time-to-live passwd 600
negative-time-to-live passwd 20
suggested-size passwd 211
check-files passwd yes
enable-cache group yes
positive-time-to-live group 3600
negative-time-to-live group 60
suggested-size group 211
check-files group yes
enable-cache hosts yes
positive-time-to-live hosts 3600
negative-time-to-live hosts 20
suggested-size hosts 211
check-files hosts yes
+++++++++++++++++++++++++++++++++++++++++++++++++
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Legal entries are:
#
# nisplus or nis+ Use NIS+ (NIS version 3)
# nis or yp Use NIS (NIS version 2), also called YP
# dns Use DNS (Domain Name Service)
# files Use the local files
# db Use the local database (.db) files
# compat Use NIS on compat mode
# hesiod Use Hesiod for user lookups
# [NOTFOUND=return] Stop searching if not found so far
#
# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd: db files nisplus nis
#shadow: db files nisplus nis
#group: db files nisplus nis
passwd: files
shadow: files
group: files
#hosts: db files nisplus nis dns
hosts: files dns
# Example - obey only what nisplus tells us...
#services: nisplus [NOTFOUND=return] files
#networks: nisplus [NOTFOUND=return] files
#protocols: nisplus [NOTFOUND=return] files
#rpc: nisplus [NOTFOUND=return] files
#ethers: nisplus [NOTFOUND=return] files
#netmasks: nisplus [NOTFOUND=return] files
bootparams: nisplus [NOTFOUND=return] files
ethers: files
netmasks: files
networks: files
protocols: files
rpc: files
services: files
netgroup: files
publickey: nisplus
automount: files
aliases: files nisplus
+++++++++++++++++++++++++
/etc/resolv.conf
search star.bnl.gov
nameserver 130.199.1.1
nameserver 130.199.128.31
[2 Dec 2005 19:50]
Valeriy Kravchuk
So, because --skip-host-cache is the workaroun, the problem is really in the following code (sql/hostname.cc, line 150 in latest 4.1.17-BK sources):
/* Check first if we have name in cache */
if (!(specialflag & SPECIAL_NO_HOST_CACHE))
{
VOID(pthread_mutex_lock(&hostname_cache->lock));
if ((entry=(host_entry*) hostname_cache->search((gptr) &in->s_addr,0)))
{
char *name;
if (!entry->hostname)
name=0; // Don't allow connection
else
name=my_strdup(entry->hostname,MYF(0));
*errors= entry->errors;
VOID(pthread_mutex_unlock(&hostname_cache->lock));
DBUG_RETURN(name);
}
VOID(pthread_mutex_unlock(&hostname_cache->lock));
}
It was stated in the report that bug never occured before. Please, look at the (related) bug #10931. Looks like hostname cache had not been used for a long time, until that bug was fixed in 4.1.13.
Than part of code is executed each and every time MySQL resolves ip addresses to hosnames, if --skip-host-cache is not used. So, people insisting that the problem goes far beyond the SHOW PROCESSLIST results for slave servers, may be quite right.
Looks like there may be some kind of a race condition that is simply easier visible when replication is used. So, I came back to my initial idea, that the problem may be shown without replication at all, with several clients being sent queries at a high rate and SHOW PROCESSLIST being executed repeatedly.
I am trying to create a test case based on these ideas. What do you think about all these?
Yet another thing I want to know: is anybody of the reportes get this bug on singe CPU machine? I saw that smp in the uname -a results, but just to be sure.
[6 Dec 2005 17:11]
Andrew Stribblehill
It certainly seems that the most likely broken part is hostname.cc, somewhere near the ip_to_hostname function. By the way, I'm not convinced about the '// Don't allow connection' comment.
[17 Dec 2005 18:14]
Valeriy Kravchuk
Dear bug reporters (there are many)! Thank you for your patience. I was finally able to pinpoint and verify this weird bug. See bug #15756 for details. I'd mark this one as a duplicate, if you agree. The bug has nothing to do with replication. It is related to hostnames chaching only, and it is easily demonstratable, really, as soon as you are "unlucky" enough to get IP-addresses from some range. As and additional temporary workaround, please, avoid IP-addresses with 206, 207, 232, 233, 234, 235 and some other numbers in dot notation. The bug had not influenced versions before 4.1.14 because hostname cache was effectively bypassed in them because of the other bug, later fixed.
[24 Jan 2006 9:01]
Ramil Kalimullin
See #15756: incorrect ip address matching in ACL due to use of latin1 collation

Description: Show processlist from master displays (along with 14 other slaves) the following ++++++++++show processlist clip++++++++++++++++++++++++++ | 2345 | starmirror | db01.star.bnl.gov:41066 | NULL | Binlog Dump | 4569 | Has sent all binlog to slave; waiting for binlog to be updated | NULL | | 2348 | starmirror | db01.star.bnl.gov:32858 | NULL | Binlog Dump | 4567 | Has sent all binlog to slave; waiting for binlog to be updated | NULL +++++++++++++++++++++++++++++++++++++++++++++++++ entries 2345 and 2348 are differnent machines. ++++++++++++++check to see ip/port estabilshed+++++++++++++++ [root@robinson root]# netstat -an | grep 41066 tcp 0 0 130.199.88.103:3306 130.199.89.207:41066 ESTABLISHED [root@robinson root]# netstat -an | grep 32858 tcp 0 0 130.199.88.103:3306 130.199.89.206:32858 ESTABLISHED +++++++++++++++++++++++++++++++++++++++++++++++++ so..... ++++++++++++check ip/dns+++++++++++++++++++++++++++ [root@robinson root]# nslookup 130.199.89.207 Server: 130.199.1.1 Address: 130.199.1.1#53 207.89.199.130.in-addr.arpa name = db02.star.bnl.gov. [root@robinson root]# nslookup 130.199.89.206 Server: 130.199.1.1 Address: 130.199.1.1#53 206.89.199.130.in-addr.arpa name = db01.star.bnl.gov. +++++++++++++++++++++++++++++++++++++++++++++ entry 2345 should read db02.star.bnl.gov. Clues.... restarting the server _sometimes_ fixes the problem, but it reverts back to the erronious report after some amount of time. How to repeat: The only clue on how to repeat is have two machines with IP address the differ by 1. Make them slaves and execute show processlist on master. Suggested fix: I wouldn't think that the numerical order of the IP address should matter at all. However, it is the only thing different between those two machines and the other slaves. I had no problem before 4.1.14. Thanks