MySQL Bugs: #13659: Slave DNS name reported incorrectly in master's show processlist

Bug #13659	Slave DNS name reported incorrectly in master's show processlist
Submitted:	30 Sep 2005 14:36	Modified:	24 Jan 2006 9:01
Reporter:	Michael DePhilliips	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	4.1.14	OS:	Linux (RHEL WS r3 2.4.21-20)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Show processlist from master displays (along with 14 other slaves) the following

++++++++++show processlist clip++++++++++++++++++++++++++
| 2345 | starmirror | db01.star.bnl.gov:41066        | NULL             | Binlog Dump | 4569 | Has sent all binlog to slave; waiting for binlog to be updated | NULL             |
             
| 2348 | starmirror | db01.star.bnl.gov:32858        | NULL             | Binlog Dump | 4567 | Has sent all binlog to slave; waiting for binlog to be updated | NULL 
+++++++++++++++++++++++++++++++++++++++++++++++++

entries 2345 and 2348 are differnent machines.

++++++++++++++check to see ip/port estabilshed+++++++++++++++
[root@robinson root]# netstat -an | grep 41066
tcp        0      0 130.199.88.103:3306         130.199.89.207:41066        ESTABLISHED 
[root@robinson root]# netstat -an | grep 32858
tcp        0      0 130.199.88.103:3306         130.199.89.206:32858        ESTABLISHED 
+++++++++++++++++++++++++++++++++++++++++++++++++

so.....

++++++++++++check ip/dns+++++++++++++++++++++++++++
[root@robinson root]# nslookup 130.199.89.207
Server:         130.199.1.1
Address:        130.199.1.1#53

207.89.199.130.in-addr.arpa     name = db02.star.bnl.gov.

[root@robinson root]# nslookup 130.199.89.206
Server:         130.199.1.1
Address:        130.199.1.1#53

206.89.199.130.in-addr.arpa     name = db01.star.bnl.gov.
+++++++++++++++++++++++++++++++++++++++++++++

entry 2345 should read db02.star.bnl.gov.

Clues....
restarting the server _sometimes_ fixes the problem, but it reverts back to the erronious report after some amount of time.

How to repeat:
The only clue on how to repeat is have two machines with IP address the differ by 1. Make them slaves and execute show processlist on master.

Suggested fix:
I wouldn't think that the numerical order of the IP address should matter at all. However, it is the only thing different between those two machines and the other slaves.  I had no problem before 4.1.14.

Thanks

Thank you for a bug report. Please, send the my.cnf content and SHOW VARIABLES results from your master server. 

Please, look at the bug reports http://bugs.mysql.com/bug.php?id=13477 and http://bugs.mysql.com/bug.php?id=11958 also. Do you have anything similar in your configuration (like --skip-name-resolve on the server)?

Do you have similar problems trying to connect from  your slave hosts with mysql client?

>>Do you have similar problems trying to connect from  your slave hosts with mysql client?

No

We are experiencing the same problem under RH 9 2.4.20-8smp #1 SMP using version MySQL-server-4.1.14-0.glibc23.i386.rpm

mysql> show processlist\G
*************************** 1. row ***************************
     Id: 184
   User: dbrep
   Host: dbrepdb3:37061
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 2. row ***************************
     Id: 185
   User: dbrep
   Host: dbrepdb9:47113
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 3. row ***************************
     Id: 186
   User: dbrep
   Host: dbrepdb6:36330
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 4. row ***************************
     Id: 187
   User: dbrep
   Host: dbrepdb3:32803
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 5. row ***************************
     Id: 188
   User: dbrep
   Host: dbrepdb5:35351
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 6. row ***************************
     Id: 189
   User: dbrep
   Host: dbrepdb0:33472
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 7. row ***************************
     Id: 190
   User: dbrep
   Host: dbrepdb13:32775
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 8. row ***************************
     Id: 191
   User: dbrep
   Host: dbrepdb11:36777
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 9. row ***************************
     Id: 192
   User: dbrep
   Host: dbrepdb4:37242
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 10. row ***************************
     Id: 193
   User: dbrep
   Host: dbrepdb12:37276
     db: NULL
Command: Binlog Dump
   Time: 14364
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 11. row ***************************
     Id: 220
   User: dbrep
   Host: dbrepdb10:36790
     db: NULL
Command: Binlog Dump
   Time: 14362
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 12. row ***************************
     Id: 221
   User: dbrep
   Host: dbrepdb8:45216
     db: NULL
Command: Binlog Dump
   Time: 14362
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 13. row ***************************
     Id: 222
   User: dbrep
   Host: dbrepdb3:37013
     db: NULL
Command: Binlog Dump
   Time: 14362
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL
*************************** 14. row ***************************
     Id: 223
   User: dbrep
   Host: dbrepdb7:50876
     db: NULL
Command: Binlog Dump
   Time: 14362
  State: Has sent all binlog to slave; waiting for binlog to be updated
   Info: NULL

*************************** 15. row ***************************
     Id: 203427
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
17 rows in set (0.00 sec)

dbrepdb3 appears 3 times, but netstat says:

netstat |grep ESTABLISHED
tcp        0      0 dbw:mysql             dbrepdb5:35358          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb9:47131          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb12:37282         ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb10:36796         ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb2:37015          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb6:36337          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb3:37067          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb1:32805          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb13:32781         ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb7:50894          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb8:45234          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb4:37248          ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb11:36798         ESTABLISHED 
tcp        0      0 dbw:mysql             dbrepdb0:33474          ESTABLISHED 

cat /etc/hosts
172.20.254.10          dbrepdb0
172.20.254.234         dbrepdb1
172.20.254.235         dbrepdb2
172.20.254.233         dbrepdb3
172.20.254.20          dbrepdb4
172.20.254.23          dbrepdb5
172.20.254.237         dbrepdb6
172.20.254.134         dbrepdb7
172.20.254.242         dbrepdb8
172.20.254.137         dbrepdb9
172.20.254.242         dbrepdb8
172.20.254.247         dbrepdb10
172.20.254.180		dbrepdb11
172.20.254.21          dbrepdb12
172.20.254.22          dbrepdb13

The clue is the same, consecutive ips..but there are others consecutive ips that works fine, although we have restart the server, it seems the problem is with the same machines.

Regards.

agreed - after a server restart temporarily fixes the problem, it returns to the same machine.

Thanks

I'm worried that this may be a bug in MySQL's authentication routines.

It's more than just the process list: the master grants privileges on its erroneous view of the slave hostname, at least in my case. Client and software here are mysql-{client,server}-4.1.14-1.FC4.1, on AMD64.

[root@celeste ~]# mysql -h babar -u repl -p<mumble>
ERROR 1045 (28000): Access denied for user 'repl'@'bianca.dur.ac.uk' (using password: YES)

[root@babar etc]# host bianca.dur.ac.uk
bianca.dur.ac.uk has address 129.234.4.218
[root@babar etc]# host 129.234.4.218
218.4.234.129.in-addr.arpa domain name pointer bianca.dur.ac.uk.
[root@babar etc]# host celeste
celeste.dur.ac.uk has address 129.234.4.250
[root@babar etc]# host 129.234.4.250
250.4.234.129.in-addr.arpa domain name pointer celeste.dur.ac.uk.

[root@celeste log]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:00:1A:1A:3E:3C
          inet addr:129.234.4.250  Bcast:129.234.255.255  Mask:255.255.0.0
          inet6 addr: fe80::200:1aff:fe1a:3e3c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3648047 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1965491 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:793207614 (756.4 MiB)  TX bytes:150580945 (143.6 MiB)
          Interrupt:169

eth0:0    Link encap:Ethernet  HWaddr 00:00:1A:1A:3E:3C
          inet addr:129.234.4.181  Bcast:129.234.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:169

eth1      Link encap:Ethernet  HWaddr 00:00:1A:1A:3E:3B
          inet addr:10.0.0.2  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::200:1aff:fe1a:3e3b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:46948 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47096 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:9151453 (8.7 MiB)  TX bytes:9319626 (8.8 MiB)
          Interrupt:177

Keyword: SECURITY

Maybe you should escalate this bug report. Having a host misidentified as a different one could easily have serious security concerns, especially if a systematic way to exploit it were found.

I had increased the severity of this report.

Please, inform about the versions of glibc you use. Do you have nscd (Name Service Cache Daemon) installed? What version, if any?

# uname -a
Linux babar 2.6.13-1.1532_FC4smp #1 SMP Thu Oct 20 01:42:06 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

# rpm -q glibc
glibc-2.3.5-10.3
glibc-2.3.5-10.3
(2*glibc -- one for 64-bit and the other for 32-bit legacy stuff)

# nscd -V
nscd (GNU libc) 2.3.5
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Thorsten Kukuk and Ulrich Drepper.

I turned off nscd for a while but the problem still exhibited itself.

Thanks.

Thanks for upgrading this.

nscd (GNU libc) 2.3.2
glibc-2.3.2-95.37

All reporters: 

Please, try to start your servers with --skip-host-cache and run them for a while. Send a note here immediately if you will see the same problem after that.

Hi,

I restarted with --skip-host-cache, after about 4 hours the problem seems to be resolved.  The problem has been temporarily fixed with a restart, but usualy shows up prior to this amount of time. I will add a post tomorrow if it shows up again over night.

Thanks,
Michael

Restarted with --skip-host-cache; problem went away.
Then restarted without --skip-host-cache; problem returned immediately.
Then restarted with --skip-host-cache: no noticed problem for 2 hours.

Please, monitor your server for more time. Reopen this report as soon as you get this problem. 

If this workaround helps you, than it is not a MySQL bug. I'll try to provide you with the explanation why...

Workaround still seems to hold. Can you explain why you think it's not a bug in MySQL? To me, it would seem that if it works okay when we turn off the host cache, the host cache is most likely to blame.

Sorry for misinforming you. 

It really looks like a bug in MySQL itself. So, while using this workaround, describe your dns setup, send the content of the /etc/host.conf, /etc/hosts , nscd.conf, nsswitch.conf, resolv.conf (if any). 

As the problem occured at various sites, I ask you all for this information. It can help me to create a repeatable test case.

Sorry for the delayed response.  
+++++++++++++++++++++++++++++++++++++++++++++
 /etc/host.conf 
order hosts,bind
+++++++++++++++++++++++++++++++++++++++
/etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
130.199.88.103          robinson.star.bnl.gov

++++++++++++++++++++++++++++++++++++++++++++
/etc/nscd.conf
#
# An example Name Service Cache config file.  This file is needed by nscd.
#
# Legal entries are:
#
#       logfile                 <file>
#       debug-level             <level>
#       threads                 <#threads to use>
#       server-user             <user to run server as instead of root>
#               server-user is ignored if nscd is started with -S parameters
#       stat-user               <user who is allowed to request statistics>
#
#       enable-cache            <service> <yes|no>
#       positive-time-to-live   <service> <time in seconds>
#       negative-time-to-live   <service> <time in seconds>
#       suggested-size          <service> <prime number>
#       check-files             <service> <yes|no>
#
# Currently supported cache names (services): passwd, group, hosts
#

#       logfile                 /var/log/nscd.log
#       threads                 6
        server-user             nscd
#       stat-user               nocpulse
        debug-level             0

        enable-cache            passwd          yes
        positive-time-to-live   passwd          600
        negative-time-to-live   passwd          20
        suggested-size          passwd          211
        check-files             passwd          yes

        enable-cache            group           yes
        positive-time-to-live   group           3600
        negative-time-to-live   group           60
        suggested-size          group           211
        check-files             group           yes

        enable-cache            hosts           yes
        positive-time-to-live   hosts           3600
        negative-time-to-live   hosts           20
        suggested-size          hosts           211
        check-files             hosts           yes

+++++++++++++++++++++++++++++++++++++++++++++++++
# /etc/nsswitch.conf
#
# An example Name Service Switch config file. This file should be
# sorted with the most-used services at the beginning.
#
# The entry '[NOTFOUND=return]' means that the search for an
# entry should stop if the search in the previous entry turned
# up nothing. Note that if the search failed due to some other reason
# (like no NIS server responding) then the search continues with the
# next entry.
#
# Legal entries are:
#
#       nisplus or nis+         Use NIS+ (NIS version 3)
#       nis or yp               Use NIS (NIS version 2), also called YP
#       dns                     Use DNS (Domain Name Service)
#       files                   Use the local files
#       db                      Use the local database (.db) files
#       compat                  Use NIS on compat mode
#       hesiod                  Use Hesiod for user lookups
#       [NOTFOUND=return]       Stop searching if not found so far
#

# To use db, put the "db" in front of "files" for entries you want to be
# looked up first in the databases
#
# Example:
#passwd:    db files nisplus nis
#shadow:    db files nisplus nis
#group:     db files nisplus nis

passwd:     files
shadow:     files
group:      files

#hosts:     db files nisplus nis dns
hosts:      files dns

# Example - obey only what nisplus tells us...
#services:   nisplus [NOTFOUND=return] files
#networks:   nisplus [NOTFOUND=return] files
#protocols:  nisplus [NOTFOUND=return] files
#rpc:        nisplus [NOTFOUND=return] files
#ethers:     nisplus [NOTFOUND=return] files
#netmasks:   nisplus [NOTFOUND=return] files     

bootparams: nisplus [NOTFOUND=return] files

ethers:     files
netmasks:   files
networks:   files
protocols:  files
rpc:        files
services:   files

netgroup:   files

publickey:  nisplus

automount:  files
aliases:    files nisplus
+++++++++++++++++++++++++
/etc/resolv.conf 
search star.bnl.gov
nameserver 130.199.1.1
nameserver 130.199.128.31

So, because --skip-host-cache is the workaroun, the problem is really in the following code (sql/hostname.cc, line 150 in latest 4.1.17-BK sources):

  /* Check first if we have name in cache */
  if (!(specialflag & SPECIAL_NO_HOST_CACHE))
  {
    VOID(pthread_mutex_lock(&hostname_cache->lock));
    if ((entry=(host_entry*) hostname_cache->search((gptr) &in->s_addr,0)))
    {
      char *name;
      if (!entry->hostname)
        name=0;                                 // Don't allow connection
      else
        name=my_strdup(entry->hostname,MYF(0));
      *errors= entry->errors;
      VOID(pthread_mutex_unlock(&hostname_cache->lock));
      DBUG_RETURN(name);
    }
    VOID(pthread_mutex_unlock(&hostname_cache->lock));
  }

It was stated in the report that bug never occured before. Please, look at the (related) bug #10931. Looks like hostname cache had not been used for a long time, until that bug was fixed in 4.1.13. 

Than part of code is executed each and every time MySQL resolves ip addresses to hosnames, if --skip-host-cache is not used. So, people insisting that the problem goes far beyond the SHOW PROCESSLIST results for slave servers, may be quite right.

Looks like there may be some kind of a race condition that is simply easier visible when replication is used. So, I came back to my initial idea, that the problem may be shown without replication at all, with several clients being sent queries at a high rate and SHOW PROCESSLIST being executed repeatedly.

I am trying to create a test case based on these ideas. What do you think about all these?

Yet another thing I want to know: is anybody of the reportes get this bug on singe CPU machine? I saw that smp in the uname -a results, but just to be sure.

It certainly seems that the most likely broken part is hostname.cc, somewhere near the ip_to_hostname function.

By the way, I'm not convinced about the '// Don't allow connection' comment.

Dear bug reporters (there are many)! Thank you for your patience. I was finally able to pinpoint and verify this weird bug. See bug #15756 for details. I'd mark this one as a duplicate, if you agree.

The bug has nothing to do with replication. It is related to hostnames chaching only, and it is easily demonstratable, really, as soon as you are "unlucky" enough to get IP-addresses from some range.

As and additional temporary workaround, please, avoid IP-addresses with 206, 207, 232, 233, 234, 235 and some other numbers in dot notation. The bug had not influenced versions before 4.1.14 because hostname cache was effectively bypassed in them because of the other bug, later fixed.

See #15756: incorrect ip address matching in ACL due to use of latin1 collation