Bug #13062 Replication slave fails to start under OSF1 (HP Tru64 UNIX)
Submitted: 8 Sep 2005 11:55 Modified: 19 Feb 2008 8:38
Reporter: David Harper
Status: Closed
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 4.1.14 OS: Any (OSF1 V5.1)
Assigned to: Andrei Elkin CPU Architecture: Any

[8 Sep 2005 11:55] David Harper
Description:
I am running MySQL 4.1.14 on HP Alpha machines under OSF1 V5.1 version 2650.
I installed MySQL from the OSF1 binary version downloaded from one of the
official mirror sites, mirror.ac.uk.

I have configured a replication master on a machine named 'mars' and a slave
on a machine named 'venus'.

The master server runs fine, but whenever I try to start the slave, I get a
report like this in the error log file:

050828 13:33:30  mysqld started
050828 13:33:30 [Warning] Can't open and lock time zone table: Table 'mysql.time_zone_leap_second' doesn't exist trying to live without them
/nfs/pathsoft/external/mysql-standard-4.1.14/bin/mysqld: ready for connections.
Version: '4.1.14-standard'  socket: '/nfs/arcturus1/mysql/etc/mysql.mars-dev.sock'  port: 14644  MySQL Community Edition - Standard (GPL)
050828 13:33:30 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log '/nfs/arcturus1/mysql/data/mars-dev/relay-bin.000001' position: 4
050828 13:33:30 [ERROR] Slave I/O thread: error connecting to master 'slave@mars:14642': Error: 'Unknown MySQL server host 'mars' (1)'  errno: 2005  retry-time: 30  retries: 86400

This suggests that the machine running the replication slave cannot perform a
hostname-to-IP address lookup to convert the master's hostname 'mars' to an IP
address.

However, there is nothing wrong with DNS or NIS at our site. I can run the mysql
client program on venus with a command line such as "mysql -h mars -P 14642 ..."
and connect successfully.

I can also run a replication slave on a Linux machine using exactly the same
configuration file, so I know that I have configured both the master and
slave servers correctly.

How to repeat:
1. Set up a replication master on any machine.
2. Set up a replication slave on any HP Alpha machine running OSF1.
3. Try to start the replication slave.

This problem is easily repeatable.

Suggested fix:
Inspection of the source code shows that the source of the problem is the
wrapper code for gethostbyname_r in the file mysys/my_gethostbyname.c, and
specifically the section within the

#elif defined(HAVE_GETHOSTBYNAME_R_RETURN_INT)
...
#elif ...

conditional compilation block.

Under OSF1, the manual page for gethostbyname_r is as follows (irrelevant text
is replaced by ellipsis "..."):

-------------------------------------------------------------------------------
NAME

  gethostbyname, gethostbyname_r - Get a network host entry by name

SYNOPSIS

  #include <netdb.h>

  struct hostent *gethostbyname(
          const char *name );

  [Tru64 UNIX]  The following function is supported in order to maintain
  backward compatibility with previous versions of the operating system.  You
  should not use it in new designs.

  int gethostbyname_r(
          const char *name,
          struct hostent *hptr,
          struct hostent_data *hdptr );
...

PARAMETERS

  name
      Specifies the official network name or alias.

  hptr
      [Tru64 UNIX]  For gethostbyname_r() only, this points to the hostent
      structure.  The netdb.h header file defines hostent structure.

  hdptr
      [Tru64 UNIX]  For gethostbyname_r() only, this is data for hosts data-
      base.  The netdb.h header file defines hostent_data structure.
... 

NOTES

  The gethostbyname() function returns a pointer to thread-specific data.
  Subsequent calls to this or a related function from the same thread
  overwrite this data.

  [Tru64 UNIX]  The gethostbyname_r() function is an obsolete reentrant ver-
  sion of the gethostbyname() function.  It is supported in order to maintain
  backward compatibility with previous versions of the operating system and
  should not be used in new designs.  Note that you must zero-fill the hdptr
  structure before its first access by the gethostbyname_r() function.

RETURN VALUES

  Upon successful completion, the gethostbyname() function returns a pointer
  to a hostent structure.  If it reaches the end of the network host name
  database, it returns a null pointer.

  [Tru64 UNIX]  Upon successful completion, the gethostbyname_r() function
  stores the hostent structure in the location pointed to by hptr, and
  returns a value of 0 (zero). Upon failure, it returns a value of -1.

ERRORS

  If the gethostbyname() or gethostbyname_r() function call fails, h_errno is
  set to one of the following the values:

 ...

  [Tru64 UNIX]  If any of the following conditions occurs, the
  gethostbyaddr_r() function sets errno to the corresponding value:

  [EINVAL]
      The name, hptr, or hdptr is invalid.
-------------------------------------------------------------------------------

The MySQL source code does not zero-fill the "struct hostent_data" structure
before the call, as OSF1 requires. As a result, the call to gethostbyname_r
returns a non-zero value and, according to the manual page above, sets errno
to EINVAL. Unfortunately, the MySQL source code interprets this to mean that
the hostname lookup failed.
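
To illustrate the zero-fill requirement, here is a minimal sketch. It is not
the MySQL code and it does not call the real OSF1 function: the sketch_* types
and the sketch_gethostbyname_r function are hypothetical stand-ins that only
imitate the contract described in the manual page (0 on success, -1 with errno
set to EINVAL when hdptr was not zero-filled before first use). The 0xAA fill
is an assumption used to simulate uninitialised stack memory.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-ins for the OSF1 types declared in <netdb.h>. */
struct sketch_hostent      { const char *h_name; };
struct sketch_hostent_data { unsigned state; };

#define SKETCH_READY 0x600DF00Du   /* marks data set up by an earlier call */

/* Simulated lookup imitating the documented contract: a hostent_data that
   is neither zero-filled nor previously initialised is rejected. */
int sketch_gethostbyname_r(const char *name,
                           struct sketch_hostent *hptr,
                           struct sketch_hostent_data *hdptr)
{
  assert(name != NULL && hptr != NULL && hdptr != NULL);
  if (hdptr->state != 0 && hdptr->state != SKETCH_READY)
  {
    errno = EINVAL;                /* invalid hdptr */
    return -1;
  }
  hdptr->state = SKETCH_READY;
  hptr->h_name = name;             /* pretend the lookup succeeded */
  return 0;
}

/* Wrapper showing the difference the memset makes. */
int sketch_lookup(const char *name, struct sketch_hostent *result,
                  int zero_fill)
{
  struct sketch_hostent_data data;
  memset(&data, 0xAA, sizeof data);   /* simulate stack garbage */
  if (zero_fill)
    memset(&data, 0, sizeof data);    /* what the manual page requires */
  return sketch_gethostbyname_r(name, result, &data);
}
```

With zero_fill set, sketch_lookup("mars", &h, 1) succeeds; without it, the
same call returns -1 with errno set to EINVAL, which a caller could easily
mistake for an unknown host.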

This problem is closely related to a bug which I reported in April 2002:

http://lists.mysql.com/bugs/11975

On that occasion, Monty investigated and correctly analysed the problem:

http://lists.mysql.com/bugs/11977

He noted that gethostbyname is thread-safe under OSF1, so it is not necessary
to use the re-entrant version gethostbyname_r, which in any case is flagged as
obsolete.
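
A rough sketch of that kind of workaround, under the assumption that the
platform's gethostbyname() returns thread-specific data as the OSF1 manual
page states. The macro SKETCH_GETHOSTBYNAME_IS_THREAD_SAFE and the function
sketch_resolve_ipv4 are hypothetical names for illustration only; this is not
Monty's actual patch. The result is copied out immediately because a later
call from the same thread overwrites the returned data.

```c
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical build-time switch: set to 1 only on platforms where plain
   gethostbyname() is documented to be thread-safe. */
#define SKETCH_GETHOSTBYNAME_IS_THREAD_SAFE 1

/* Resolve name to its first IPv4 address.
   Returns 0 on success and -1 on failure. */
int sketch_resolve_ipv4(const char *name, struct in_addr *out)
{
  assert(name != NULL && out != NULL);
#if SKETCH_GETHOSTBYNAME_IS_THREAD_SAFE
  struct hostent *he = gethostbyname(name);
  if (he == NULL || he->h_addrtype != AF_INET ||
      he->h_length != (int) sizeof *out || he->h_addr_list[0] == NULL)
    return -1;
  /* Copy now: the next lookup in this thread overwrites *he. */
  memcpy(out, he->h_addr_list[0], sizeof *out);
  return 0;
#else
  /* A reentrant variant would go here on other platforms. */
  return -1;
#endif
}
```

Avoiding gethostbyname_r altogether sidesteps the hostent_data initialisation
issue, at the cost of relying on a platform-specific thread-safety guarantee.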

Monty's workaround should still work, but it seems to have been dropped from
the official builds at some point.

Unfortunately, I'm unable to build from source code myself because I lack the
HP C++ compiler, so I can't verify my diagnosis of the problem, but I'm
quite confident that it is correct.
[8 Sep 2005 12:56] David Harper
This problem is not new in 4.1.14. It has been present in all of the 4.1 releases we have used in the past year, i.e. 4.1.7, 4.1.9 and 4.1.13a, as well as the latest release.

Also, I have used replication successfully with MySQL 3.23 following Monty's fix in 2002.

We quit using replication for a couple of years, but have recently returned to using it since the software project became mission-critical.

For operational reasons, we do need to get replication working under OSF1. Our Linux machine is not part of our production hardware environment. I only used it to prove to myself (and to you guys) that I hadn't screwed up the replication configuration.
[19 Feb 2008 8:38] David Harper
We no longer use DEC/Compaq/HP Alpha machines running OSF1, so this bug is no longer relevant to my organisation.