MySQL Bugs: #28690: mysql-enterprise-gpl-5.0.40-linux-i686-glibc won't start on Debian Sarge

Bug #28690	mysql-enterprise-gpl-5.0.40-linux-i686-glibc won't start on Debian Sarge
Submitted:	25 May 2007 19:51	Modified:	20 Jun 2007 1:08
Reporter:	Bruce Hard	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:		OS:	Linux (Debian Sarge (3.1))
Assigned to:	Magnus Blåudd	CPU Architecture:	Any
Tags:	bfsm_2007_05_31, regression

Description:
shell> uname -a
Linux debian 2.4.27-3-386 #1 Wed Dec 6 00:38:33 UTC 2006 i686 GNU/Linux

* If I try and install mysql on a clean install of Debian, it freezes during the mysql_install_db script.
* If I try and run mysqld_safe & it appears to be running in a ps ax, but never creates the socket file.  The server is not listening on port 3306 either.

Works without error on mysql-standard-5.0.22-linux-i686-glibc.

How to repeat:
* Install fresh copy of debian sarge 3.1 for x86 (supported platform: <http://www.mysql.com/support/supportedplatforms/enterprise.html>)
* Install mysql-enterprise-gpl-5.0.40-linux-i686-glibc tar file from enterprise website.
* Try and start mysqld.

Same here for a debian 4.0 system (it's actually testing so a little newer than 4.0). 

Platform: Debian Linux 4.0 (+ testing)
Hardware: Intel dualcore (6400), with 2GB RAM and lots of diskspace.

Xen has been previously installed, but is now disabled (tls back in place).

Update to 5.0.38 failed in postinst step (hangs forever), trying to start manually results in the mysqld running (listening on port 3306, socket created and reported to by used in lsof), but no connections are possible. The mysql client just hangs forever (well I haven't waited forever, as that would be really long, but 20min should be sufficient here).

strace only shows that all but one thread are

root@einstein:/var/log/mysql# strace -p 11523 -p 11566 -p 11567 -p 11568 -p 11570 -f
Process 11523 attached - interrupt to quit
Process 11566 attached - interrupt to quit
Process 11567 attached - interrupt to quit
Process 11568 attached - interrupt to quit
Process 11570 attached - interrupt to quit
[pid 11566] rt_sigsuspend([HUP INT QUIT PIPE ALRM TERM TSTP] <unfinished ...>
[pid 11568] getppid( <unfinished ...>
[pid 11570] rt_sigsuspend([HUP INT QUIT PIPE TERM TSTP] <unfinished ...>
[pid 11568] <... getppid resumed> )     = 11566
[pid 11568] poll( <unfinished ...>
[pid 11523] waitpid(-1,  <unfinished ...>
[pid 11567] read(0,  <unfinished ...>
[pid 11568] <... poll resumed> [{fd=3, events=POLLIN}], 1, 2000) = 0
[pid 11568] getppid()                   = 11566
[pid 11568] poll([{fd=3, events=POLLIN}], 1, 2000) = 0
[pid 11568] getppid()                   = 11566
[pid 11568] poll([{fd=3, events=POLLIN}], 1, 2000) = 0
[pid 11568] getppid()                   = 11566
[pid 11568] poll([{fd=3, events=POLLIN}], 1, 2000) = 0
[pid 11568] getppid()                   = 11566
[pid 11568] poll( <unfinished ...>
Process 11523 detached
Process 11566 detached
Process 11567 detached
Process 11568 detached
Process 11570 detached

After a few seconds an connection attempt using mysql client has been performed, but the server didn't even notice.

root@einstein:~# strace mysql
...
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl64(3, F_SETFL, O_RDONLY)           = 0
fcntl64(3, F_GETFL)                     = 0x2 (flags O_RDWR)
connect(3, {sa_family=AF_FILE, path="/var/run/mysqld/mysqld.sock"}, 110) = 0
setsockopt(3, SOL_IP, IP_TOS, [8], 4)   = -1 EOPNOTSUPP (Operation not supported)
setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO, "\2003\341\1\0\0\0\0", 8) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO, "\2003\341\1\0\0\0\0", 8) = 0
read(3,  <unfinished ...>

Full strace logs are attached as strace_mysql_*

As mysql is currently completely unusable (and even the system fails to boot if the startscript does not return on starting the server) any comments would be highly appreciated.

strace of mysql client trying to connect to server

Attachment: strace_mysql_client_5.0.38_debian.txt (text/plain), 14.32 KiB.

strace of all server threads, none responding to request

Attachment: strace_mysql_server_5.0.38_debian.txt (text/plain), 3.14 KiB.

It does not look like the mysqld is one of the processes being straced in the server log.

The process being traced is in a loop "poll(POLL_IN, 2000)" and that is normally seen in a client trying to connect - the client will first do a connect and if it has "connect_timeout" option set it will enter 'vio_poll_read' which will timeout , in this case after 2 seconds.

Process 11568 is probably some mysql client program used by mysqld_safe to detect if mysqld is up or down.

Could we maybe try to see if it's possible to connect to mysqld using TCP.

Try "mysql -P 3306 -h 127.0.0.1" to determine if it's only when using unix sockets there is a problem.

It does not look like the mysqld is one of the processes being straced in the server log.

The process being traced is in a loop "poll(POLL_IN, 2000)" and that is normally seen in a client trying to connect - the client will first do a connect and if it has "connect_timeout" option set it will enter 'vio_poll_read' which will timeout , in this case after 2 seconds.

Process 11568 is probably some mysql client program used by mysqld_safe to detect if mysqld is up or down.

Could we maybe try to see if it's possible to connect to mysqld using TCP.

Try "mysql -P 3306 -h 127.0.0.1" to determine if it's only when using unix sockets there is a problem.

It does not look like the mysqld is one of the processes being straced in the server log.

The process being traced is in a loop "poll(POLL_IN, 2000)" and that is normally seen in a client trying to connect - the client will first do a connect and if it has "connect_timeout" option set it will enter 'vio_poll_read' which will timeout , in this case after 2 seconds.

Process 11568 is probably some mysql client program used by mysqld_safe to detect if mysqld is up or down.

Could we maybe try to see if it's possible to connect to mysqld using TCP.

Try "mysql -P 3306 -h 127.0.0.1" to determine if it's only when using unix sockets there is a problem.

Please also check the servers error log.

It is the mysqld being straced. If I understand correctly the poll/getppid  lop is the manager thread which is part of the LinuxThreads implementation.

It the attachment strace_mysql_server_5.0.38_debian.txt it is the mysqld beeing traced. The lowest pid 11523 is the master process (running as user mysqld), all others seem to be the worker threads beeing started by the master.

Only one child (pid 11568) seems to be hanging now waiting forever. I can't see from the trace what fd=3 actually is, but if it's e.g. STDIN it maybe waiting for user input, without letting the user (me) know.

From what I remember using TCP did not change anything (no reaction by the server process), but I'll try again asap.

I forgot to add... 

Oddly the mysql error log is always empty (0 bytes in size).

I've enabled mysql.log and mysql-slow.log too, they just write the initial header and that's it. (I'll verify again asap).

This is what it look slike in gdb when deadlock occurs during start of signal handler. The main thread that called 'start_signal_handler' is in pthread_cond_wait, thus the mutex should be free for the signal handler to lock.  But the signal handler thread hangs waiting for the same mutex.

This is a debug build which is using safe_mutex, it should complain if the mutex /condition is not initialized or if trying to unlock an already locked mutex.

With a little different timing(achieved by running mysqld in gdb for example) the program might pass this point but will instead hang at a similar construct later on.

(gdb) info thread
  3 Thread 32770 (LWP 2836)  0x40038604 in __pthread_sigsuspend ()
   from /lib/libpthread.so.0
  2 Thread 32769 (LWP 2834)  0x401b4a5a in poll () from /lib/libc.so.6
  1 Thread 16384 (LWP 2833)  0x40038604 in __pthread_sigsuspend ()
   from /lib/libpthread.so.0
(gdb) thread apply all bt

Thread 3 (Thread 32770 (LWP 2836)):
#0  0x40038604 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x400383c8 in __pthread_wait_for_restart_signal ()
   from /lib/libpthread.so.0
#2  0x40039a6b in __pthread_lock () from /lib/libpthread.so.0
#3  0x40036b5d in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x083aaf83 in safe_mutex_lock (mp=0x84da280, file=0x83d681e "mysqld.cc", 
    line=2358) at thr_mutex.c:115
#5  0x081699fd in signal_hand (arg=0x0) at mysqld.cc:2358
#6  0x40035e51 in pthread_start_thread () from /lib/libpthread.so.0
#7  0x401bd8aa in clone () from /lib/libc.so.6

Thread 2 (Thread 32769 (LWP 2834)):
#0  0x401b4a5a in poll () from /lib/libc.so.6
#1  0x40035b50 in __pthread_manager () from /lib/libpthread.so.0
#2  0x401bd8aa in clone () from /lib/libc.so.6

Thread 1 (Thread 16384 (LWP 2833)):
#0  0x40038604 in __pthread_sigsuspend () from /lib/libpthread.so.0
#1  0x400383c8 in __pthread_wait_for_restart_signal ()
   from /lib/libpthread.so.0
#2  0x40034f2b in pthread_cond_wait@GLIBC_2.0 () from /lib/libpthread.so.0
#3  0x083ab182 in safe_cond_wait (cond=0x84da760, mp=0x84da280, 
    file=0x83d681e "mysqld.cc", line=2287) at thr_mutex.c:202
#4  0x081697aa in start_signal_handler () at mysqld.cc:2287
#5  0x0816b1cb in main (argc=5, argv=0xbffffd64) at mysqld.cc:3604
#0  0x4

Running under strace shows one interesting thing and that is "kill(3586, SIGRTMIN)" returns "EPERM (Operation not permitted)" Could this be a hint why the pthread manager thread fails to wake up the other thread when lock is released.

$>strace ./mysqld --skip-grant --debug --user=mysql --skip-innodb
<snip>
sched_get_priority_max(SCHED_OTHER)     = 0
sched_get_priority_min(SCHED_OTHER)     = 0
write(5, "T@16384: | info: Locking LOCK_th"..., 43) = 43
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT QUIT PIPE TERM TSTP RTMIN], 8) = 0
write(4, "`pJ\10\0\0\0\0\320\372\377\277b`\f\10\0\0\0\0\7P\10\200"..., 148) = 14
8
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT QUIT PIPE TERM TSTP RTMIN], 8) = 0
rt_sigsuspend([HUP INT QUIT PIPE TERM TSTP] <unfinished ...>
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
<... rt_sigsuspend resumed> )           = -1 EINTR (Interrupted system call)
sigreturn()                             = ? (mask now [HUP INT QUIT PIPE TERM TS
TP RTMIN])
write(5, "T@16384: | info: Waiting for CON"..., 47) = 47
kill(3586, SIGRTMIN)                    = -1 EPERM (Operation not permitted)
rt_sigprocmask(SIG_SETMASK, NULL, [HUP INT QUIT PIPE TERM TSTP RTMIN], 8) = 0
rt_sigsuspend([HUP INT QUIT PIPE TERM TSTP] <unfinished ...>

The EPERM error from kill(<pid>, SIGRTMIN) probably indicates that the manager thread fails to send the signal because it's owned by another user, in this case mysqld was started as root with the argument --user=mysql causing it to setgid and setuid. By using "ps -ef" it's possible to see that only the first process/lwp is actually owned by "mysql" the others are root.
UID        PID  PPID  C STIME TTY          TIME CMD
mysql    30176     1  1 10:51 pts/0    00:00:00 ./mysqld <args>
root     30177 30176  0 10:51 pts/0    00:00:00 ./mysqld <args>
root     30179 30177  0 10:51 pts/0    00:00:00 ./mysqld <args>

Starting mysqld as user mysql" after "su mysql" will cause all the threads to be owned by "mysql" user - and that seems to work fine.
UID        PID  PPID  C STIME TTY          TIME CMD
mysql    24843  3700  0 10:36 pts/0    00:00:00 ./mysqld <args>
mysql    24844 24843  0 10:36 pts/0    00:00:00 ./mysqld <args>
mysql    24846 24844  0 10:36 pts/0    00:00:00 ./mysqld <args>

Accoding to the last post I checked my configuration and found only the master process to be owned by mysql, all other by root.

The my.cnf seemed to have been changed to use

=== snip start === /etc/mysql/my.cnf ===
[mysqld]
user = mysql
=== snip end ===

After changing that to "user = root" everything works fine.

So the code dropping the privileges is rather awkward here as it seems the threads are started before privileges are dropped and then the master process drops it's privileges and suffers from permission problems in signal handling.

This should be fixed asap, as a normal config update can results in a complete unusable database.

For me it's working, I'll inform debian folks (create a bug there).

There is a workaround to do "su mysql" before starting the mysqld. Or if starting mysqld as root - let it run as root with "mysqld --user=root"

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28262

ChangeSet@1.2506, 2007-06-07 09:48:37+02:00, msvensson@pilot.(none) +1 -0
  Bug#28690 mysql-enterprise-gpl-5.0.40-linux-i686-glibc won't start on Debian Sarge
   - Only use the "hack for bug in NTPL" if using NTPL, by dynamically
     checking the thd_lib_detected flag

Bug #28210 was marked as a duplicate of this one.

Pushed into 5.0.44

Pushed into 5.1.20-beta

Noted in 5.0.44, 5.1.20 changelogs.

The "manager thread" of the LinuxThreads implementation was
unintentionally started before mysqld had dropped privileges (to
run as an unprivileged user). This caused signaling between threads
in mysqld to fail when the privileges were finally dropped.