Bug #868 on linux with NPTL, mysqld hangs under high load
Submitted: 17 Jul 2003 13:44
Modified: 30 Nov 2006 12:56
Reporter: elaine forbes
Status: No Feedback
Category: MySQL Server
Severity: S2 (Serious)
Version: 3.23.54-log, current RH/mysql rpms
OS: Linux (Redhat 9.0, Lunar linux)
Assigned to:
CPU Architecture: Any

[17 Jul 2003 13:44] elaine forbes
Description:
Experienced problems with mysqld hanging: it stopped responding to queries and could not be shut down during high-load benchmark testing.

Hardware: ibm 2-way SMP netfinity 5100

case 1
On redhat 9 (which I believe uses a back-port of the kernel 2.5 and glibc
nptl threads implementation) the hang would occur after about an hour of 
continuous high-load activity.

The server would not respond to the redhat /etc/init.d/mysql shutdown script
and prevented server shutdown. mysqld could only be stopped with SIGKILL.

I replicated this behavior with mysql as shipped with RH9, the RPMs from 
mysql.com dated Jun7 '03 and mysql built from source.

case 2
On 'Lunar' linux, after installing kernel 2.5.74 and glibc with the most recent
NPTL thread library code (0.52), I experienced what seemed to be the same problem,
except that at loads on the order of 800 simultaneous clients the server would
hang in under 20 minutes.

How to repeat:
The benchmark consisted of a webpage using php+mysql for recording cookies. It
is a simple set of queries, check for the presence of a cookie on the client, 
and create a db record if none exists. 

The benchmark was run using 'siege' which creates an arbitrary number of 
simultaneous client connections for a fixed period of time. The db was prone 
to become unstable at 500 or more simultaneous connections at which point 
it was serving out around 18,000 connections per minute.
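
A minimal sketch of the workload described above: many simultaneous clients, each checking for a cookie record and creating one if none exists. This is not the reporter's php+mysql page. The table and column names are hypothetical, and sqlite3 is used only as a stand-in so the sketch runs without a server; a real reproduction attempt would open one MySQL connection per client instead of sharing a locked connection.

```python
import sqlite3
import threading
import uuid

# Hypothetical stand-in schema; the report does not give the real table layout.
SCHEMA = "CREATE TABLE IF NOT EXISTS cookies (cookie_id TEXT PRIMARY KEY)"


def record_cookie(conn, lock, cookie_id):
    """Check for a cookie record and create one if none exists.

    Returns True if a new row was inserted, False if it already existed.
    """
    with lock:  # sqlite3 stand-in only: serialise use of the shared connection
        row = conn.execute(
            "SELECT 1 FROM cookies WHERE cookie_id = ?", (cookie_id,)
        ).fetchone()
        if row is not None:
            return False
        conn.execute("INSERT INTO cookies (cookie_id) VALUES (?)", (cookie_id,))
        conn.commit()
        return True


def run_clients(conn, n_clients, requests_per_client):
    """Run n_clients threads; each sends requests_per_client new-cookie requests.

    Returns the number of rows inserted across all clients.
    """
    lock = threading.Lock()
    results = []

    def client():
        for _ in range(requests_per_client):
            results.append(record_cookie(conn, lock, uuid.uuid4().hex))

    threads = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(SCHEMA)
    print(run_clients(conn, n_clients=50, requests_per_client=20))
```

The client count and request count are parameters so the load can be raised toward the 500+ simultaneous connections at which the instability was seen.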

No instability was experienced on kernel 2.4.19|20 with glibc 2.3.2/linuxthreads.
I am currently going back to test kernel 2.5.74 with glibc/linuxthreads, but I'm
pretty certain the issue lies with NPTL, though I don't know whether it is an NPTL
code/build/configuration issue, or something in the NPTL thread model which
doesn't work well with mysqld.

I suppose I can attach gdb to the running mysqld and get a backtrace after it's 
hung but haven't done so yet.
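
One possible way to collect that backtrace, sketched under assumptions: gdb is installed, the caller has permission to attach to mysqld, and the PID is known. The helper names are hypothetical. It uses gdb's -p, -batch and -ex options to attach, print a backtrace for every thread, and exit.

```python
import shlex
import subprocess


def gdb_backtrace_argv(pid):
    """Build a gdb command line that dumps a backtrace of every thread in pid."""
    return [
        "gdb",
        "-p", str(int(pid)),           # attach to the running process
        "-batch",                      # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # backtrace for all threads
    ]


def capture_backtrace(pid):
    """Run gdb against pid and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_argv(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # 12345 is a placeholder PID; substitute the PID of the hung mysqld.
    print(" ".join(shlex.quote(arg) for arg in gdb_backtrace_argv(12345)))
```

Saving the output of capture_backtrace() to a file while the server is hung would give something concrete to attach to the report.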

Suggested fix:

None at this time.
[21 Jul 2003 5:52] Alexander Keremidarski
Please provide as many details as possible so we can repeat this problem.

Did you try the same test with the RedHat hack which is supposed to turn off NPTL?

export LD_ASSUME_KERNEL=2.2.5; mysqld_safe &
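
The same workaround can be scripted so that the variable is set only for the server process and not for the whole shell. This is a sketch with hypothetical helper names. It assumes mysqld_safe is on the PATH, and whether LD_ASSUME_KERNEL has any effect depends on the glibc build installed. The first helper asks glibc which thread library it reports for the current process, which is a quick way to see whether the host uses NPTL at all.

```python
import os
import subprocess


def pthread_implementation():
    """Return glibc's thread library string for this process, or None if unavailable."""
    try:
        return os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError, AttributeError):
        return None  # not glibc, or the name is unknown on this platform


def env_with_assumed_kernel(base_env, version="2.2.5"):
    """Copy base_env and set LD_ASSUME_KERNEL, leaving the original untouched."""
    env = dict(base_env)
    env["LD_ASSUME_KERNEL"] = version
    return env


def start_mysqld_safe(command="mysqld_safe"):
    """Start mysqld_safe in the background with LD_ASSUME_KERNEL set for it only."""
    return subprocess.Popen([command], env=env_with_assumed_kernel(os.environ))


if __name__ == "__main__":
    print("thread library:", pthread_implementation())
    print("LD_ASSUME_KERNEL =", env_with_assumed_kernel({})["LD_ASSUME_KERNEL"])
```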
[11 Sep 2003 7:41] elaine forbes
Apologies for the delay in getting back to you on this.

I've not had the time to reboot this box to redhat, however I'm sure that your
suggested work-around of:

export LD_ASSUME_KERNEL=2.2.5; mysqld_safe &

would work, as the issue replicated more or less exactly on a 2.5 kernel with NPTL.

I would *like* to be running/testing mysql fully in an NPTL enabled environment
however thus far I've not had much success building mysql from source against
NPTL headers and libraries. The MySQL binary does run a good bit faster on NPTL, and I assume that once it's compiled specifically to use NPTL the performance gain will be greater.

I see you've marked this as 'reproduced' so unless you ask I'm not going to 
attach the php+apache+mysql configuration in which I found the problem.
[21 Mar 2004 8:45] [ name withheld ]
It seems we have a similar problem here. MySQL randomly hangs on an SMP system (dual Xeon) with Fedora Core 1. Afaik this also features NPTL threads, since it's the successor of RedHat 9. The hangs are not reproducible, however, and also occur during off-load times. Here the MySQL version is 4.0.17.

PS: Also mysql can't be shut down cleanly. It doesn't respond to connects or to a clean shutdown. Only killing it helps :-(
[17 May 2004 23:34] Steve Meyers
Our experience seems to agree with what has been posted.  Specifically, we did not have the problem when running 4.0.17 on RH 7.3.  We upgraded to Fedora Core 1, and MySQL 4.0.18.  We started having the problem approximately every one and a half weeks.  Whenever the hang happens, after we kill it, we end up with database corruption.  Fortunately, we use replication, and have always been able to recover.  The problem has only ever happened on master or slave servers.

One interesting side note is that if you strace the right process, the system will recover.  However, we have still had database corruption when we did this.

We currently have a spare replicated server live for the express purpose of recovering from this specific failure quickly.  We would be glad to leave it running next time we experience this issue, to let someone have a look at it.

One last thing - we have experienced both under our heaviest load, and under (relatively) light load.
[24 Jun 2004 21:01] [ name withheld ]
Could this be the same problem as http://www.blackdown.org/java-linux/java-linux@java.blackdown.org/java-linux-msg00089.html ?
[14 Feb 2005 22:54] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[29 May 2006 18:06] Valeriy Kravchuk
All reporters:

Does anybody still have similar problems with 2.6.x kernels, modern versions of glibc/NPTL and latest versions of MySQL server (3.23.58, 4.0.27 or newer)?
[29 Jun 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[30 Oct 2006 17:52] jocelyn fournier
Hi,

I'm experiencing exactly the same issue on an x86-64 server on Suse 10.1 + Glibc 2.4 (NPTL).
Under high load / QPS, all the queries accumulate in the processlist with a NULL status, and only a few are stuck in update/end status.
The problem has been reproduced with 5.0.26 and 5.1.11-beta.

Regards,
  Jocelyn
[30 Oct 2006 17:59] jocelyn fournier
Here is the output of SHOW FULL PROCESSLIST when the server is stuck:

Id        User        Host        db        Command        Time        State        Info
1        event_scheduler        localhost        NULL        Connect        28735        Suspended        NULL
409        slave        192.168.222.5:40909        NULL        Binlog Dump        25050        Has sent all binlog to slave; waiting for binlog to be updated        NULL

[...]
LEFT OUTER JOIN connectors c ON c.id=i.connectorid WHERE c.language='de' AND i.status<9 ORDER BY i.status DESC,i.created DESC LIMIT 100
22019        wikalsql        192.168.222.1:18035        wikal        Query        615        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=21
22364        wikalsql        192.168.222.1:18390        wikal        Execute        610        end        UPDATE thesaurus SET name='Política - Partidos politicos - PSOE - Manuel Marí',language='ES',description='',keywords='in_title \\"Manuel Marín González\\"\\r\\nin_title \\"Manuel Marín\\"',industrial='',person='',global=1,created='2006-10-30 14:31:52',createdby='Wikio',modified='2006-10-30 17:17:52',modifiedby='phermouet' WHERE id=58324
22607        wikalsql        192.168.222.1:18741        wikal        Query        613        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=21
22635        wikalsql        192.168.222.1:18764        wikal        Query        615        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=21
22636        wikalsql        192.168.222.1:18766        wikal        Query        617        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=6
22652        wikalsql        192.168.222.64:19765        wikal        Query        617        NULL        SELECT id, lastCapture FROM packages_totreat where status=0 ORDER BY priority ASC, dateCreated ASC LIMIT 4
22735        wikalsql        192.168.222.1:18968        wikal        Query        615        NULL        SELECT id FROM blacklist WHERE bltype=4 AND mask='192.168.222.2'
22739        wikalsql        192.168.222.1:18974        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=4 AND mask='192.168.222.2'
22740        wikalsql        192.168.222.1:18979        wikal        Query        616        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=21
22742        wikalsql        192.168.222.1:18980        wikal        Query        614        NULL        SELECT id FROM blacklist WHERE bltype=4 AND mask='192.168.222.2'
22829        wikalsql        192.168.222.6:11652        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://www.crunkrockradio.com"
22831        wikalsql        192.168.222.6:5836        wikal        Query        614        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://www.lernzeit.de"
22902        wikalsql        192.168.222.6:4054        wikal        Execute        560        end        UPDATE servers_infos set value='314' where id=6 and param='nbConnectorsCaptured(300000ms)'
22939        wikalsql        192.168.222.1:19292        wikal        Query        615        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=23
22940        wikalsql        192.168.222.1:19293        wikal        Query        617        update        INSERT INTO infos_misc (infoid,defcateg) VALUES (7655554,4093)
23001        wikalsql        192.168.222.64:9414        wikal        Query        617        NULL        SELECT categ FROM packages WHERE id=65890
23141        wikalsql        192.168.222.6:29052        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://www.wfp.org"
23197        wikalsql        192.168.222.6:25226        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://www.conservative.ca"
23205        wikalsql        192.168.222.6:18383        wikal        Query        616        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://paris-photographie.com"
23209        wikalsql        192.168.222.6:9266        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://googlesystem.blogspot.com"
23279        wikalsql        192.168.222.1:19799        wikal        Query        615        NULL        SELECT labelid,label FROM labels WHERE language="fr" AND groupid=21
23281        wikalsql        192.168.222.1:19802        wikal        Query        616        NULL        SELECT id FROM blacklist WHERE bltype=4 AND mask='192.168.222.2'
23282        wikalsql        192.168.222.1:19803        wikal        Query        615        NULL        SELECT id FROM blacklist WHERE bltype=4 AND mask='192.168.222.2'
23287        wikalsql        192.168.222.6:20136        wikal        Query        617        NULL        SELECT id FROM blacklist WHERE bltype=1 AND mask="http://feeds.feedburner.com"
23296        wikalsql        192.168.222.64:10980        wikal        Query        616        NULL        SELECT categ FROM packages WHERE id=41724
23298        wikalsql        192.168.222.64:20514        wikal        Query        617        NULL        SELECT categ FROM packages WHERE id=41014
23309        wikalsql        192.168.222.6:17609        wikal        Query        617        NULL        SELECT lastcapture FROM connectors_stats WHERE connectorid=22648
23312        wikalsql        192.168.222.6:16215        wikal        Query
23561        root        localhost        NULL        Query        0        NULL        show full processlist
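
A small sketch for summarising a listing like this one by State, to make the "mostly NULL, a few in update/end" pattern easy to see at a glance. The helper name is hypothetical and the rows are a hand-copied sample of five entries from the output above, not the full list. State is kept as the string printed in the listing, so NULL appears as "NULL".

```python
from collections import Counter


def tally_states(rows):
    """Count processlist rows by their State column."""
    return Counter(row["State"] for row in rows)


# Five rows copied from the listing above (Id, Command, Time, State only).
SAMPLE_ROWS = [
    {"Id": 22019, "Command": "Query", "Time": 615, "State": "NULL"},
    {"Id": 22364, "Command": "Execute", "Time": 610, "State": "end"},
    {"Id": 22902, "Command": "Execute", "Time": 560, "State": "end"},
    {"Id": 22940, "Command": "Query", "Time": 617, "State": "update"},
    {"Id": 23001, "Command": "Query", "Time": 617, "State": "NULL"},
]

if __name__ == "__main__":
    for state, count in tally_states(SAMPLE_ROWS).most_common():
        print(state, count)
```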

I have no clue how to reproduce it; it seems to occur randomly under high load.

Thanks,
  Jocelyn
[30 Oct 2006 18:31] MySQL Verification Team
Jocelyn,

We very much need a repeatable test case for this.

Can you use sysbench, mysqlslap or similar tools in order to create one?

We would be very grateful if that could be done.

Sinisa Milivojevic
[31 Oct 2006 9:27] jocelyn fournier
Hi Sinisa,

I failed to reproduce the problem with sysbench with 300 parallel threads running on a table with 1M rows.
I'll try to see if I can modify sysbench to run the queries used by the application.
(Ideally it would be great if sysbench were able to parse a mysql log file and generate random queries based on what it has read in the log.)

We'll also try to replay the binary log up to the failing point, but since the binary log doesn't contain SELECTs and the replay is not done in parallel threads, I think it will not fail.
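
A rough sketch of the log-parsing idea, independent of sysbench. It assumes a general query log whose entries look like an optional "YYMMDD HH:MM:SS" timestamp, a connection id, a command name and the statement text, with long statements continuing on the following lines. That layout and the command list are assumptions and vary between server versions; the function name is hypothetical. It pulls out the SELECT statements so they can be fed to parallel clients.

```python
import re

# Assumed, non-exhaustive list of command names that start a log entry.
COMMANDS = ("Connect", "Init DB", "Query", "Quit", "Prepare", "Execute", "Field List")
ENTRY = re.compile(
    r"^(?:\d{6}\s+\d{1,2}:\d{2}:\d{2})?\s*(\d+)\s+("
    + "|".join(COMMANDS)
    + r")\s*(.*)$"
)


def extract_selects(log_lines):
    """Return the SELECT statements found in 'Query' entries of a general log."""
    statements = []
    current = None  # text of the Query entry being assembled, if any
    for line in log_lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match:
            if current is not None:
                statements.append(current)
            command, argument = match.group(2), match.group(3)
            current = argument if command == "Query" else None
        elif current is not None:
            # Continuation line of a multi-line statement.
            current += " " + line.strip()
    if current is not None:
        statements.append(current)
    return [s for s in statements if s.lstrip().upper().startswith("SELECT")]


if __name__ == "__main__":
    sample = [
        "061030 17:17:52\t  22019 Query\tSELECT labelid,label FROM\n",
        "labels WHERE language=\"fr\" AND groupid=21\n",
        "\t\t  23001 Query\tSELECT categ FROM packages WHERE id=65890\n",
    ]
    for statement in extract_selects(sample):
        print(statement)
```

The extracted statements could then be distributed across many client threads to approximate the application's SELECT traffic, which a binary log replay cannot do.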

Thanks,
  Jocelyn
[1 Dec 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".