MySQL Bugs: #8555: INSERT queries hang indefinitely on AMD64

Bug #8555	INSERT queries hang indefinitely on AMD64
Submitted:	16 Feb 2005 19:29	Modified:	19 Jul 2005 0:16
Reporter:	Michel Buijsman	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	4.1.10-x86-64	OS:	Linux (Debian Linux 3.1 pure64)
Assigned to:	Sergei Golubchik	CPU Architecture:	Any

Description:
First of all, the symptoms are extremely similar to the ones Don MacAskill reported in bug 3483.

Inserts (and only inserts) hang on various tables, not all the time but often enough to be unusable as a database server. The hanging inserts can't be killed within mysql, killing the process with mysqladmin doesn't work, the processlist then shows the command as "Killed". Shutting down mysqld doesn't work either, I have to kill -9 it from the operating system.

I've tried 4.0.23, 4.1.9, 4.1.10 (all as x86-64 binary, also some of them compiled) on two identical dual opterons with 8 gig ram running Debian 3.1 pure64 gcc3.4. Tried kernels 2.6.11-rc1-mm1 and 2.6.11-rc2. The servers are/were running as a replication master/slave pair.

Both the master and the slave have this problem.

How to repeat:
I haven't been able to pin it down completely.

On a standalone or replication master it was difficult to reproduce in a test case. Using real data I got it to hang by throwing lots of queries at it, I have a quick test script in perl that basically forks 20 processes and starts hammering the server with a random query out of 4 selects and an insert, got the load up pretty good but it keeps on chugging through but the select queries take a long time to run (>30 seconds). Then after I run a repair table, the first insert hangs and the rest of them are locked.

On a replication slave off of the live master it's a matter of seconds before the replication thread hangs. It's literally a matter of start mysqld, do a processlist, and watch it hanging. That's on an otherwise totally unused server.

Dropping all indexes solved the problem. All the affected tables (that I know of) have multiple indexes.

Suggested fix:
Unknown.

A workaround is to add "skip-concurrent-insert", that makes the problem go away. As does dropping all indexes from the table, but both also cripple performance so that's not really an option.

Hi

Can you please attatch your perl test script to this issue, as well as the show-create statements of your tables so that we can try and reproduce this problem ?

Thanks

HI!

We truly require a repeatable test case in order to be able to verify and / or fix this bug.

mysqldump of 2 tables incl data

Attachment: dump.sql.gz (application/x-gzip, text), 161.62 KiB.

Test script

Attachment: stresstest.pl (text/plain), 1.08 KiB.

Attached my test script, and a gzipped mysqldump of the test case including some data.

Sometimes just running the test script will hang mysql all by itself, but most of the time it needs a few runs of "repair table testing" before it hangs.

Hi,

Thank you for the report, but I can't reproduce it.
Can you upload your my.cnf file?

This is it. I've been toying with a few things, but this is the one I used with the test case I uploaded here.

[mysqld]
server-id         = 10
#master-host       = x.x.x.x
#master-port       = x
#master-user       = x
#master-password   = x
#master-connect-retry = 30
#report-host       = x.x.x.x
#replicate-do-db   = x

relay-log   = /var/lib/mysql/relay.log

log-error=/var/log/mysql.error
user=mysql
socket=/tmp/mysql.sock

skip-locking
set-variable    = key_buffer=2G
set-variable    = table_cache=1024
set-variable    = sort_buffer=16M
set-variable    = read_buffer=16M
set-variable    = max_allowed_packet=10M
set-variable    = thread_cache=64
set-variable    = thread_stack=512K
set-variable    = tmp_table_size=16M
set-variable    = interactive_timeout=600
set-variable    = wait_timeout=600

set-variable    = max_connections=1024

set-variable   = query_cache_type=2
set-variable   = query_cache_size=100M

set-variable    = join_buffer_size=8M
set-variable   = thread_concurrency=4
set-variable    = myisam_sort_buffer_size=64M

The two systems are identical. Here are the specs:

Tyan Transport GX28
2x AMD Opteron 250
4x Seagate Cheetah 15K.3 36GB
8x 1GB ECC

Kernel: 2.6.11-rc2 64bit only
OS: Debian pure64 gcc-3.4 (up to date)

/boot is raid1, ext2
/ is raid10, reiserfs

What hw/sw specs do you guys use when you try to reproduce?

Next thing we will try is run all this on one disk ext2 (no raid, no reiserfs),
and if that doesn't work, we'll try kernel 2.4 if possible.

Has any progress been made on this? I have a dual 2.4GHz Opteron sitting idle until this comes online. Is there a fix available to supported or licensed users? Has anyone found a possible work around? 

Does running it within a debootstrap in i386 mode fix this problem (just a stab, it's next on my list of things to try). Please email me at joe at joestump dot net if anyone has made progress on this.

We're now running in 32 bit mode on the amd64, and that seems to work fine. One of the servers is now running as a replication slave with live data and it's actually replicating which is more than I got when running a 64 bit OS.

I've done some head-to-head speed tests (2 identical servers, one with a 64 bit install and one with a regular 32 bit debian, both running mysql 4.1.10 with identical configs) and the 32 bit version is about 10% slower.

Hi,

I was still unable to reproduce these results in our AMD64 platform using 4.1.11 from bk.  Is your raid software based or hardware based ?

Hello,

> I've tried 4.0.23, 4.1.9, 4.1.10 (all as x86-64 binary, also some of them
> compiled) on two identical dual opterons with 8 gig ram running Debian 3.1
> pure64 gcc3.4. Tried kernels 2.6.11-rc1-mm1 and 2.6.11-rc2. The servers are/were
> running as a replication master/slave pair.

I have a couple of questions:

1. Are you aware the gcc-3.4/gcc-4.0 based debian.amd64 port is highly experimental and __not recommended__ for any kind of production system, considering it is compiled with an unreleased compiler version known to produce buggy code?

2. Why did you not send a mail to the debian-amd64@lists.debian.org about this topic?

3. Which are the corresponding bug numbers in the debian bug tracking system against the concerned mysql versions? I did not find any.

4. Can you give pure64 (the gcc-3.3 based one) a try, and check if the bug is present there too?

Kind regards
Frederik Schueler

> 1. Are you aware the gcc-3.4/gcc-4.0 based debian.amd64 port is highly
> experimental and __not recommended__ for any kind of production system,
> considering it is compiled with an unreleased compiler version known to produce 
> buggy code?

I wasn't, no.

> 2. Why did you not send a mail to the debian-amd64@lists.debian.org about this
> topic?

Since the OS itself was rock solid and it was mysql that was acting up in strange ways, the thought didn't really cross my mind that it might be an operating system bug.

> 3. Which are the corresponding bug numbers in the debian bug tracking system
> against the concerned mysql versions? I did not find any.

There aren't any. I'm not using a debian package for mysql.

> 4. Can you give pure64 (the gcc-3.3 based one) a try, and check if the bug is
> present there too?

I'd love to, but the servers in question were customer machines and I was already past the deadline on that project. So they've gone into production, running a regular 32 bit debian install...

I seem to have this bug too :|

The machine is a DELL 2850: Dual Xeon EM64T w/ hyperthreading, 4Go RAM, hw raid (megaraid 2)
Linux version: 2.6.11.7 (compiled with gcc 3.4)
Debian version: Sarge pure64 (gcc 3.3)

Tested with mysql 4.0.24 AMD64, both gcc and icc.

I encountered the bug while attempting to set this host as a slave, so I have a full binary DB copy with which I can reproduce the bug.

The bug strikes when I start the replication thread, and the server is otherwise idle: there is no concurent access to any table, only the slave thread is running.

The bug seems to strike on two tables from what I could see, ie. sometimes it would lock on a first table, sometimes past that one and lock on another. I checked and repaire the tables with various myisamchk options.

'ps' makes it look like the mysql processes aren't doing anything (apart from the relay thread picking up binlog from the master from time to time), ie processes are Sleeping.

I'm available for more testing / reproducing as required.

Stefan

hello,

I had the opportunity to run a long-time stress test of mysql-4.1.11 as packaged in debian/sid the past weekend: 10 clients (4G dual xeons with debian-amd64 sarge) running the sql-bench suite against a quad opteron with 16G ram (debian-pure64 sarge too). The suite run approx 3 days (friday morning to monday morning), each run took between 200 and 215 minutes per client node, and there where no failed tests. The database run a total of 497M queries.

We did some testing with replication too, having the quad opteron replicate to a dual opteron and a dual xeon while being treated by 10 concurrent sql-bench runs, without experiencing problems here. A long-time run of this test is still to be done.

Kind regards
Frederik Schueler

Just to let you know that it also happens with another distrib! Precisely on a DELL PowerEdge 8500 (dual xeon64, no raid just ATA) with Mandrake 10.2 for x86_64 and MySQL-4.1.11.x86_64.
All requests (including Select, Create temporary table, ...) were hanging, as if all databases were "locked", and so on until I kill-9 mysqld (since I failed with mysqladmin)... 
BTW, we did *not* encounter the problem with Mandrake 10.1 for x86-64 and MySQL-4.0.20.x86-64, but the tables were rebuild from latin1 to utf-8 for 4.1.11
We have various Insert queries from JDBC and PHP, it happens randomly but at least 1/week and up to 2/day!
I will try the i586 version in a few days hopping it is better...

I have been able to verify this on a dual cpu debian system (at least i think its the same issue), on this system the easiest way to reproduce it is to run "make test",
this will hang after the "archive" test as the "auto_increment" test never returns

i was able to track it down to myisam/mi_write.c:107 (4.1.12 source):

  rw_wrlock(&share->key_root_lock[i]);

this call never returned

this machine had NTPL as thread library only, LinuxThreads were not installed

replacing the mysqld binary build on the machine itself with one build on another
macheine and statically linked against LinuxThreads made it pass the 
auto_increment test just fine

so this is a mysqld<->NTPL problem that shows up on AMD64 (maybe on multi-CPU
systems only) with NTPL. given that the mysql codebase passes the auto_increment
test on a lot of systems with different pthread implementations i think its NTPL that is
to blame, not mysql, but i'm not 100% sure about it

I've replaced Mysql x86_64 version by the i586 version and there is no hanging anymore (working fine since 1 month), so if it is a NTPL problem then it only happens with the 64bit mysql binary...

This looks like the glibc bug that JimW blogged about and that I mentioned in Bug#7254. 

As for other distros, my bet is they too have this glibc bug.

This is a bug with glibc for x86_64 on Debian, Ubuntu, and almost certainly the other distributions.

You can track the Ubuntu bug here: https://bugzilla.ubuntu.com/show_bug.cgi?id=11730

The Debian bug is here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=314408

You can verify the bug using the MySQL test suite, you don't have to rely on it to crop up in your own application. The auto_increment test in the test suite is the first one that will hang due to this issue.

Well, it might be glibc problem, but it seems to me, that it also has something to do with the database.

The problem is connected to myisam tables - no problems occur with innodb tables.
However the locks on myisam INSERTs and UPDATEs only occur when the MYI file is corrupted.

I had no problem with 4.1.10 or 4.1.11 (don't remember that exactly); the trouble started with an update to mysql 4.1.15 on Debian Linux AMD64 (Linux xxx 2.6.15-1-em64t-p4-smp #2 SMP Tue Mar 7 21:06:43 UTC 2006 x86_64 GNU/Linux).
I'm not sure whether these myisam tables were broken before I upgraded, but afterwards I fixed them nearly twice a day for not running into LOCKING troubles.

Any ideas?

Thank you,
Maurice.

I have to agree with Maurice, there is also something mysql specific in this issue.

We are currently running with version 4.1.11 from the debian package system (64bit bin) and have no problems with that version, but when we try to use the newste version 4.1.22 from mysql.com we have those insert that just hangs, 

We are running with a replication setup aswell, since we are running production on the system, its hard to get the time needed for testing.

Although my next shot will be to upgrade glibc on both machines.
maybe try the 32bit version of the mysql server.

Morten

We're seeing this issue as well, and we've narrowed it down somewhat. We're running on a 8 x Opteron system, but we're running in 32 bit mode (inside of a vm). Changing some search code from "INSTR" to "MATCH AGAINST" triggers hangs on INSERT, SELECT, etc. Like everyone else, this is something that we cannot reproduce in testing, but it exhibits itself in production within 12-24 hours. Once this occurs all new and open db handles become unresponsive. The only solution is to kill -9 all the mysqld processes. This happened both with 4.0.21 and 5.0.27. We're using myisam tables.  I haven't tried disabling concurrent inserts yet. Is there any update on this issue, or any other bug I should be following?

While we used Debian 3.1 and MySQL/4 from the backports, the bug occured. Upgrading libc6 to a newer version from testing fixed the problem (only update libc6 not the rest of the system!). For Debian 4.0, the problem has gone completely.

I think that this bug of course has to do something "with the database" - the newer MySQL versions probably make use of libc functions which cause the bug to happen while the older ones don't.