MySQL Bugs: #39508: INSERT queries hang indefinitely on AMD64, again

Bug #39508	INSERT queries hang indefinitely on AMD64, again
Submitted:	18 Sep 2008 2:29	Modified:	6 Apr 2010 12:08
Reporter:	Eric Jensen
Status:	No Feedback
Category:	MySQL Server: General	Severity:	S1 (Critical)
Version:	5.0.67-x86-64	OS:	Linux (Debian Linux 4.0 x86_64)
Assigned to:		CPU Architecture:	Any

Description:
I am experiencing the same symptoms described in http://bugs.mysql.com/bug.php?id=8555, except that that bug was closed and thought to have been an old glibc bug.  My replication thread hangs indefinitely on an insert that can't be killed, except by stopping the server with kill -9.  My application uses MyISAM tables exclusively.

However, this used to be a bug that would happen every month or so on a single box in production.  This morning it happened on five separate servers in production within about a half hour timeframe.  Each of them was serving about 400 queries per second (150 of which are writes from replication) when their replication threads just hung.  They could still serve queries, but replication could not be stopped or restarted.  I did have one read slave that was not serving query load, and it was the only box to not die with the same problem.  

Although I cannot explain why this happened to every server around the same time (I can't find anything unusual about our database or general load), I believe what has changed about my application to make this more prevalent is that I am querying mysql with higher concurrency on a larger number of connections.  I had about 500 threads connected instead of about 100 previously, and I am using code similar to http://github.com/espace/mysqlplus to query mysql asynchronously across up to four connections at once (spawned back to back) from a single rails instance.

The only thing that stands out to me in my post-mortem of this is that the Key_write_requests rate drops off significantly about 30 minutes before replication hangs on each of my hung hosts.  I do not know what this would mean...
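
One way to pin down when such a drop-off begins is to turn periodic samples of the counter into per-second rates and flag the first interval that falls well below the earlier average. The following is only a minimal sketch: the function names, the 50% threshold, and the sample values are all made up for illustration and are not taken from the attached logs.

```python
def rates(samples):
    """Convert (unix_time, counter_value) samples into (time, per-second rate)."""
    out = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        out.append((t1, (v1 - v0) / (t1 - t0)))
    return out


def first_drop(samples, fraction=0.5):
    """Return the end time of the first interval whose rate is below
    `fraction` of the mean rate of all earlier intervals, or None."""
    seen = []
    for t, rate in rates(samples):
        if seen and rate < fraction * (sum(seen) / len(seen)):
            return t
        seen.append(rate)
    return None


# Hypothetical Key_write_requests samples taken every 300 seconds.
samples = [(0, 1000), (300, 46000), (600, 91000), (900, 100000), (1200, 109000)]
print(first_drop(samples))  # prints 900
```

Comparing against the mean of all earlier intervals is a deliberately crude baseline; a real monitor would probably use a sliding window so that old traffic patterns age out.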

Here are all the other details I can come up with, along with the attached logs:

Summary: all five read slaves hung on different inserts, just like w7 did four days before 

* was running debian etch mysql 5.0.51a-3-log on master, 5.0.51a-9-log on slaves, except for w7 running 5.0.67-0.dotdeb.1-log 

* happened on all hosts with read load in different tables, did not happen in host with no read load, happened on w7 last

* all hosts with read load had delay_key_write=ALL (32-38% dirty blocks), one surviving w/o read load had delay_key_write=OFF

* nothing unusual in cacti, indexer pipeline log, rails log, syslog, or http query log that i can see, leading up to event

* only relay log thread and sql thread in processlist, relay log is indeed still replicating over

* show global status is changing more rapidly (Handler_read_rnd_next rises by about 200 per 5s) than i can explain by just the cacti probes, and i can't explain the writes at all (the two snapshots diffed below were taken about a minute apart):

< Handler_read_rnd_next 18319154
---
> Handler_read_rnd_next 18324278
134c134
< Handler_write 111910931
---
> Handler_write 111916034

* strace during hang shows just:

select(16, [14 15], NULL, NULL, NULL

* ran "stop slave" and it changed the server to be hung for status requests (can still connect and get processlist, and can run basic "mysqladmin status")

* both mysqld_safe and mysqld must be killed with -9 to get rid of em
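
The counter changes in the status diff quoted in the list above can be extracted mechanically. This is a rough sketch, assuming the diff comes from running plain `diff` on two saved "show global status" outputs and that the snapshots are about 60 seconds apart, as the report suggests; the helper name `counter_deltas` is hypothetical.

```python
import re

# The diff lines quoted in the report, copied verbatim.
DIFF = """\
< Handler_read_rnd_next 18319154
---
> Handler_read_rnd_next 18324278
134c134
< Handler_write 111910931
---
> Handler_write 111916034
"""


def counter_deltas(diff_text):
    """Pair up '<' (before) and '>' (after) lines and return after - before."""
    before, after = {}, {}
    for line in diff_text.splitlines():
        m = re.match(r"([<>])\s+(\w+)\s+(\d+)$", line)
        if not m:
            continue  # skip '---' separators and hunk headers such as 134c134
        side, name, value = m.groups()
        (before if side == "<" else after)[name] = int(value)
    return {name: after[name] - before[name] for name in before if name in after}


deltas = counter_deltas(DIFF)
for name, delta in deltas.items():
    # 60 is an assumed interval length in seconds, not a measured one.
    print(f"{name}: +{delta} ({delta / 60:.1f}/s)")
```

Both counters advance by a little over 5,000 between the two snapshots (5124 and 5103), which is why the writes look suspicious on a slave whose SQL thread is supposedly hung.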

How to repeat:
I cannot repeat this problem.  But, I do have the same hosts back in production with several settings changed.  Particularly, i have turned off mysqlplus-style asynchronous querying on some of them, have disabled delay_key_write on others, and have set skip-concurrent-insert on one.  I will report if it happens again on any...

Thank you for the report.

Which version of GLIBC do you use? Could you also please install on one of the slaves version 5.0.67 built by the MySQL build team and available from http://dev.mysql.com/downloads/mysql/5.0.html#downloads? I want to check whether the problem is repeatable with MySQL's binaries as well as with the Debian binaries.

we use the latest debian etch security update with libc6 2.3.6.ds1-13etch7

i will install your build on one box...mind if i use the intel compiler one?

Thank you for the feedback.

> i will install your build on one box...mind if i use the intel compiler one?

No, I don't mind. Intel compiler one should be fine.

We had this happen again on two boxes.  We gathered better stats from one of them:  w9, which had delay_key_write=ALL, ASYNC_QUERY_CONCURRENCY=4 and the debian 5.0.51a-9-log

about four hours later, w7 hangs on an insert too.  it had delay_key_write=OFF, ASYNC_QUERY_CONCURRENCY=0 and the debian 5.0.67-0.dotdeb.1-log

interestingly, w11 which had the intel mysql 5.0.67 build had no problem

and w12 which had the debian 5.0.51a-12~bpo40+1-log  build AND skip-concurrent-insert had no problem.

although it's not much to go on, i would tentatively say this is therefore a problem in the debian build when concurrent inserts are enabled (that is, when skip-concurrent-insert is not set), which would be consistent with http://bugs.mysql.com/bug.php?id=8555   perhaps you guys will find something in the post mortem info i will post to corroborate this.

We had this happen again on three of our slaves last night.  All of them had concurrent inserts turned on; the ones with it off did not have the problem.  Two of them were running Debian MySQL builds.  But, in an interesting twist, the third one had the tarball build directly from you guys:  mysql-5.0.67-linux-x86_64-icc-glibc23.tar.gz

So apparently this is not a problem with debian builds.  But, it does indeed appear to be a problem with concurrent inserts...we have now transitioned all of our hosts to skip-concurrent-insert, except the backup one, which has no read traffic

oh, also interesting was that a select in a "killed" state was caught hung too this time, and we could do "show global status" on this intel build without that hanging everything like it did previously.

We just had this happen again on w11, but with skip-concurrent-insert turned on this time!  I guess that theory is out the window.  This was with the debian build of 5.0.67

digging through the stack traces we provided, it appears this could have something to do with the query cache?  we have disabled it and await the next disaster

We have been running for a few weeks now with the query cache disabled and have not run into the problem again.  Given this and that I see the query cache in the stack traces I provided, it seems like there is indeed some deadlock potential in the query cache somewhere.  It is probably worth someone more familiar with it reading through those traces.

MySQL 5.0.68 fixes a query cache deadlock with a similar trace. It would be interesting to test whether the problem is present on 5.0.68.

We haven't run 5.0.68.  I leave it to you to determine whether this was the same problem, as we don't plan on re-enabling the query cache to test.

Thanks!

Do you guys use MyISAM's merge tables?

nope

We have the same problem on freebsd 7, amd64. At first we thought it came from the 5.1 version, so we downgraded to 5.0.67, but the problem still occurs. Currently the query_cache is enabled, and we're waiting for the next hang/crash. Next time we'll restart with query_cache OFF.  Do you know if this deadlock bug fix has been included in the 5.1.30 version?

Finally we managed to make it stable, for 24 hours now. query_cache is still enabled; what we changed is: concurrent_insert = 0, and we reduced table_cache to 256.

Stéphane, it would be interesting to run those tests on 5.0.68, as it contains a fix for a query cache deadlock. If the server deadlocks, please try to get a core file or backtrace from it.

Stéphane, be careful with that.  We turned off concurrent inserts and the probability of the deadlock seemed to go down but we still ran into one...you can read through everything we tried in this bug's comments.

I see the following in my.cnf uploaded:

query_cache_size        = 384M

Please, check if the problem is repeatable with disabled query cache.

We have not encountered this again with query_cache_type = 0
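
For anyone auditing several hosts for this setting, a small script can report whether a my.cnf leaves the query cache effectively enabled. This is only a sketch under stated assumptions: the helper names are hypothetical, options are assumed to be spelled with underscores inside a [mysqld] section, and the fallbacks (query_cache_size 0, query_cache_type 1) are what I believe the 5.0 server defaults to. It reads only the file, not the running server.

```python
import configparser


def parse_size(text):
    """Parse a my.cnf size such as '384M' into bytes (K, M, G suffixes)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def query_cache_enabled(cnf_text):
    """Return True if the [mysqld] section leaves the query cache usable."""
    # allow_no_value accepts bare flags such as skip-concurrent-insert.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return False  # assumed default: query_cache_size is 0
    section = parser["mysqld"]
    size = parse_size(section.get("query_cache_size", fallback="0"))
    qtype = section.get("query_cache_type", fallback="1").strip().upper()
    return size > 0 and qtype not in ("0", "OFF")


before = "[mysqld]\nquery_cache_size        = 384M\n"
after = before + "query_cache_type = 0\nskip-concurrent-insert\n"
print(query_cache_enabled(before), query_cache_enabled(after))  # True False
```

Setting query_cache_size to 0 would also make the function return False, which mirrors the two usual ways of switching the cache off.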

Then I am wondering if this can be related to http://bugs.mysql.com/bug.php?id=43758. Please check, for example, whether the SQL thread and/or other threads are in the 'freeing items' state when the hang happens.

After the permutations we went through, it does seem this was related to the query cache.  However, I can't say whether it is a duplicate of that other bug or not.  I did go through the post-mortem info we attached to this ticket and saw that one of them had the output of "show processlist".  The hung replication insert thread is in the "Connect" state.

Thank you for the feedback.

If you think this is related to the query cache, it would be good if you tested it with version 5.0.90, where one of the bugs was fixed, or even with 5.1.44, where both bugs are fixed. Please consider this possibility and let us know about the results.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".