MySQL Bugs: #10472: InnoDB deadlock in srv0srv.c causing server crash and restart

Bug #10472	InnoDB deadlock in srv0srv.c causing server crash and restart
Submitted:	9 May 2005 12:13	Modified:	13 Feb 2007 16:54
Reporter:	N.C. Rozemeijer	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S1 (Critical)
Version:	4.1.9-max-log	OS:	Linux (Red Hat Linux release 9, Windows XP SP2)
Assigned to:	Heikki Tuuri	CPU Architecture:	Any

Description:
Every now and then a dead-lock occurs in the InnoDB engine, causing a server crash and restart. This problem does not occur on specific queries but has occurred with different (typically very simple, straightforward) queries.

InnoDB first starts complaining about long semaphore waits:
/home/swmaint/mysql4.1/bin/mysqld: ready for connections.
Version: '4.1.9-max-log'  socket: '/home/tmcf/test/data/RDBMS/mysql.sock'  port: 3400  Official MySQL-max binary
InnoDB: Warning: a long semaphore wait:
--Thread 1326898480 has waited at ../include/trx0sys.ic line 101 for 242.00 seconds the semaphore:
X-lock on RW-latch at 0x41719728 created in file buf0buf.c line 469
a writer (thread id 1326898480) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/trx0sys.ic line 101
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 1326898480 has waited at ../include/trx0sys.ic line 101 for 274.00 seconds the semaphore:
X-lock on RW-latch at 0x41719728 created in file buf0buf.c line 469
a writer (thread id 1326898480) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/trx0sys.ic line 101
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 1326898480 has waited at ../include/trx0sys.ic line 101 for 306.00 seconds the semaphore:
X-lock on RW-latch at 0x41719728 created in file buf0buf.c line 469
a writer (thread id 1326898480) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/trx0sys.ic line 101
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
....

This goes on for some time, until finally it crashes:

....
InnoDB: Warning: a long semaphore wait:
--Thread 1326898480 has waited at ../include/trx0sys.ic line 101 for 787.00 seconds the semaphore:
X-lock on RW-latch at 0x41719728 created in file buf0buf.c line 469
a writer (thread id 1326898480) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/trx0sys.ic line 101
InnoDB: Warning: a long semaphore wait:
--Thread 1104305456 has waited at ../include/trx0sys.ic line 101 for 708.00 seconds the semaphore:
X-lock on RW-latch at 0x41719728 created in file buf0buf.c line 469
a writer (thread id 1326898480) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/trx0sys.ic line 101
InnoDB: Warning: a long semaphore wait:
--Thread 1326697776 has waited at srv0srv.c line 1732 for 707.00 seconds the semaphore:
Mutex at 0x40f18168 created file srv0srv.c line 798, lock var 1
waiters flag 1
InnoDB: Warning: a long semaphore wait:
--Thread 1345531184 has waited at trx0trx.c line 175 for 460.00 seconds the semaphore:
Mutex at 0x40f18168 created file srv0srv.c line 798, lock var 1
waiters flag 1
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
050509  8:54:32InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=2093056
max_used_connections=6
max_connections=100
threads_connected=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 410223 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=(nil)
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x4fb005e4, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x8135ed3
0x4004e7c8
(nil)
0x420de867
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 0
050509 08:54:32  mysqld restarted
050509  8:54:32  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
050509  8:54:33  InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 0 1628613796.
InnoDB: Doing recovery: scanned up to log sequence number 0 1628613830
InnoDB: Last MySQL binlog file position 0 4, file name ./omidb3-bin.000008
050509  8:54:33  InnoDB: Flushing modified pages from the buffer pool...
050509  8:54:33  InnoDB: Started; log sequence number 0 1628613830
/home/swmaint/mysql4.1/bin/mysqld: ready for connections.
Version: '4.1.9-max-log'  socket: '/home/tmcf/test/data/RDBMS/mysql.sock'  port: 3400  Official MySQL-max binary
....

MySQL is automatically restarted after which it can be used normally, but during the deadlock all queries block (including queries like "show innodb status"; mysql only queries like "show processlist" still work).

We have seen similar problems on 3.23.57 and were hoping that the upgrade to 4.1.9 would help, but this does not seem the case.

What is interesting is that by doing a grep on failure in the log file, the crash is always occuring on the same thread ID:
[production-tmcf@omidb3:~/test/data/RDBMS/data]$ grep "failure in thread" omidb3.omi.err
050322  9:39:17InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050329 22:46:04InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050422  6:36:29InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050423  5:13:40InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050502 16:49:11InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050507 22:13:36InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873
050509  8:54:32InnoDB: Assertion failure in thread 1336937776 in file srv0srv.c line 1873

We are running 2 MySQL instances on this machine. Both instances are affected by this problem. The other instance also crashes always in the same thread, but this is a different thread than the first instance:
[production-tmcf@omidb3:~/production/data/RDBMS/data]$ grep "failure in thread" omidb3.omi.err
050421 11:05:13InnoDB: Assertion failure in thread 1473735984 in file srv0srv.c line 1873
050503 13:20:30InnoDB: Assertion failure in thread 1473735984 in file srv0srv.c line 1873

Both instances are identical in terms of data model, mysql version etc. Only differences are in the contents of the records and there are some differences in the my.cnf related to resource (memory) usage.
 
Supporting info:
Hardware: Dell PowerEdge 2650 Dual Xeon 2GHz w/ HT, 2 GB Memory
Kernel: 2.4.20-31.9smp
MySQL is running in Master / Single Slave configuration.

How to repeat:
So far have not been able to isolate this bug.

Not enough information was provided for us to be able
to handle this bug. Please re-read the instructions at
http://bugs.mysql.com/how-to-report.php

If you can provide more information, feel free to add it
to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

Additional info:

The crash in the InnoDB master thread is intentional. If someone has been waiting for a lock for over 10 minutes, something is wrong, and the situation is corrected by restarting the server.

You should find the cause of this bug. Is the server heavily loaded during the crash? Are several queries trying to update the same rows (perhaps via cascaded foreign key operations)? CHECK TABLE is known to cause this, but there the timeout has been increased by 2 hours.

I have seen this problem several times and in each of these cases I could not find queries that would lock the same table or the same row.
Most cases there is a SELECT running on one table and an INSERT on another table that is not related in any way (not even through cascaded foreign keys).

Even if two queries would lock the same table, I would expect the dead-lock detection in InnoDB to rollback one of the queries and return a timeout after waiting for innodb_lock_wait_timeout. Or am I wrong there?

I have also not been able to trace the issue back to heavy system loads. Most occurrences that I have witnessed, there was hardly any load on the system.

What puzzles me most is that the crash always happens in the same threadID. I'm not sure how the threadID is determined (i.e. is this assigned by the O/S or thread library or is it an internal ID that InnoDB or MySQL administrates?). This ThreadID seems to be the constant factor in case of this problem, not the queries that are running at the time.

Yes, it's always the master thread that crashes: it is an intentional crash when the master thread notices that a lock wait has lasted for over ten minutes. It is not necessarily caused by a row lock or a table lock. The lock waiting code is also used for some InnoDB internal mutexes related to memory-based data structures, such as the data dictionary.

The bug could be caused by InnoDB forgetting to release a mutex. Database related locks should be automatically released at transaction commit or rollback. Have you perhaps forgot a transaction open for a long time? Can you send us the full lock monitor output? Also, could you enable innodb_status_file and send a copy of the file when it crashes next time? Could you perhaps also enable query logging?

* Have attached the full (unedited) error log with monitor info.
* Have restarted the server with innodb_status_file enabled; will send the status file next time the server crashes.
* I'm reluctant to start with full query logging since the server is running at about 20 to 30 queries per second average and I don't think I have sufficient disk space to log all these queries for several days.
* We are running almost all transactions with autocommit=1. Actually we never set autocommit=0. There are only a very few transactions of a couple of statements and for these we always use begin...commit or begin...rollback. I therefore think that it is unlikely that there are (unintentionally) very long and / or large transactions. I have never noticed anything like this in our logs. Also, most processes that are accessing the database are short-life, i.e. couple of seconds to a few minutes. We are also running a couple of daemons, but these are 'polling' and connect and disconnect to the server for each poll cycle. Having a transaction stuck there (i.e. missing a commit or rollback for a long time) seems unlikely too.

Contents of the innodb_status logfile when a crash occurs:

=====================================
050523  7:46:21 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 16 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 105906, signal count 104408
Mutex spin waits 41474103, rounds 28658527, OS waits 38158
RW-shared spins 57613, OS waits 15404; RW-excl spins 30984, OS waits 2031
------------
TRANSACTIONS
------------
Trx id counter 0 25018760
Purge done for trx's n:o < 0 25018380 undo n:o < 0 0
History list length 16
Total number of lock structs in row lock hash table 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 25018759, not started, process no 19040, OS thread id 1105022256
MySQL thread id 1568809, query id 16482525 localhost tmcfdb
---TRANSACTION 0 25018590, not started, process no 19040, OS thread id 1104821552
MySQL thread id 1568807, query id 16481827 localhost tmcfdb
---TRANSACTION 0 25018360, not started, process no 19040, OS thread id 1474338096
MySQL thread id 1568804, query id 16480966 localhost tmcfdb
---TRANSACTION 0 25018176, not started, process no 19040, OS thread id 1106705712
MySQL thread id 1568800, query id 16480211 localhost tmcfdb
---TRANSACTION 0 25018612, not started, process no 19040, OS thread id 1106906416
MySQL thread id 1568415, query id 16482043 localhost tmcfdb
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread) ev set
Pending normal aio reads: 0, aio writes: 17,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
11197 OS file reads, 2549614 OS file writes, 112551 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 1.19 writes/s, 0.19 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf for space 0: size 1, free list len 0, seg size 2, is empty
Ibuf for space 0: size 1, free list len 0, seg size 2,
567 inserts, 568 merged recs, 408 merges
Hash table size 1106407, used cells 303047, node heap has 348 buffer(s)
12322.23 hash searches/s, 374.91 non-hash searches/s
---
LOG
---
Log sequence number 0 1734031831
Log flushed up to   0 1734031831
Last checkpoint at  0 1734029649
0 pending log writes, 0 pending chkp writes
2075905 log i/o's done, 1.12 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 303985821; in additional pool allocated 2292608
Buffer pool size   16384
Free buffers       976
Database pages     15060
Modified db pages  17
Pending reads 0
Pending writes: LRU 0, flush list 17, single page 0
Pages read 12344, created 2716, written 596445
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
Main thread process no. 19040, id 1487379760, state: flushing buffer pool pages
Number of rows inserted 380029, updated 1893624, deleted 0, read 1682634038
1.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 11368.79 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

Hi!

Looks like file reads or writes end up in a pending state.

This might be memory corruption, bad hardware, an OS bug, or an unknown InnoDB bug.

I am putting this bug report to 'Can't repeat' state until we get more similar reports.

Regards,

Heikki

FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread) ev set
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 57, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
983 OS file reads, 23 OS file writes, 19 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 1, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
3529 OS file reads, 757225 OS file writes, 29313 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

Pending normal aio reads: 1, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
3591 OS file reads, 6750 OS file writes, 23 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

I am seeing this same bug using MySQL 4.1.12 on FreeBSD 5.4 with libthr.  I believe libpthread manifests the same bug but with much more annoying symptoms (140k context switches/sec as it continually tries to find a runnable thread).

Hi!

Please post a few outputs of the InnoDB monitor during the hang.

Regards,

Heikki

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

had the same problem on redhat linux 9 kernel 2.4.20-30-smp
mysql version: 4.0.24-max
one clue: just before this problem happened (multiple times, within 1-2 days intervals), I enabled query cache. trying to remove query cache and see if it
goes away.

Below are info from db error file and innodb status file. Note that the status file is abt 14 mins before the err file dump... I thought the status file is updated every 15 seconds?

********** db error file extract..

InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 1500592944 has waited at ../../innobase/trx/../include/trx0sys.ic line 101 for 786.00 seconds the semaphore:
X-lock on RW-latch at 0x41bda560 created in file buf0buf.c line 436
a writer (thread id 1500592944) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 532
Last time write locked in file ../../innobase/trx/../include/trx0sys.ic line 101
InnoDB: Warning: a long semaphore wait:
--Thread 1317522224 has waited at trx0trx.c line 608 for 786.00 seconds the semaphore:
Mutex at 0x43136168 created file srv0srv.c line 765, lock var 1
waiters flag 1
InnoDB: Warning: a long semaphore wait:
--Thread 1501596464 has waited at trx0trx.c line 608 for 785.00 seconds the semaphore:
Mutex at 0x43136168 created file srv0srv.c line 765, lock var 1
waiters flag 1
... similar lines for a while
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
050810 22:50:31InnoDB: Assertion failure in thread 1329539888 in file srv0srv.c line 1795
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=134217728
read_buffer_size=1044480
max_used_connections=170
max_connections=400
threads_connected=96
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 948668 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=(nil)
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x4f3f2734, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x80daa90
0x400498f8
(nil)
0x420df147
New value of fp=(nil) failed sanity check, terminating stack trace!

****** inndo status file output (just b4 it crashed) *******
=====================================
050810 22:36:55 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 16 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 9243, signal count 8689
Mutex spin waits 12591, rounds 104741, OS waits 4425
RW-shared spins 9275, OS waits 4637; RW-excl spins 119, OS waits 116
------------
TRANSACTIONS
------------
Trx id counter 0 973758973
Purge done for trx's n:o < 0 973737703 undo n:o < 0 0
Total number of lock structs in row lock hash table 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 973758837, not started, process no 3292, OS thread id 1479580464
MySQL thread id 12185, query id 2892184 [ip details removed]
---TRANSACTION 0 973758807, not started, process no 3292, OS thread id 1478777648
MySQL thread id 12186, query id 2892149 [ip details removed]
---TRANSACTION 0 973758805, not started, process no 3292, OS thread id 1495374640
MySQL thread id 12184, query id 2892245 [ip details removed]
.... many more similar items ....
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread) ev set
Pending normal aio reads: 0, aio writes: 1,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
6727 OS file reads, 21228 OS file writes, 19031 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf for space 0: size 1, free list len 17, seg size 19,
248 inserts, 248 merged recs, 248 merges
Hash table size 553253, used cells 281252, node heap has 386 buffer(s)
1.81 hash searches/s, 6.25 non-hash searches/s
---
LOG
---
Log sequence number 0 1048848988
Log flushed up to   0 1048848988
Last checkpoint at  0 1048848978
0 pending log writes, 0 pending chkp writes
9689 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 179169764; in additional pool allocated 1965440
Buffer pool size   8192
Free buffers       865
Database pages     6941
Modified db pages  1
Pending reads 0
Pending writes: LRU 0, flush list 1, single page 0
Pages read 6936, created 5, written 7044
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
Main thread process no. 3292, id 1337932592, state: flushing buffer pool pages
Number of rows inserted 131, updated 49, deleted 121, read 611929
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 2.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

I have the same problem, and the log file looks very much like the other logs in this bug report.

We run our usual SQL queries but at some point mysql hung. It did not crash right away, but it was just stuck at a simple SQL involving inserting into a table; after 10 hours I connected to the server to see the status and approx. 10 minutes later it crashed.

Final part of the log file:

InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 1306241840 has waited at ../include/btr0btr.ic line 28 for 787.00 seconds the semaphore:
X-lock on RW-latch at 0x402b6748 created in file buf0buf.c line 469
a writer (thread id 1306241840) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/btr0btr.ic line 28
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

=====================================
060915 10:44:21 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 32 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 18, signal count 17
--Thread 1306241840 has waited at ../include/btr0btr.ic line 28 for 802.00 seconds the semaphore:
X-lock on RW-latch at 0x402b6748 created in file buf0buf.c line 469
a writer (thread id 1306241840) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1
Last time read locked in file buf0flu.c line 562
Last time write locked in file ../include/btr0btr.ic line 28
Mutex spin waits 17, rounds 180, OS waits 9
RW-shared spins 16, OS waits 8; RW-excl spins 1, OS waits 1
------------
TRANSACTIONS
------------
Trx id counter 0 1101
Purge done for trx's n:o < 0 0 undo n:o < 0 0
History list length 119
Total number of lock structs in row lock hash table 410
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 1100, not started, process no 16377, OS thread id 1306643248
MySQL thread id 17, query id 666 aristovski01.klocwork.com 10.0.1.45 root
---TRANSACTION 0 0, not started, process no 16377, OS thread id 1306241840 updating table statistics
mysql tables in use 1, locked 1
MySQL thread id 14, query id 592 aristovski01.klocwork.com 10.0.1.45 checking
SHOW TABLE STATUS
---TRANSACTION 0 0, not started, process no 16377, OS thread id 1107139376
MySQL thread id 12, query id 589 aristovski01.klocwork.com 10.0.1.45 checking
---TRANSACTION 0 0, not started, process no 16377, OS thread id 1106938672
MySQL thread id 10, query id 543 aristovski01.klocwork.com 10.0.1.45 checking
---TRANSACTION 0 1097, ACTIVE 56772 sec, process no 16377, OS thread id 1105156912 inserting, thread declared inside InnoDB 371
mysql tables in use 1, locked 1
415 lock struct(s), heap size 44352, undo log entries 36146
MySQL thread id 9, query id 455 devlx90c1 10.0.0.145  Sending data
insert into problem_arg ( upid, bld_id, arg_no, arg_info) select pdt.upid, @build_id, pat.arg_no, pat.arg_info from CEMSR1400_nortel__shared.problem_dat as pdt, CEMSR1400_nortel__shared.problem_arg_dat as pat where pat.prob_id = pdt.prob_id
Trx read view will not see trx with id >= 0 1098, sees < 0 1098
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread) ev set
Pending normal aio reads: 0, aio writes: 128,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
0 OS file reads, 586 OS file writes, 585 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf for space 0: size 1, free list len 0, seg size 2, is empty
Ibuf for space 0: size 1, free list len 0, seg size 2,
0 inserts, 0 merged recs, 0 merges
Hash table size 553253, used cells 21218, node heap has 22 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 0 6440496
Log flushed up to   0 6440496
Last checkpoint at  0 43634
0 pending log writes, 0 pending chkp writes
202 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 148886906; in additional pool allocated 1048576
Buffer pool size   8192
Free buffers       7444
Database pages     724
Modified db pages  561
Pending reads 0
Pending writes: LRU 0, flush list 129, single page 0
Pages read 0, created 877, written 192
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
Main thread process no. 16377, id 1100888880, state: waiting for server activity
Number of rows inserted 37485, updated 0, deleted 0, read 10445
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
060915 10:44:36InnoDB: Assertion failure in thread 1096670000 in file srv0srv.c line 1873
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
InnoDB: about forcing recovery.
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=33554432
read_buffer_size=520192
max_used_connections=6
max_connections=100
threads_connected=6
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 288367 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=(nil)
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x415dd724, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x8135b13
0x4005f8f8
(nil)
0x420df147
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do 
resolve it
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Upgrading our O/S to Red Hat Enterprise 4 has solved this issue for us.
It looks like a compatibility issue between MySQL and RH9.

Aleksandar,

what Linux distro you were running?

Regards,

Heikki

Heikki, the crash happened on RH9 and today it happened on Windows XP SP2 (exactly the same symptomps). 

mysql version is 4.1.10

Aleksandar,

below is the hang you reported from Windows:

>=====================================
>> 060918 17:36:36 INNODB MONITOR OUTPUT
>> =====================================
>> Per second averages calculated from the last 16 seconds
>> ----------
>> SEMAPHORES
>> ----------
>> OS WAIT ARRAY INFO: reservation count 623, signal count 615
>> --Thread 2856 has waited at

<cut>mysqldev\build\mysql-4.1.10-build\mysql-4.1.10\innobase\buf\buf0flu.c line 390 for 811.00 seconds the semaphore:

>> Mutex at 02FB6C80 created file

<cut>mysql-4.1.10-build\mysql-4.1.10\innobase\trx\trx0sys.c line 103, lock var 1

>> waiters flag 1
>> Mutex spin waits 505, rounds 2777, OS waits 58
>> RW-shared spins 802, OS waits 401; RW-excl spins 152, OS waits 151

but the hang on Linux happened on a different semaphore.

My guess is that the Linux hangs are due to some bug in the Linux distro. The Windows hang is on this semaphore:

trx0sys.c:
/********************************************************************
Creates or initialializes the doublewrite buffer at a database start. */
static
void
trx_doublewrite_init(
/*=================*/
        byte*   doublewrite)    /* in: pointer to the doublewrite buf
                                header on trx sys page */
{
        trx_doublewrite = mem_alloc(sizeof(trx_doublewrite_t));

        /* Since we now start to use the doublewrite buffer, no need to call
        fsync() after every write to a data file */
#ifdef UNIV_DO_FLUSH
        os_do_not_call_flush_at_each_write = TRUE;
#endif /* UNIV_DO_FLUSH */

        mutex_create(&(trx_doublewrite->mutex));

I have never seen a report of a hang on this particular semaphore.

This still remains a mystery. You may try an upgrade to the latest 4.1.xx.

Regards,

Heikki

I am currently having this problem on Fedora Core 1

Putting the status to 'Can't repeat' until we get more reports of the hang on trx_doublewrite->mutex. I could not find any obvious bug from buf0flu.c in 4.1.16.

The mutex is reserved during calls to fil I/O and flush. An operating system hang in those I/O operations would explain the hang, but Windows has usually given an OS error if it fails, it has not hung. Thus, this remains a mystery.

The hang on Linux may be one of the many OS problems people have reported over years.

Putting the status again to 'Can't repeat' because there is no common factor in the various hang reports that have accumulated here.

Some of the problems may be due to OS bugs or faulty hardware, or to InnoDB table corruption.

Hi even we are getting the same errors with mysql 4.1.21 on RHEL4.
its happening once in every 2 days..

please find the error log report..

key_buffer_size=402653184
read_buffer_size=2093056
max_used_connections=134
max_connections=250
threads_connected=19
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 1416214 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0xa4d74f08
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xa4f91bf4, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x812b880
0x400397c8
(nil)
0x8126c83
0x815b222
0x8181169
0x8142d30
0x81445dd
0x813dcab
0x813d946
0x813d11f
0x40034332
0x420de867
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x9bfda70 = UPDATE Test_table TC, test_cc_ad_info_txt EX SET EX.test_click_no=UC.no, EX.temp_no=NULL WHERE UC.temp_no IS NOT NULL AND UC.temp_no=EX.temp_no
thd->thread_id=1878119
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 0
070615 01:07:54  mysqld restarted
070615  1:08:01  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
070615  1:08:02  InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 119 2107948288.
InnoDB: Doing recovery: scanned up to log sequence number 119 2107949309
070615  1:08:03  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 299139144, file name server3-bin.000176
InnoDB: Last MySQL binlog file position 0 89345055, file name ./server-bin.000799
070615  1:08:04  InnoDB: Flushing modified pages from the buffer pool...
070615  1:08:04  InnoDB: Started; log sequence number 119 2107949309
070615  1:08:04 [Warning] mysql.user table is not updated to new password format; Disabling new password usage until mysql_fix_privilege_tables is run
/usr/sbin/mysqld: ready for connections.

please let us know how to proceed.

Regards,
krishna

Hi, even we are getting the same errors with mysql 4.1.21 on RHEL4.
its happening once in every 2 days..

please find the error log report..

key_buffer_size=402653184
read_buffer_size=2093056
max_used_connections=134
max_connections=250
threads_connected=19
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 1416214 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0xa4d74f08
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xa4f91bf4, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x812b880
0x400397c8
(nil)
0x8126c83
0x815b222
0x8181169
0x8142d30
0x81445dd
0x813dcab
0x813d946
0x813d11f
0x40034332
0x420de867
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x9bfda70 = UPDATE Test_table TC, test_cc_ad_info_txt EX SET EX.test_click_no=UC.no, EX.temp_no=NULL WHERE UC.temp_no IS NOT NULL AND UC.temp_no=EX.temp_no
thd->thread_id=1878119
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 0
070615 01:07:54  mysqld restarted
070615  1:08:01  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
070615  1:08:02  InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 119 2107948288.
InnoDB: Doing recovery: scanned up to log sequence number 119 2107949309
070615  1:08:03  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 299139144, file name server3-bin.000176
InnoDB: Last MySQL binlog file position 0 89345055, file name ./server-bin.000799
070615  1:08:04  InnoDB: Flushing modified pages from the buffer pool...
070615  1:08:04  InnoDB: Started; log sequence number 119 2107949309
070615  1:08:04 [Warning] mysql.user table is not updated to new password format; Disabling new password usage until mysql_fix_privilege_tables is run
/usr/sbin/mysqld: ready for connections.

please let us know how to proceed.

Regards,
krishna

Just to share on our experience, the issue went away when we moved to mysql 5.x, and added more memory to the system. Unfortunately, we are not conclusive which of the above changes fixed the issue.