| Bug #33779 | mysql hangs very badly, bringing the machine down | | |
|---|---|---|---|
| Submitted: | 9 Jan 2008 18:18 | Modified: | 8 May 2008 11:09 |
| Reporter: | Philip Stoev | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Falcon storage engine | Severity: | S3 (Non-critical) |
| Version: | 6.0.4 | OS: | Any |
| Assigned to: | Philip Stoev | CPU Architecture: | Any |
[9 Jan 2008 18:18]
Philip Stoev
[16 Jan 2008 22:25]
Philip Stoev
Setting to verified so that it comes up in the reports of unsolved bugs.
[16 Jan 2008 22:28]
Philip Stoev
Stack trace files for bug #33779
Attachment: bug33779.trace (text/plain), 471.10 KiB.
[16 Jan 2008 22:34]
Philip Stoev
I just attached gdb output of "bt" and "bt full" for all threads, taken less than 1 minute before the machine hung. The output is produced by a simple tool that attaches to mysqld every minute and runs gdb commands. Can you please take a look at the output and let me know if anything rings a bell? If not, I will expand the tool to collect SHOW PROCESSLIST and INFORMATION_SCHEMA output, along with more information from gdb. The bad news is that with the debug binary the issue happened after running the test for less than one hour.
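For reference, a minimal sketch of what such a scraper might look like, assuming gdb is on PATH, mysqld writes its PID to /var/run/mysqld/mysqld.pid (a hypothetical path; adjust for your installation), and a once-a-minute interval. This is not the actual tool used for the attached trace:

```python
#!/usr/bin/env python3
"""Periodically attach gdb to mysqld and dump per-thread backtraces."""
import subprocess
import time
from datetime import datetime

PID_FILE = "/var/run/mysqld/mysqld.pid"   # assumed location of the PID file
INTERVAL = 60                              # scrape once a minute

def scrape(pid: int) -> str:
    # Attach in batch mode, dump "bt" and "bt full" for every thread, then detach.
    cmd = [
        "gdb", "-batch", "-p", str(pid),
        "-ex", "thread apply all bt",
        "-ex", "thread apply all bt full",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    while True:
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        with open(f"bug33779-{stamp}.trace", "w") as out:
            out.write(scrape(pid))
        time.sleep(INTERVAL)
```

Each pass writes a timestamped trace file comparable to the bug33779.trace attachment above.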
[17 Jan 2008 22:31]
Philip Stoev
Philip, please provide a couple more scrapes, an information_schema dump, and the SHOW PROCESSLIST output.
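A minimal sketch for collecting that output, assuming the mysql command-line client is on PATH and can log in locally; which INFORMATION_SCHEMA tables are wanted isn't specified above, so INFORMATION_SCHEMA.PROCESSLIST is used purely as a stand-in:

```python
import subprocess

def dump(query: str, outfile: str) -> None:
    # Run the query through the mysql client and save the tabular output.
    # Login options (user, socket, password) are assumptions; adjust as needed.
    with open(outfile, "w") as out:
        subprocess.run(["mysql", "-uroot", "-e", query], stdout=out, check=True)

dump("SHOW FULL PROCESSLIST", "processlist.txt")
dump("SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST", "is_processlist.txt")
```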
[19 Jan 2008 18:15]
Philip Stoev
Bug also happens on SuSE on the same machine.
[8 May 2008 11:09]
Philip Stoev
This was traced to a kernel bug/feature; here is the response from Red Hat:

<quote>
The VM on this box is livelocked. Although there is currently free memory, at some point pdflush (which frees memory) failed to allocate memory it needed for a journal update. Since that journal is locked, anything that causes a metadata write, such as an atime update from 'ls', is hanging. It's quite possible that pdflush is regularly recovering from this situation, only to be forced right back into it by the system load, causing it to appear functionally dead, though it is still making some progress. If the application workload stopped hogging so much CPU and memory, the system would be able to fully recover, though it could potentially take hours after the load is removed.

A livelock may or may not be considered a bug, depending on how far out of the way an application or administrator has to go to trigger it. It's possible that this is in fact a deadlock, which would clearly be a bug. The data we have strongly indicate that this is a livelock that can be fixed by tuning, but if it happens again, please capture a vmcore, as this will allow us to conclusively analyze the system state and determine if there's something we need to do beyond simply tuning the system.

On the tuning front, there are several things that can be done to keep this from happening:

1) noatime
The noatime mount option prevents reads of files and directories from updating the inode access time. Updating the access time triggers a journal commit, which can cause journal contention. Most of the uninterruptible processes on this system are sleeping due to journal contention, which is preventing the system from freeing write buffers.

2) Lower vm.dirty_ratio
The system can't free memory because it filled up too much buffer memory, and couldn't allocate the journal handle to commit it to storage. Lowering this ratio will cause the system to flush dirty buffers earlier, reducing the memory pressure they impose. On server-grade storage, you can usually lower this sysctl all the way to 1 without substantially harming throughput.

3) Stop mlocking everything
This system isn't using any swap, though it desperately wants to, because it doesn't have any inactive swappable memory. I suspect that this is due to mysql using mlock() aggressively to improve performance. I'm not familiar with this tuning option in mysql, but I know it's a non-default option. If you can control how much memory mysql mlocks, lower that value. If it's simply an on/off switch, consider turning it off, or try the other tuning options first if you don't want to risk the performance hit this would cause.

* Forgot to mention: The atime property is generally only used for cleaning up temporary files in /tmp and /var/tmp. It is generally fine to mount any other filesystem with the noatime option. If the data set for this workload resides on its own filesystem, just mounting that filesystem with noatime should suffice to reduce journal contention.
</quote>
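As a side note, a small diagnostic along these lines can verify points 1) and 2) from the quote before rerunning the test. The procfs paths are standard on Linux, but the ext-only filesystem filter is an assumption on my part, not part of the Red Hat response:

```python
def dirty_ratio() -> int:
    # Current vm.dirty_ratio, as exposed by procfs.
    with open("/proc/sys/vm/dirty_ratio") as f:
        return int(f.read())

def mounts_without_noatime():
    # List ext2/ext3/ext4 mount points that are not mounted with noatime.
    result = []
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options, *_ = line.split()
            if fstype.startswith("ext") and "noatime" not in options.split(","):
                result.append(mountpoint)
    return result

if __name__ == "__main__":
    print("vm.dirty_ratio =", dirty_ratio())
    for mp in mounts_without_noatime():
        print("mounted without noatime:", mp)
```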