| Bug #33779 | mysql hangs very badly, bringing the machine down | | |
|---|---|---|---|
| Submitted: | 9 Jan 2008 18:18 | Modified: | 8 May 2008 11:09 |
| Reporter: | Philip Stoev | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Falcon storage engine | Severity: | S3 (Non-critical) |
| Version: | 6.0.4 | OS: | Any |
| Assigned to: | Philip Stoev | CPU Architecture: | Any |
[9 Jan 2008 18:18]
Philip Stoev
[16 Jan 2008 22:25]
Philip Stoev
Setting to verified so that it comes up in the reports of unsolved bugs.
[16 Jan 2008 22:28]
Philip Stoev
Stack trace files for bug #33779
Attachment: bug33779.trace (text/plain), 471.10 KiB.
[16 Jan 2008 22:34]
Philip Stoev
I just attached gdb output of "bt" and "bt full" for all threads, taken less than 1 minute before the machine hung. The output is produced by a simple tool that attaches to mysqld every minute and runs gdb commands. Can you please take a look at the output and let me know if anything rings a bell? If not, I will expand the tool to collect SHOW PROCESSLIST and INFORMATION_SCHEMA output, along with more information from gdb. The bad news is that with the debug binary the issue happened after running the test for less than one hour.
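For reference, a minimal sketch of what such a scraper might look like, assuming gdb is on PATH, mysqld writes its PID to /var/run/mysqld/mysqld.pid (a hypothetical path; adjust for your installation), and a once-a-minute interval. This is not the actual tool used for the attached trace:

```python
#!/usr/bin/env python3
"""Periodically attach gdb to mysqld and dump per-thread backtraces."""
import subprocess
import time
from datetime import datetime

PID_FILE = "/var/run/mysqld/mysqld.pid"   # assumed location of the PID file
INTERVAL = 60                              # scrape once a minute

def scrape(pid: int) -> str:
    # Attach in batch mode, dump "bt" and "bt full" for every thread, then detach.
    cmd = [
        "gdb", "-batch", "-p", str(pid),
        "-ex", "thread apply all bt",
        "-ex", "thread apply all bt full",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    while True:
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        with open(f"bug33779-{stamp}.trace", "w") as out:
            out.write(scrape(pid))
        time.sleep(INTERVAL)
```

Each pass writes a timestamped trace file comparable to the bug33779.trace attachment above.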
[17 Jan 2008 22:31]
Philip Stoev
Philip, please provide a couple more scrapes, an information_schema dump, and the SHOW PROCESSLIST output.
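A minimal sketch for collecting that output, assuming the mysql command-line client is on PATH and can log in locally; which INFORMATION_SCHEMA tables are wanted isn't specified above, so INFORMATION_SCHEMA.PROCESSLIST is used purely as a stand-in:

```python
import subprocess

def dump(query: str, outfile: str) -> None:
    # Run the query through the mysql client and save the tabular output.
    # Login options (user, socket, password) are assumptions; adjust as needed.
    with open(outfile, "w") as out:
        subprocess.run(["mysql", "-uroot", "-e", query], stdout=out, check=True)

dump("SHOW FULL PROCESSLIST", "processlist.txt")
dump("SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST", "is_processlist.txt")
```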
[19 Jan 2008 18:15]
Philip Stoev
Bug also happens on SuSE on the same machine.
[8 May 2008 11:09]
Philip Stoev
This was traced to a kernel bug/feature; here is the response from Red Hat:

<quote>
The VM on this box is livelocked. Although there is currently free memory, at some point pdflush (which frees memory) failed to allocate memory it needed for a journal update. Since that journal is locked, anything that causes a metadata write, such as an atime update from 'ls', is hanging. It's quite possible that pdflush is regularly recovering from this situation, only to be forced right back into it by the system load, causing it to appear functionally dead, though it is still making some progress. If the application workload stopped hogging so much CPU and memory, the system would be able to fully recover, though it could potentially take hours after the load is removed.

A livelock may or may not be considered a bug, depending on how far out of the way an application or administrator has to go to trigger it. It's possible that this is in fact a deadlock, which would clearly be a bug. The data we have strongly indicate that this is a livelock that can be fixed by tuning, but if it happens again, please capture a vmcore, as this will allow us to conclusively analyze the system state and determine if there's something we need to do beyond simply tuning the system.

On the tuning front, there are several things that can be done to keep this from happening:

1) noatime
The noatime mount option prevents reads of files and directories from updating the inode access time. Updating the access time triggers a journal commit, which can cause journal contention. Most of the uninterruptible processes on this system are sleeping due to journal contention, which is preventing the system from freeing write buffers.

2) Lower vm.dirty_ratio
The system can't free memory because it filled up too much buffer memory, and couldn't allocate the journal handle to commit it to storage. Lowering this ratio will cause the system to flush dirty buffers earlier, reducing the memory pressure they impose. On server-grade storage, you can usually lower this sysctl all the way to 1 without substantially harming throughput.

3) Stop mlocking everything
This system isn't using any swap, though it desperately wants to, because it doesn't have any inactive swappable memory. I suspect that this is due to mysql using mlock() aggressively to improve performance. I'm not familiar with this tuning option in mysql, but I know it's a non-default option. If you can control how much memory mysql mlocks, lower that value. If it's simply an on/off switch, consider turning it off, or try the other tuning options first if you don't want to risk the performance hit this would cause.

* Forgot to mention: The atime property is generally only used for cleaning up temporary files in /tmp and /var/tmp. It is generally fine to mount any other filesystem with the noatime option. If the data set for this workload resides on its own filesystem, just mounting that filesystem with noatime should suffice to reduce journal contention.
</quote>
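As a side note, a small diagnostic along these lines can verify points 1) and 2) from the quote before rerunning the test. The procfs paths are standard on Linux, but the ext-only filesystem filter is an assumption on my part, not part of the Red Hat response:

```python
def dirty_ratio() -> int:
    # Current vm.dirty_ratio, as exposed by procfs.
    with open("/proc/sys/vm/dirty_ratio") as f:
        return int(f.read())

def mounts_without_noatime():
    # List ext2/ext3/ext4 mount points that are not mounted with noatime.
    result = []
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options, *_ = line.split()
            if fstype.startswith("ext") and "noatime" not in options.split(","):
                result.append(mountpoint)
    return result

if __name__ == "__main__":
    print("vm.dirty_ratio =", dirty_ratio())
    for mp in mounts_without_noatime():
        print("mounted without noatime:", mp)
```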