MySQL Bugs: #11783: Server goes into mode "D" in ps under linux (using 99% proc)

Bug #11783	Server goes into mode "D" in ps under linux (using 99% proc)
Submitted:	6 Jul 2005 20:01	Modified:	15 Jul 2005 8:19
Reporter:	Trey Stout
Status:	Not a Bug
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1.12-log	OS:	Linux (Gentoo Linux (2005.0))
Assigned to:		CPU Architecture:	Any

Description:
Unfortunately this is one of those problems that has been very hard to track down. Here is the basic rundown...

Our application is distributed across seven servers. The application on each node can access data on any other node as well as its own. After running for anywhere from 5 minutes to 5 days, mysqld takes up 99% CPU (as reported by `top`) and goes into state "D", which the `ps` man page describes as "Uninterruptible sleep (usually IO)".

At this point not even a kill -9 will stop mysqld. The server accepts no new connections, either by network or by pipe. The only solution we have is to reboot the box so that mysql will restart.
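To confirm which processes are stuck in state "D" without relying on `ps`, the state letter can be read straight from /proc. The following is a minimal sketch, assuming a Linux /proc layout; the function names `parse_stat_state` and `uninterruptible_processes` are made up for illustration and are not part of any MySQL or kernel tooling.

```python
from pathlib import Path
from typing import List, Tuple


def parse_stat_state(stat_line: str) -> Tuple[int, str, str]:
    """Return (pid, command name, state letter) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so split on the first "(" and the last ")".
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    # The state letter is the first field after the command name.
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, command name) for every process currently in state "D"."""
    found = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            line = (entry / "stat").read_text(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        pid, comm, state = parse_stat_state(line)
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process that shows up here on every run over a long period is blocked inside the kernel, which is why signals, including SIGKILL, have no effect on it.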

Some basic info (each node is identical):
Dual Xeon 3.2GHz, hyperthreading on (4 logical processors)
4GB RAM
Kernel 2.6.12-rc4-mm1 SMP
MySQL 4.1.12-log
mysql data directory is formatted as Reiser4
All tables are MyISAM
Hardware RAID 0 SATA 300GB Drives.

At the point where mysql hangs, the kernel prints out the following message...

Jul  6 01:41:25 [kernel] Unable to handle kernel NULL pointer dereference at virtual address 00000003
Jul  6 01:41:25 [kernel]  printing eip:
Jul  6 01:41:25 [kernel] c01dbc12
Jul  6 01:41:25 [kernel] *pde = 00000000
Jul  6 01:41:25 [kernel] Oops: 0002 [#1]
Jul  6 01:41:25 [kernel] PREEMPT SMP
Jul  6 01:41:25 [kernel] Modules linked in:
Jul  6 01:41:25 [kernel] CPU:    0
Jul  6 01:41:25 [kernel] EIP:    0060:[<c01dbc12>]    Not tainted VLI
Jul  6 01:41:25 [kernel] EFLAGS: 00010202   (2.6.12-rc4-mm1)
Jul  6 01:41:25 [kernel] EIP is at lock_object+0x52/0x80
Jul  6 01:41:25 [kernel] eax: cdb07e9c   ebx: 00000003   ecx: ef79be88   edx: ef79be9c
Jul  6 01:41:25 [kernel] esi: cbc06380   edi: ef79bed0   ebp: ef79bed0   esp: ef79bb40
Jul  6 01:41:25 [kernel] ds: 007b   es: 007b   ss: 0068
Jul  6 01:41:25 [kernel] Process mysqld (pid: 29268, threadinfo=ef79a000 task=e133da70)
Jul  6 01:41:25 [kernel] Stack: 00000000 cbc06380 00000000 c01dc098 ef79bed0 cbc063b4 00000000 cbc063f0
Jul  6 01:41:25 [kernel]        ef79bed0 c01dc155 ef79bed0 00000000 00000000 00000001 00000000 00000000
Jul  6 01:41:25 [kernel]        ef79bed0 cbc06380 cbc063f0 c01dc400 ef79bed0 00000000 ffffffff ffffffff
Jul  6 01:41:25 [kernel] Call Trace:
Jul  6 01:41:25 [kernel]  [<c01dc098>] lock_tail+0x68/0x80
Jul  6 01:41:25 [kernel]  [<c01dc155>] longterm_lock_tryfast+0xa5/0xd0
Jul  6 01:41:25 [kernel]  [<c01dc400>] longterm_lock_znode+0x280/0x2a0
Jul  6 01:41:25 [kernel]  [<c01ecd57>] cbk_cache_scan_slots+0x147/0x2f0
Jul  6 01:41:25 [kernel]  [<c01ecf3b>] cbk_cache_search+0x3b/0x60
Jul  6 01:41:25 [kernel]  [<c01ebbf3>] coord_by_handle+0x13/0x40
Jul  6 01:41:25 [kernel]  [<c01ebbac>] object_lookup+0xbc/0xf0
Jul  6 01:41:25 [kernel]  [<c021ee82>] find_file_item+0x122/0x1c0
Jul  6 01:41:25 [kernel]  [<c0220dce>] read_file+0xfe/0x340
Jul  6 01:41:25 [kernel]  [<c0212430>] read_extent+0x0/0x210
Jul  6 01:41:25 [kernel]  [<c01e202e>] txn_begin+0x1e/0x30
Jul  6 01:41:25 [kernel]  [<c022123e>] read_unix_file+0x21e/0x340
Jul  6 01:41:25 [kernel]  [<c0324d74>] as_dispatch_request+0x164/0x300
Jul  6 01:41:25 [kernel]  [<c03765e0>] tw_scsi_queue+0x170/0x1f0
Jul  6 01:41:25 [kernel]  [<c01df0dc>] init_context+0x6c/0xa0
Jul  6 01:41:25 [kernel]  [<c01f56e7>] reiser4_read+0x77/0xc0
Jul  6 01:41:25 [kernel]  [<c01677c6>] vfs_read+0xb6/0x180
Jul  6 01:41:25 [kernel]  [<c0167c88>] sys_pread64+0x88/0x90
Jul  6 01:41:25 [kernel]  [<c0103115>] syscall_call+0x7/0xb
Jul  6 01:41:25 [kernel] Code: 51 0c 89 79 04 89 71 08 8b 58 04 89 41 0c 89 50 04 89 13 89 5a 04 8d 51 14 ff 47 18 8b 86 84 00 00 00 8b 58 04 89 41 14 89 50 04 <89> 13 89 5a 04 c7 01 00 00 00 00 8b 4f 0c 85 c9 74 03 ff 46 7c
Jul  6 01:41:25 [kernel]  <6>note: mysqld[29268] exited with preempt_count 1

How to repeat:
Not too sure. It probably has something to do with disk IO, since the kernel puts the process in state "D".

Suggested fix:
Make it not hang on IO? I hate to presume to know more than you guys :D

Is there any more info I can provide that would get some attention on this bug? It is a serious problem on my company's production site. No mysqld stays up for more than a day.

We're sorry, but the bug system is not the appropriate forum for 
asking help on using MySQL products. Your problem is not the result 
of a bug.

Support on using our products is available both free in our forums
at http://forums.mysql.com and for a reasonable fee direct from our
skilled support engineers at http://www.mysql.com/support/

Thank you for your interest in MySQL.

Additional info:

This looks like a hardware fault or a bug in the operating system, not MySQL. You may try smartctl (smartmontools) and memtest86 to track down hardware problems.

Based on the stack trace in the dmesg output, this could also be a Reiser4 bug: the oops occurs in lock_object, which is reached from reiser4_read. You may want to try some other file system.
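The reading of the stack trace above can be sketched in a few lines of Python. This is only an illustration, assuming the oops format shown in this report ("EIP is at symbol+0x.../0x..." and "[<address>] symbol+0x.../0x..." entries); the helper names are hypothetical, and the sample lines are copied from the report. Note that on kernels of this era the call trace may include stale addresses left on the stack, so individual entries should be treated as hints rather than an exact call chain.

```python
import re
from typing import List, Optional

# Matches entries such as "[<c01f56e7>] reiser4_read+0x77/0xc0".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches the line naming the faulting function, e.g. "EIP is at lock_object+0x52/0x80".
EIP_RE = re.compile(r"EIP is at ([\w.]+)\+0x")

SAMPLE = """\
Jul  6 01:41:25 [kernel] EIP is at lock_object+0x52/0x80
Jul  6 01:41:25 [kernel]  [<c01dc098>] lock_tail+0x68/0x80
Jul  6 01:41:25 [kernel]  [<c01f56e7>] reiser4_read+0x77/0xc0
Jul  6 01:41:25 [kernel]  [<c01677c6>] vfs_read+0xb6/0x180
Jul  6 01:41:25 [kernel]  [<c0167c88>] sys_pread64+0x88/0x90
"""


def faulting_function(log: str) -> Optional[str]:
    """Return the function named on the "EIP is at" line, if present."""
    match = EIP_RE.search(log)
    return match.group(1) if match else None


def call_trace(log: str) -> List[str]:
    """Return the symbol names from the call trace, innermost first."""
    return [match.group(1) for match in FRAME_RE.finditer(log)]


def first_with_prefix(frames: List[str], prefix: str) -> Optional[str]:
    """Return the first frame whose name starts with the given prefix."""
    return next((name for name in frames if name.startswith(prefix)), None)


if __name__ == "__main__":
    frames = call_trace(SAMPLE)
    print("faulted in:", faulting_function(SAMPLE))
    print("file system entry point:", first_with_prefix(frames, "reiser4_"))
    print("system call:", frames[-1] if frames else None)
```

For the lines in this report, the fault is in lock_object, the file system entry point is reiser4_read, and the originating system call is sys_pread64, so the failing code path lies in the kernel's file system layer rather than in mysqld itself.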