Description:
Unfortunately, this problem has been very hard to track down. Here is the basic rundown.
Our application is distributed across seven servers. The application instance on each node can access data on any other node as well as on its own. After running for anywhere from 5 minutes to 5 days, mysqld takes up 99% CPU (as reported by `top`) and goes into state "D", which the `ps` man page describes as "Uninterruptible sleep (usually IO)".
At this point, not even `kill -9` will stop mysqld. The server accepts no new connections, either over the network or by pipe. The only workaround we have found is to reboot the box so that mysql restarts.
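For anyone trying to catch the hang as it happens, here is a minimal sketch that lists processes currently in state "D" by reading `/proc/<pid>/stat`. It assumes a Linux `/proc` filesystem; the function names `parse_stat` and `uninterruptible_processes` are made up for this illustration and are not part of any MySQL or kernel tooling.

```python
import glob


def parse_stat(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for every process in state 'D'."""
    found = []
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable; skip it
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Run periodically (from cron, say), this would record the moment mysqld first enters state "D" so it can be matched against the kernel log timestamps.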
Some basic info (each node is identical):
Dual Xeon 3.2 GHz, hyperthreading on (4 logical processors)
4GB RAM
Kernel 2.6.12-rc4-mm1 SMP
MySQL 4.1.12-log
MySQL data directory is on a Reiser4 filesystem
All tables are MyISAM
Hardware RAID 0, SATA 300 GB drives
At the point where mysql hangs, the kernel logs the following message:
Jul 6 01:41:25 [kernel] Unable to handle kernel NULL pointer dereference at virtual address 00000003
Jul 6 01:41:25 [kernel] printing eip:
Jul 6 01:41:25 [kernel] c01dbc12
Jul 6 01:41:25 [kernel] *pde = 00000000
Jul 6 01:41:25 [kernel] Oops: 0002 [#1]
Jul 6 01:41:25 [kernel] PREEMPT SMP
Jul 6 01:41:25 [kernel] Modules linked in:
Jul 6 01:41:25 [kernel] CPU: 0
Jul 6 01:41:25 [kernel] EIP: 0060:[<c01dbc12>] Not tainted VLI
Jul 6 01:41:25 [kernel] EFLAGS: 00010202 (2.6.12-rc4-mm1)
Jul 6 01:41:25 [kernel] EIP is at lock_object+0x52/0x80
Jul 6 01:41:25 [kernel] eax: cdb07e9c ebx: 00000003 ecx: ef79be88 edx: ef79be9c
Jul 6 01:41:25 [kernel] esi: cbc06380 edi: ef79bed0 ebp: ef79bed0 esp: ef79bb40
Jul 6 01:41:25 [kernel] ds: 007b es: 007b ss: 0068
Jul 6 01:41:25 [kernel] Process mysqld (pid: 29268, threadinfo=ef79a000 task=e133da70)
Jul 6 01:41:25 [kernel] Stack: 00000000 cbc06380 00000000 c01dc098 ef79bed0 cbc063b4 00000000 cbc063f0
Jul 6 01:41:25 [kernel] ef79bed0 c01dc155 ef79bed0 00000000 00000000 00000001 00000000 00000000
Jul 6 01:41:25 [kernel] ef79bed0 cbc06380 cbc063f0 c01dc400 ef79bed0 00000000 ffffffff ffffffff
Jul 6 01:41:25 [kernel] Call Trace:
Jul 6 01:41:25 [kernel] [<c01dc098>] lock_tail+0x68/0x80
Jul 6 01:41:25 [kernel] [<c01dc155>] longterm_lock_tryfast+0xa5/0xd0
Jul 6 01:41:25 [kernel] [<c01dc400>] longterm_lock_znode+0x280/0x2a0
Jul 6 01:41:25 [kernel] [<c01ecd57>] cbk_cache_scan_slots+0x147/0x2f0
Jul 6 01:41:25 [kernel] [<c01ecf3b>] cbk_cache_search+0x3b/0x60
Jul 6 01:41:25 [kernel] [<c01ebbf3>] coord_by_handle+0x13/0x40
Jul 6 01:41:25 [kernel] [<c01ebbac>] object_lookup+0xbc/0xf0
Jul 6 01:41:25 [kernel] [<c021ee82>] find_file_item+0x122/0x1c0
Jul 6 01:41:25 [kernel] [<c0220dce>] read_file+0xfe/0x340
Jul 6 01:41:25 [kernel] [<c0212430>] read_extent+0x0/0x210
Jul 6 01:41:25 [kernel] [<c01e202e>] txn_begin+0x1e/0x30
Jul 6 01:41:25 [kernel] [<c022123e>] read_unix_file+0x21e/0x340
Jul 6 01:41:25 [kernel] [<c0324d74>] as_dispatch_request+0x164/0x300
Jul 6 01:41:25 [kernel] [<c03765e0>] tw_scsi_queue+0x170/0x1f0
Jul 6 01:41:25 [kernel] [<c01df0dc>] init_context+0x6c/0xa0
Jul 6 01:41:25 [kernel] [<c01f56e7>] reiser4_read+0x77/0xc0
Jul 6 01:41:25 [kernel] [<c01677c6>] vfs_read+0xb6/0x180
Jul 6 01:41:25 [kernel] [<c0167c88>] sys_pread64+0x88/0x90
Jul 6 01:41:25 [kernel] [<c0103115>] syscall_call+0x7/0xb
Jul 6 01:41:25 [kernel] Code: 51 0c 89 79 04 89 71 08 8b 58 04 89 41 0c 89 50 04 89 13 89 5a 04 8d 51 14 ff 47 18 8b 86 84 00 00 00 8b 58 04 89 41 14 89 50 04 <89> 13 89 5a 04 c7 01 00 00 00 00 8b 4f 0c 85 c9 74 03 ff 46 7c
Jul 6 01:41:25 [kernel] <6>note: mysqld[29268] exited with preempt_count 1
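To make the oops above easier to compare across nodes, here is a rough sketch that pulls the function names out of the "Call Trace:" section of a log in this format. The regular expression and the name `call_trace` are assumptions made for this example; they match the lines shown above, but other kernel versions may format traces differently.

```python
import re

# Matches frames such as "[<c01dc098>] lock_tail+0x68/0x80".
FRAME = re.compile(r"\[<([0-9a-f]{8,16})>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_trace(log_lines):
    """Return [(function, address), ...] for the first call trace found."""
    frames = []
    in_trace = False
    for line in log_lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue  # ignore register dumps and other lines before the trace
        match = FRAME.search(line)
        if match is None:
            break  # first non-frame line (e.g. "Code: ...") ends the trace
        frames.append((match.group(2), match.group(1)))
    return frames


if __name__ == "__main__":
    sample = [
        "Jul 6 01:41:25 [kernel] Call Trace:",
        "Jul 6 01:41:25 [kernel] [<c01dc098>] lock_tail+0x68/0x80",
        "Jul 6 01:41:25 [kernel] [<c01f56e7>] reiser4_read+0x77/0xc0",
        "Jul 6 01:41:25 [kernel] [<c01677c6>] vfs_read+0xb6/0x180",
        "Jul 6 01:41:25 [kernel] Code: 51 0c 89 79 04",
    ]
    for function, address in call_trace(sample):
        print(address, function)
```

In the trace above, the frames between `sys_pread64` and the faulting `lock_object` include `reiser4_read`, so the fault seems to occur inside the filesystem's read path during a `pread` issued by mysqld. Treat that as a reading of the log, not a confirmed diagnosis.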
How to repeat:
Not sure. It probably has something to do with disk I/O, since the kernel puts mysqld in state "D".
Suggested fix:
Make it not hang on I/O? I hate to presume to know more than you guys :D