Bug #40731 Maria: hang (probably in page cache) under concurrency
Submitted: 14 Nov 2008 9:06 Modified: 16 Jan 2009 10:22
Reporter: Guilhem Bichot Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Maria storage engine Severity:S3 (Non-critical)
Version:5.1-maria OS:Linux
Assigned to: Assigned Account CPU Architecture:Any

[14 Nov 2008 9:06] Guilhem Bichot
Description:
Seen in pushbuild2, even after the fix for BUG#40308.
pushbuild2 reports that 10 queries are stalled, and here are the stack traces, more interesting sorted first:

:::::::::::: maria_bitmap ::::::::::::

Thread 14 (process 27263):

#0  0x00a17402 in __kernel_vsyscall ()
#1  0x0089e256 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x086435d9 in safe_cond_wait (cond=0xb13736d4, mp=0xb1373690, 
    file=0x87ea73b "ma_bitmap.c", line=356) at thr_mutex.c:249
#3  0x085bf93a in _ma_bitmap_flush_all (share=0xb1373158) at ma_bitmap.c:356
#4  0x08601e13 in collect_tables (str=0xb1e40274, 
    checkpoint_start_log_horizon=4310724968) at ma_checkpoint.c:1048
#5  0x086004fa in really_execute_checkpoint () at ma_checkpoint.c:195
#6  0x08600364 in ma_checkpoint_execute (level=CHECKPOINT_MEDIUM, 
    no_wait=1 '\001') at ma_checkpoint.c:132
#7  0x08601106 in ma_checkpoint_background (arg=0x1e) at ma_checkpoint.c:618
#8  0x0089a45b in start_thread () from /lib/libpthread.so.0
#9  0x007f1c4e in clone () from /lib/libc.so.6

:::::::::::: maria_pagecache ::::::::::::

Thread 9 (process 27289):
#0  0x00a17402 in __kernel_vsyscall ()
#1  0x0089e256 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x086435d9 in safe_cond_wait (cond=0x9c83724, mp=0x9151d84, 
    file=0x87e2a72 "ma_pagecache.c", line=1906) at thr_mutex.c:249
#3  0x0857bbe8 in find_block (pagecache=0x9151d20, file=0xb13a5330, pageno=33, 
    init_hits_left=0, wrmode=1 '\001', reg_req=1 '\001', page_st=0xb1ffdbb0)
    at ma_pagecache.c:1905
#4  0x0858036f in pagecache_read (pagecache=0x9151d20, file=0xb13a5330, 
    pageno=33, level=0, buff=0x0, type=PAGECACHE_LSN_PAGE, 
    lock=PAGECACHE_LOCK_WRITE, page_link=0xb1ffdc7c) at ma_pagecache.c:3302
#5  0x085b2a2a in get_head_or_tail_page (info=0xb13a5098, block=0xb1381938, 
    buff=0xb13a54c0 "", length=35, page_type=1, lock=PAGECACHE_LOCK_WRITE, 
    res=0xb1ffdd00) at ma_blockrec.c:1683
#6  0x085b682a in allocate_and_write_block_record (info=0xb13a5098, 
    record=0xb133a278 "\a", row=0xb13a50e0, undo_lsn=1) at ma_blockrec.c:3381
<cut>
#22 0x08263476 in dispatch_command (command=COM_QUERY, thd=0x9cd0830, 
    packet=0x9cd2209 "INSERT INTO j SELECT * FROM `x`", packet_length=31)
    at sql_parse.cc:1160
#23 0x082645de in do_command (thd=0x9cd0830) at sql_parse.cc:817
#24 0x08251e23 in handle_one_connection (arg=0x9cd0830) at sql_connect.cc:1115
#25 0x0089a45b in start_thread () from /lib/libpthread.so.0
#26 0x007f1c4e in clone () from /lib/libc.so.6

============

Thread 8 (process 27291):
#0  0x00a17402 in __kernel_vsyscall ()
#1  0x0089e256 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x086435d9 in safe_cond_wait (cond=0x9cd066c, mp=0x9151d84, 
    file=0x87e2a72 "ma_pagecache.c", line=1515) at thr_mutex.c:249
#3  0x0857c95b in wait_for_readers (pagecache=0x9151d20, block=0xb2301c08)
    at ma_pagecache.c:1515
#4  0x08581f2d in free_block (pagecache=0x9151d20, block=0xb2301c08)
    at ma_pagecache.c:4094
#5  0x085829d2 in flush_cached_blocks (pagecache=0x9151d20, file=0xb132f390, 
    cache=0xb1fcb1c0, end=0xb1fcb1c4, type=FLUSH_RELEASE, 
    first_errno=0xb1fcb128) at ma_pagecache.c:4267
#6  0x08582fab in flush_pagecache_blocks_int (pagecache=0x9151d20, 
    file=0xb132f390, type=FLUSH_RELEASE, filter=0, filter_arg=0x0)
    at ma_pagecache.c:4521
#7  0x08583300 in flush_pagecache_blocks_with_filter (pagecache=0x9151d20, 
    file=0xb132f390, type=FLUSH_RELEASE, filter=0, filter_arg=0x0)
    at ma_pagecache.c:4643
#8  0x08595bc0 in _ma_flush_table_files (info=0xb132f0f8, 
    flush_data_or_index=3, flush_type_for_data=FLUSH_RELEASE, 
    flush_type_for_index=FLUSH_RELEASE) at ma_extra.c:589
#9  0x08609895 in _ma_reenable_logging_for_table (info=0xb132f0f8, 
    flush_pages=1 '\001') at ma_recovery.c:3300
#10 0x0858760f in ha_maria::external_lock (this=0xb1369828, thd=0x9cc0720, 
    lock_type=2) at ha_maria.cc:2364
#11 0x0836bc66 in handler::ha_external_lock (this=0xb1369828, thd=0x9cc0720, 
    lock_type=2) at handler.cc:4433
#12 0x0823fc4f in unlock_external (thd=0x9cc0720, table=0xb134e760, count=2)
    at lock.cc:786
<cut>
#20 0x08263476 in dispatch_command (command=COM_QUERY, thd=0x9cc0720, 
    packet=0x9ce5ba1 "INSERT INTO j SELECT * FROM v", packet_length=29)
    at sql_parse.cc:1160
#21 0x082645de in do_command (thd=0x9cc0720) at sql_parse.cc:817
#22 0x08251e23 in handle_one_connection (arg=0x9cc0720) at sql_connect.cc:1115
#23 0x0089a45b in start_thread () from /lib/libpthread.so.0
#24 0x007f1c4e in clone () from /lib/libc.so.6

And other threads, probably less interesting, blocked in:
:::::::::::: lock_and_wait_for_table_name ::::::::::::
:::::::::::: open_table ::::::::::::
:::::::::::: thr_multi_lock ::::::::::::
:::::::::::: signal thread ::::::::::::
:::::::::::: reading from client ::::::::::::
:::::::::::: innodb ::::::::::::
(sorry for not pasting those threads here but bugs db says my description is too long; you can look at pushbuild2 to find them).

So we have thread 14 waiting on condition in ma_bitmap.c, and threads 9 and 8 waiting on condition in ma_pagecache.c.
The stall of thread 14 does not explain the stall of thread 9 and 8: indeed, thread 14 is waiting for some bitmap-related boolean to become false, and this cannot prevent other threads from reading/writing pages in the page cache. Moreover, thread 9 is on pageno 33 which is not a bitmap page.
On the other hand, the stall of thread 9 can explain the stall of thread 14: 
thread 9 started allocate_and_write_block_record() so has called _ma_bitmap_flushable(info, 1) and thus thread 14 is expected to block.
Note that thread 8 and 9 are both writing to the same table "j".

Given this reasoning, I categorize it as a page cache bug (apparently one or two threads which never wake up from condition waits in the page cache code).

It is apparently not due to threads waiting on mutexes, so Monty's deadlock detector will not help, fixing this bug is not dependent on the deadlock detector.

How to repeat:
# 02:53:15 # runall.pl \ 
# 02:53:15 # --basedir=/export/home/pb2/test/sb_1-125796-1226541190.52/mysql-5.1.30-maria-alpha-linux-i686-test \ 
# 02:53:15 # --vardir=/export/home/pb2/test/sb_1-125796-1226541190.52/mysql-5.1.30-maria-alpha-linux-i686-test/vardirs \ 
# 02:53:15 # --reporters=Deadlock,ErrorLog,Backtrace,Recovery,WinPackage \ 
# 02:53:15 # --mysqld=--loose-falcon-lock-wait-timeout=1 \ 
# 02:53:15 # --mysqld=--loose-falcon-debug-mask=2 \ 
# 02:53:15 # --mysqld=--log-output=none \ 
# 02:53:15 # --mysqld=--log=off \ 
# 02:53:15 # --mysqld=--skip-safemalloc \ 
# 02:53:15 # --duration=1200 \ 
# 02:53:15 # --engine=Maria \ 
# 02:53:15 # --grammar=conf/maria_bulk_insert.yy \ 
# 02:53:15 # --queries=100000
[24 Nov 2008 20:01] Guilhem Bichot
More precise post:
Same stack traces under Windows too:
push: guilhem@mysql.co... 2008-11-22 16:27:59 
machine "winx86 max", test maria_bulk_insert:
http://clustra.norway.sun.com/~bteam/pb2/web.py?action=archive_download&archive_id=142573&...;
pretty=please
But note that this is before Sanja's latest push.
[15 Dec 2008 10:08] Bugs System
Pushed into 6.0.9-alpha  (revid:sanja@sun.com-20081124074821-4ajjre4p6f8j4ws3) (version source revid:guilhem@mysql.com-20081213204800-0nubni3t4ihn4hv9) (pib:5)
[16 Jan 2009 10:22] Guilhem Bichot
Hasn't been seen for long. Probably duplicate of BUG#40308