Bug #57249 crash in mysql_unlock_tables, sysbench with high number of threads
Submitted: 5 Oct 2010 13:53   Modified: 22 Nov 2010 15:21
Reporter: Mikhail Izioumtchenko
Status: Not a Bug
Category: MySQL Server: Locking   Severity: S2 (Serious)
Version: 5.1, 5.5   OS: Any
Assigned to: Dmitry Lenev   CPU Architecture: Any
Tags: regression

[5 Oct 2010 13:53] Mikhail Izioumtchenko
Description:
A crash like the one below happens intermittently, but fairly often, when I run
sysbench with 256 threads. 5.5 crashes; 5.1 does not.

extracting stack backtrace for core.27794...
 Thread 1 (Thread 28544):
 #0  0x000000371ce0cb3c in pthread_kill () from /lib64/libpthread.so.0
 #1  0x00000000005097be in handle_segfault (sig=6)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/mysqld.cc:2517
 #2  <signal handler called>
 #3  0x000000371c633955 in raise () from /lib64/libc.so.6
 #4  0x000000371c635135 in abort () from /lib64/libc.so.6
 #5  0x000000371c6716bb in __libc_message () from /lib64/libc.so.6
 #6  0x000000371c677076 in malloc_printerr () from /lib64/libc.so.6
 #7  0x0000000000543ebb in close_thread_tables (thd=0x134cd1a0)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_base.cc:1505
 #8  0x00000000005758e0 in mysql_execute_command (thd=0x134cd1a0)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_parse.cc:4445
 #9  0x00000000005862f5 in Prepared_statement::execute (this=0x7f0298038980,
     expanded_query=<value optimized out>, open_cursor=false)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_prepare.cc:3767
 #10 0x00000000005894df in Prepared_statement::execute_loop (
     this=0x7f0298038980, expanded_query=0x7f03eabbdec0,
     open_cursor=<value optimized out>, packet=<value optimized out>,
     packet_end=<value optimized out>)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_prepare.cc:3420
 #11 0x0000000000589a60 in mysqld_stmt_execute (thd=0x134cd1a0,
     packet_arg=0x134d07a1 "", packet_length=15)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_prepare.cc:2569
 #12 0x000000000057ad81 in dispatch_command (command=<value optimized out>,
     thd=0x134cd1a0, packet=0x134d07a1 "", packet_length=15)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_parse.cc:1051
 #13 0x000000000060de02 in do_handle_one_connection (
     thd_arg=<value optimized out>)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_connect.cc:1192
 #14 0x000000000060dec0 in handle_one_connection (arg=0x134cd1a0)
     at /data2/bench/mizioum/nroot/mysql_src_c55b/sql/sql_connect.cc:1131
 #15 0x000000371ce07951 in start_thread () from /lib64/libpthread.so.0
 #16 0x000000371c6e4d3d in clone () from /lib64/libc.so.6

How to repeat:
sysbench readonly 256 threads, 25 million rows, 10G innodb buffer pool
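For reference, the repeat step corresponds to a sysbench 0.4-style OLTP invocation roughly like the following. This is an illustrative sketch only: the report does not give the exact command line, so the connection parameters and non-essential options here are assumptions.

```shell
# Illustrative sketch (sysbench 0.4 oltp test); connection options are placeholders.
# Prepare the 25-million-row test table, then run the read-only workload.
sysbench --test=oltp --mysql-user=root \
         --oltp-table-size=25000000 prepare
sysbench --test=oltp --mysql-user=root \
         --oltp-table-size=25000000 \
         --oltp-read-only=on --num-threads=256 \
         --max-requests=0 --max-time=300 run
```

The 10G InnoDB buffer pool would be configured on the server side, e.g. `innodb_buffer_pool_size=10G` in my.cnf.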
[7 Oct 2010 14:58] Mikhail Izioumtchenko
It does happen with 5.1 as well, but so far I have only seen it once:

 Thread 1 (Thread 3621):
 #0  0x000000371ce0cb3c in pthread_kill () from /lib64/libpthread.so.0
 #1  0x00000000005a86e6 in handle_segfault (sig=6) at mysqld.cc:2609
 #2  <signal handler called>
 #3  0x000000371c633955 in raise () from /lib64/libc.so.6
 #4  0x000000371c635135 in abort () from /lib64/libc.so.6
 #5  0x000000371c6716bb in __libc_message () from /lib64/libc.so.6
 #6  0x000000371c677076 in malloc_printerr () from /lib64/libc.so.6
 #7  0x0000000000595522 in select_send::send_eof (this=0x7f8c70058b08)
     at sql_class.cc:1726
 #8  0x000000000060f45d in do_select (join=0x7f8c7000cd98,
     fields=<value optimized out>, table=0x0, procedure=0x0)
     at sql_select.cc:11181
 #9  0x0000000000624237 in JOIN::exec (this=<value optimized out>)
     at sql_select.cc:2331
 #10 0x00000000006200b2 in mysql_select (thd=0x14bbe480,
     rref_pointer_array=<value optimized out>, tables=0x7f8c70057e90,
     wild_num=0, fields=<value optimized out>, conds=<value optimized out>,
     og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0,
     select_options=2417248768, result=0x7f8c70058b08, unit=0x7f8c70056c50,
     select_lex=0x7f8c70057078) at sql_select.cc:2526
 #11 0x0000000000625834 in handle_select (thd=0x14bbe480, lex=0x7f8c70056bb0,
     result=0x7f8c70058b08, setup_tables_done_option=0) at sql_select.cc:269
 #12 0x00000000005b5d25 in execute_sqlcom_select (thd=0x14bbe480,
     all_tables=0x7f8c70057e90) at sql_parse.cc:5127
 #13 0x00000000005bafdd in mysql_execute_command (thd=0x14bbe480)
     at sql_parse.cc:2292
 #14 0x00000000006309a3 in Prepared_statement::execute (this=0x7f8c70056850,
     expanded_query=<value optimized out>, open_cursor=false)
     at sql_prepare.cc:3622
 #15 0x00000000006337fa in Prepared_statement::execute_loop (
     this=0x7f8c70056850, expanded_query=0x7f8d19ccf250,
     open_cursor=<value optimized out>, packet=<value optimized out>,
     packet_end=<value optimized out>) at sql_prepare.cc:3297
 #16 0x0000000000633ec8 in mysqld_stmt_execute (thd=0x14bbe480,
     packet_arg=0x14bc16f1 "", packet_length=19) at sql_prepare.cc:2507

   1723   /* Unlock tables before sending packet to gain some speed */
   1724   if (thd->lock)
   1725   {
   1726     mysql_unlock_tables(thd, thd->lock);
   1727     thd->lock=0;
   1728   }

So it's not a regression from 5.1 to 5.5. Whether it's a regression
introduced into 5.1 recently I can't say, because I only started
using this machine a few days ago.
[7 Oct 2010 16:38] Davi Arnaut
Looks like a PS or locking bug. Would be nice if you could preserve the core file for later examination.
[7 Oct 2010 16:48] Mikhail Izioumtchenko
I'm saving the core files and can provide access/location.
Not sure it's PS-related; I think I saw a crash without prepared statements
as well, and will double-check once I'm done trying to get a core dump from
an unoptimized build.
[8 Nov 2010 15:39] Calvin Sun
Dimitri had a similar problem:

I'm observing MySQL constantly crashing on a simple read-only test (the CPU is not even 10% busy yet). The goal is to reach 1600 running sessions, but it never gets past 800-1000 sessions.

The printed back trace is here:

stack_bottom = (nil) thread_stack 0x40000
/apps/mysql556-rc/libexec/mysqld(my_print_stacktrace+0x29) [0x8a9529]
/apps/mysql556-rc/libexec/mysqld(handle_segfault+0x3fb) [0x5b1e6b]
/lib64/libpthread.so.0() [0x3ab920f0f0]
/apps/mysql556-rc/libexec/mysqld(create_thread_to_handle_connection(THD*)+0x142) [0x5b6d12]
/apps/mysql556-rc/libexec/mysqld(handle_connections_sockets()+0x707) [0x5b5f67]
/apps/mysql556-rc/libexec/mysqld(mysqld_main(int, char**)+0x6e8) [0x5b6678]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x3ab8a1eb1d]
/apps/mysql556-rc/libexec/mysqld() [0x4e5629]

Did you observe something similar? Any idea what it could be? (I haven't seen such a crash in many years.)
 
Rgds,
-Dimitri
[8 Nov 2010 15:40] Calvin Sun
Dimitri's further analysis:

Hi ALL,

I'm happy to tell you that over the weekend I found where the problem is!
(When a problem drives you mad, it's hard to get it out of your head.)

The limitation comes down to one simple thing: threads under Linux are implemented as *processes* (easy to forget when you mainly work with Solaris). So if the "nproc" limit is not high enough for the mysql user, the server cannot create more sessions (threads) than that limit allows (default: 1024).

So the bug is not within InnoDB but at the MySQL level for sure, and I think it should be easy to fix, as it's somewhere in the thread-creation code.

Also, it would be good to keep this limit in mind when validating the "max_connections" value: check not only the max-open-files limit, but the max processes/threads limit too. That way:

  - 1. It simplifies analysis and debugging: a user who cannot set max_connections to 2000 will look into why it's not possible on their system up front, rather than hitting the limit during a production load.

  - 2. It avoids "unexpected" errors - such a crash would never be possible. It is also quite odd that MySQL accepts your max_connections setting but then refuses your connections.

From all the research I did last Friday, it seems this issue happens quite often, and there are even several bugs logged with the same error 11 (EAGAIN) that got no answer or fix - understandably, since the situation is impossible to reproduce on a system whose "nproc" limit is high enough, and the problem may or may not appear depending on how many other processes are already running on the system. So I'll write a short blog post for others who may hit the same wall one day.

Rgds,
-Dimitri
[8 Nov 2010 18:56] Mikhail Izioumtchenko
Dimitri's problem is different: it's a setup problem (nproc),
probably aggravated by poor diagnostics - an assert instead of an error message.
This bug happens in some form of cleanup and looks like a double free
of memory.
[9 Nov 2010 11:45] Davi Arnaut
Reported Dimitri's issue as Bug#58080.
[12 Nov 2010 17:22] Dmitry Lenev
Hello!

It turned out that the host on which the original problem was reported was using a development release of glibc, 2.11.90-20.

AFAIU this version of glibc is affected by a bug in the sanity checking inside the free() procedure, with symptoms similar to those described above:
https://bugzilla.redhat.com/show_bug.cgi?id=589706
https://bugzilla.redhat.com/show_bug.cgi?id=593396

This bug is fixed in later versions of glibc.

(The timeline looks like this:
- On 2010-04-03 the patch introducing the bug is committed
  http://sourceware.org/git/?p=glibc.git;a=commit;h=90a3055e8bdd9308eceeadc0b37278f324ec6b5d
- On 2010-04-16 glibc-2.11.90-20 is released
- On 2010-05-06 the patch fixing the bug is committed
  http://sourceware.org/git/?p=glibc.git;a=commit;h=5f24d53acb2cd02ffb4ac925c0244e4f1f854499
)

Judging from the symptoms described in this report, the fact that both the 5.5 and 5.1 versions of the server are affected, and the fact that in 5.5 different memory-allocating parts of the code are affected, I believe the problem described in this report boils down to the glibc bug above.

Could you please re-test after updating glibc to a version in which the bug is fixed? Meanwhile I am putting this bug report into the "Need Feedback" state.
[22 Nov 2010 15:21] Mikhail Izioumtchenko
An OS issue, as Dmitry diagnosed; after upgrading to a stable glibc release
the problem is gone.