Bug #50201 Server crashes in explain_filename on an InnoDB partitioned table
Submitted: 9 Jan 2010 0:49 Modified: 15 Mar 2010 15:50
Reporter: Elena Stepanova Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:5.5.1-m2 OS:Any
Assigned to: Mattias Jonsson
Triage: Triaged: D1 (Critical)

[9 Jan 2010 0:49] Elena Stepanova
Description:
#0  0x00002af61d8eaea3 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00000000005cc065 in handle_segfault (sig=11) at mysqld.cc:2575
#2  <signal handler called>
#3  explain_filename (thd=0x16bad70,
    from=0x45809050 "table1000_innodb_key_pk_parts_2_int_autoinc#P#p0",
    to=0x45808f50 "`table1000_innodb_key_pk_parts_2_int_autoinc` /* ",
    to_length=<value optimized out>,
    explain_mode=EXPLAIN_PARTITIONS_AS_COMMENT) at sql_table.cc:329
#4  0x00000000007779ec in innobase_convert_identifier (
    buf=0x458091c7 "format; info bits 0\n 0: len 4; hex 80000008; asc     ;;\n 1: le328; hex 80╟ш\216\035Ж*", buflen=569,
    id=0x15e3c4d "table1000_innodb_key_pk_parts_2_int_autoinc#P#p0", idlen=48,
    thd=0x16bad70, file_id=<value optimized out>) at handler/ha_innodb.cc:1732
#5  0x000000000077822c in innobase_convert_name (
    buf=0x458091c0 "`test`.format; info bits 0\n 0: len 4; hex 80000008; asc     ;;\n 1: le328; hex 80╟ш\216\035Ж*", buflen=576,
    id=0xfffffffffffffffc <Address 0xfffffffffffffffc out of bounds>,
    idlen=53, thd=0x16bad70, table_id=<value optimized out>)
    at handler/ha_innodb.cc:1823
#6  0x0000000000800a86 in ut_print_namel (f=0x11975e0,
    trx=<value optimized out>, table_id=<value optimized out>,
    name=<value optimized out>, namelen=0) at ut/ut0ut.c:551
#7  0x000000000078e2d1 in lock_rec_print (file=0x11975e0, lock=0x187d1e8)
    at lock/lock0lock.c:4218
#8  0x00000000007901ad in lock_print_info_all_transactions (file=0x11975e0)
    at lock/lock0lock.c:4438
#9  0x00000000007e1628 in srv_printf_innodb_monitor (file=0x11975e0,
    trx_start=0x0, trx_end=0x0) at srv/srv0srv.c:1758
#10 0x00000000007e1dfd in srv_lock_timeout_and_monitor_thread (
    arg=<value optimized out>) at srv/srv0srv.c:2003
#11 0x00002af61d8e6143 in start_thread () from /lib64/libpthread.so.0
#12 0x00002af61df628cd in clone () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()

Could not reproduce on 5.5.0-m2 or 5.1.42

How to repeat:
To reproduce with RQG (the test with threads=2 also causes the crash, with greater number of threads it just seems to happen faster):

runall.pl \
--grammar=part_crash.yy \
--gendata=part_crash.zz \
--engine=Innodb \
--mysqld=--log-output=file \
--mysqld=--skip-safemalloc \
--rows=1000 \
--threads=16 \
--queries=100000000 \
--duration=120 \
--basedir=/export/home/tmp/qauser/bin64_551-m2/ \
--mysqld=--transaction-isolation=SERIALIZABLE 

cat part_crash.yy

query:
  transaction |
  delete ;

transaction:
  START TRANSACTION |
  SELECT SLEEP( 1 ) ;

delete:
  DELETE FROM _table WHERE _field_key = _digit ORDER BY _field_key LIMIT 2 ;

<EOF>

cat part_crash.zz

$tables = {
        partitions => [ 'KEY (pk) PARTITIONS 2' ]
};

$fields = {
        types => [ 'int' ],
};

$data = {
        numbers => [ 'digit' ]
}

<EOF>
[11 Jan 2010 13:15] Mikhail Izioumtchenko
Is InnoDB code in 5.5.1-m2 any different from InnoDB in 5.1.42 and 5.5.1-m1
where the bug is not reproducing?
[11 Jan 2010 14:58] Elena Stepanova
5.5.0-m2 contains InnoDB plugin 1.0.5, while 5.5.1-m2 goes with 1.0.6, so there is difference; also, bug#32430 was fixed in 5.5.1, but not in 5.5.0.

However, 5.1.42 also contains InnoDB plugin 1.0.6 (and, presumably, bugfix for bug#32430), so I'm not sure whether there is a difference in InnoDB plugin code between 5.1.42 and 5.5.1-m2.
(I could not reproduce the crash on 5.1.42 neither with the built-in InnoDB nor with the plugin).
[11 Jan 2010 16:59] Mikhail Izioumtchenko
if this is not reproducible in 5.1.42 there's probably not much we could do here.
On the other hand considering it's a SEGV there's a chance it's a simple bug debuggable by a good coredump and code review.
Could you reproduce it with 5.5.1-m2 with mysqld compiled without -O3, those
<value optimized out> coredumps are a pain to debug, then 
please provide the full stack backtrace (gdb's bt full) of the thread
in question. Assigning to Jimmy in the meantime.
[12 Jan 2010 9:06] Sergey Vojtovich
From the stack trace... it looks like explain_filename() is called by InnoDB "timeout_and_monitor" thread, which has no binding to THD object.

When explain_filename() attempts to get some nice text ER(ER_PARTITION_NAME) macro is used. This macro uses current_thd. As there is no binding to THD object we end up with server crash.

How to fix: add new ER_THD(thd, err_code) macro, which will use explicit thd object instead of current_thd.
[12 Jan 2010 15:13] Elena Stepanova
bt full of all threads on debug version

Attachment: bug50201_all_threads_bt_full.out (application/octet-stream, text), 55.54 KiB.

[13 Jan 2010 7:29] Marko Mäkelä
The function explain_filename was introduced by Mattias Jonsson in order to fix Bug #32430. I noticed and fixed some bugs in explain_filename before making InnoDB call it, that is, actually fixing Bug #32430 in InnoDB.

Unfortunately, I did not notice that explain_filename could crash when called outside a MySQL connection handler thread. InnoDB background threads can report table names to the error log, even before any client connection has been started. Therefore, explain_filename must be callable when current_thd==NULL.

I am tentatively assigning this to Mattias Jonsson and setting myself as a reviewer. Note that this bug is in 5.1 too, in both the built-in InnoDB and the Plugin.
[13 Jan 2010 9:21] Mattias Jonsson
Seems like it is not affecting 5.1, since it is first in mysql-trunk (5.5+) that uses current_thd in the ER macro.
[13 Jan 2010 10:35] Mattias Jonsson
This patch seems to fix the problem (as Sergey suggested), I'm currently looking how to create a repeatable test case for mysql-test-run.

b50201-mysql-trunk-bugfixing$ bzr diff
=== modified file 'sql/sql_table.cc'
--- sql/sql_table.cc	2009-12-11 09:39:38 +0000
+++ sql/sql_table.cc	2010-01-13 09:59:39 +0000
@@ -291,7 +291,7 @@
   {
     if (explain_mode == EXPLAIN_ALL_VERBOSE)
     {
-      to_p= strnmov(to_p, ER(ER_DATABASE_NAME), end_p - to_p);
+      to_p= strnmov(to_p, ER_THD(thd, ER_DATABASE_NAME), end_p - to_p);
       *(to_p++)= ' ';
       to_p= add_identifier(thd, to_p, end_p, db_name, db_name_len);
       to_p= strnmov(to_p, ", ", end_p - to_p);
@@ -304,7 +304,7 @@
   }
   if (explain_mode == EXPLAIN_ALL_VERBOSE)
   {
-    to_p= strnmov(to_p, ER(ER_TABLE_NAME), end_p - to_p);
+    to_p= strnmov(to_p, ER_THD(thd, ER_TABLE_NAME), end_p - to_p);
     *(to_p++)= ' ';
     to_p= add_identifier(thd, to_p, end_p, table_name, table_name_len);
   }
@@ -321,18 +321,18 @@
     if (name_type != NORMAL)
     {
       if (name_type == TEMP)
-        to_p= strnmov(to_p, ER(ER_TEMPORARY_NAME), end_p - to_p);
+        to_p= strnmov(to_p, ER_THD(thd, ER_TEMPORARY_NAME), end_p - to_p);
       else
-        to_p= strnmov(to_p, ER(ER_RENAMED_NAME), end_p - to_p);
+        to_p= strnmov(to_p, ER_THD(thd, ER_RENAMED_NAME), end_p - to_p);
       to_p= strnmov(to_p, " ", end_p - to_p);
     }
-    to_p= strnmov(to_p, ER(ER_PARTITION_NAME), end_p - to_p);
+    to_p= strnmov(to_p, ER_THD(thd, ER_PARTITION_NAME), end_p - to_p);
     *(to_p++)= ' ';
     to_p= add_identifier(thd, to_p, end_p, part_name, part_name_len);
     if (subpart_name)
     {
       to_p= strnmov(to_p, ", ", end_p - to_p);
-      to_p= strnmov(to_p, ER(ER_SUBPARTITION_NAME), end_p - to_p);
+      to_p= strnmov(to_p, ER_THD(thd, ER_SUBPARTITION_NAME), end_p - to_p);
       *(to_p++)= ' ';
       to_p= add_identifier(thd, to_p, end_p, subpart_name, subpart_name_len);
     }

=== modified file 'sql/unireg.h'
--- sql/unireg.h	2009-10-22 22:30:28 +0000
+++ sql/unireg.h	2010-01-13 09:58:29 +0000
@@ -46,6 +46,10 @@
 #define ER(X)         CURRENT_THD_ERRMSGS[(X) - ER_ERROR_FIRST]
 #define ER_DEFAULT(X) DEFAULT_ERRMSGS[(X) - ER_ERROR_FIRST]
 #define ER_SAFE(X) (((X) >= ER_ERROR_FIRST && (X) <= ER_ERROR_LAST) ? ER(X) : "Invalid error code")
+#define ER_THD(thd,X) ((thd) ? \
+                       (thd)->variables.lc_messages->errmsgs->errmsgs[(X) - \
+                       ER_ERROR_FIRST] : \
+                       ER_DEFAULT(X))
 
 
 #define ERRMAPP 1				/* Errormap f|r my_error */
[13 Jan 2010 10:35] Mattias Jonsson
Removing 5.1.41+ from version field, since it is not affected.
[13 Jan 2010 10:43] Marko Mäkelä
Matthias, from an educated guess how the macros might work, your patch looks OK to me.

Side note: why is CURRENT_THD_ERRMSGS there in the first place? I was under the impression that current_thd expands to a slow function call that should be avoided whenever possible. As far as I understand, that perceived slowness is the reason why handler::ha_thd() and ha_innobase::user_thd exist.
[13 Jan 2010 11:07] Mattias Jonsson
Marko:
CURRENT_THD_ERRMSGS is new in 5.5+ and as I understand it is because of the new internationalization effort to get error messages to the current language setting per session.

Btw, do you have an idea on how to easily create a repeatable test case for mysql-test-run, (perhaps by adding DEBUG_SYNC() somewhere in the mysql server code in a two thread test)?
[13 Jan 2010 11:56] Marko Mäkelä
Could the repeatable test case be something simple like this:

BEGIN;
SELECT COUNT(*) FROM partitioned_table FOR UPDATE;
--send
SELECT SLEEP(10);
-- another connection
SET innodb_lock_wait_timeout=1;
SELECT COUNT(*) FROM partitioned_table FOR UPDATE;

-- lock wait timeout will occur

A deadlock would be similar, but easiest to do with two tables. The first transaction would lock table A by a SELECT, the second one table B and attempt to access table A, and the first one would try to lock table B. You would need the --send and reap statements in mysql-test for this.
[13 Jan 2010 15:14] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/96792

2932 Mattias Jonsson	2010-01-13
      Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
      
      Problem was that in mysql-trunk the ER() macro is now dependent on current_thd
      and the innodb monitor thread has no binding to that thd object. This cause 
      the crash because of bad derefencing.
      
      Solution was to add a new macro which take the thd as an argument (which the innodb
      thread uses for the call).
     @ mysql-test/r/partition_innodb_status_file.result
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New result file
     @ mysql-test/t/partition_innodb_status_file-master.opt
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New opt-file
     @ mysql-test/t/partition_innodb_status_file.test
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New test file for testing the innodb monitor thread with partitioning.
        (i.e. possible bugs related to explain_filename and other functions).
     @ sql/sql_table.cc
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        Using the new ER_THD() macro instead, to supply the correct THD reference.
     @ sql/unireg.h
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        Added ER_THD() macro to be used when current_thd is not apropriate or does not
        resolve to the correct thd.
[13 Jan 2010 15:22] Mattias Jonsson
Thanks Marko for the test.

I finally got stuck on the timing since it seems to be hardcoded for the innodb monitor thread to only wake up every 15 seconds, which I don't know how to decrease. Do you know how one could do a test that does not take at least 15 seconds to run?

If you (the reviewers) want, I can at least move the test to the parts-suite.

Note that the crash only occurs when running with --innodb-status-file=1
[14 Jan 2010 10:58] Marko Mäkelä
Sorry Mattias, I don’t know how the 15-second delay could be avoided in the test case. Your patch is OK to push, as far as I am concerned.
[10 Feb 2010 9:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99789

2938 Mattias Jonsson	2010-02-10
      Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
      
      Problem was that in mysql-trunk the ER() macro is now dependent on current_thd
      and the innodb monitor thread has no binding to that thd object. This cause 
      the crash because of bad derefencing.
      
      Solution was to add a new macro which take the thd as an argument (which the innodb
      thread uses for the call).
      
      (Updated according to reviewers comments, i.e. added ER_THD_OR_DEFAULT and
      moved test to suite parts.)
     @ mysql-test/suite/parts/r/partition_innodb_status_file.result
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New test result file
     @ mysql-test/suite/parts/t/partition_innodb_status_file-master.opt
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New test opt file
     @ mysql-test/suite/parts/t/partition_innodb_status_file.test
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        New test.
        Note that the innodb monitor thread only runs every 15 seconds, so this
        test will take at least 15 seconds, so I have moved it to the parts suite.
     @ sql/sql_table.cc
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        Using thd safe ER macro.
     @ sql/unireg.h
        Bug#50201: Server crashes in explain_filename on an InnoDB partitioned table
        
        Added ER macros for use with specified thd pointer.
[10 Feb 2010 11:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99801

3875 Mattias Jonsson	2010-02-10 [merge]
      manual merge of bug#50201 from mysql-next-mr-bugfixing
      into mysql-6.0-codebase-bugfixing, due to read only
      innodb_lock_wait_timeout variable
[10 Feb 2010 11:09] Mattias Jonsson
pushed to mysql-trunk-bugfixing, mysql-next-mr-bugfixing and mysql-6.0-codebase-bugfixing.
[12 Feb 2010 17:41] Bugs System
Pushed into 5.5.2-m2 (revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (version source revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (merge vers: 5.5.2-m2) (pib:16)
[13 Feb 2010 8:36] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100213083436-9pesg4h55w1mekxc) (version source revid:luis.soares@sun.com-20100211135109-t63avry9fqpgyh78) (merge vers: 6.0.14-alpha) (pib:16)
[13 Feb 2010 8:38] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100213083327-cee4ao3jpg33eggv) (version source revid:luis.soares@sun.com-20100211135018-1f9dbghg0itszigo) (pib:16)
[4 Mar 2010 17:10] Paul Dubois
Noted in 5.5.2, 6.0.14 changelogs.

The server crashed when an InnoDB background thread attempted to
write a message containing a partitioned table name to the error log.

Setting report to Need Merge pending push of Celosia to release tree.
[15 Mar 2010 15:50] Paul Dubois
This is in mysql-trunk now. Closing.