MySQL Bugs: #7186: Server crashes in query_cache

Bug #7186	Server crashes in query_cache_abort()
Submitted:	10 Dec 2004 22:45	Modified:	5 Apr 2005 3:00
Reporter:	Marko Mäkelä	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	5.0.3-bk	OS:	Linux (GNU/Linux)
Assigned to:	Oleksandr Byelkin	CPU Architecture:	Any

Description:
When testing InnoDB crash recovery, I ran the test "ibtest7a" and then killed and restarted mysqld. The InnoDB crash recovery went fine, but shortly after that, the server crashed:

041211  0:30:52  InnoDB: Started; log sequence number 0 24507213
[New Thread 147466 (LWP 21230)]
041211  0:30:52 [Warning] mysql.user table is not updated to new password format; Disabling new password usage until mysql_fix_privilege_tables is run

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 16384 (LWP 21220)]
0x08246820 in query_cache_abort ()
(gdb) bt
#0  0x08246820 in query_cache_abort ()
#1  0x0816c182 in my_message_sql ()
#2  0x084054ca in my_error (nr=1146, MyFlags=0) at my_error.c:69
#3  0x081a7b87 in openfrm ()
#4  0x081a5192 in open_unireg_entry ()
#5  0x081a6578 in open_table ()
#6  0x081a6d15 in open_tables ()
#7  0x082661d3 in my_tz_init ()
#8  0x0816fd35 in main ()

For some reason, the debugging information for query_cache_abort() is not available. Based on the disassembly, the error occurs here:

    if (query_block)			// Test if changed by other thread
    {
      DUMP(&query_cache);
      BLOCK_LOCK_WR(query_block);

It's either the DUMP or BLOCK_LOCK_WR statement.

How to repeat:
Run ibtest7a. killall -KILL mysqld. Start mysqld, and immediately after that, start ibtest7a again, so that it will block until InnoDB crash recovery is completed. You'll get an instant crash after InnoDB crash recovery.

ibtest7a does REPLACE INTO queries. Maybe this is related to that. The bug does not occur when mysqld is configured --with-debug. For this report, I built the server as follows:

CC=gcc-3.4 CXX=gcc-3.4 CFLAGS='-march=pentium-m -g' ./configure
make clean
make

Suggested fix:
Do not use uninitialized or freed resources, I guess.

Actually, it is not necessary to have any client attempting to connect to mysqld when the server
is restarted. A crash recovery of "ibtest7a" is enough to trigger the crash. It does not matter if the table used by "ibtest7a" ("alex1") is in the old or new InnoDB format.

Here's a more descriptive stack trace, after recompiling mysqld.cc and sql_cache.cc with -g and without -O3 (CXXFLAGS was not overridden to include -g in my ./configure line):

#0  0x08248928 in Query_cache_block::headers_len (this=0x37)
    at sql_cache.cc:407
#1  0x08248919 in Query_cache_block::data (this=0x37) at sql_cache.cc:413
#2  0x08248999 in Query_cache_block::query (this=0x37) at sql_cache.cc:422
#3  0x082447c1 in query_cache_abort (net=0x89b1ecc) at sql_cache.cc:641
#4  0x0816dbfb in my_message_sql (error=1146, 
    str=0xbfffdf20 "Table 'mysql.time_zone_leap_second' doesn't exist", 
    MyFlags=0) at mysqld.cc:2205
#5  0x08405fca in my_error (nr=1146, MyFlags=55) at my_error.c:69
#6  0x081a8837 in openfrm () at sql_cache.cc:384
#7  0x081a5e42 in open_unireg_entry () at sql_cache.cc:384
#8  0x081a7228 in open_table () at sql_cache.cc:384
#9  0x081a79c5 in open_tables () at sql_cache.cc:384
#10 0x08266cd3 in my_tz_init ()
#11 0x0816ed8d in main (argc=1, argv=0xbffffc14) at mysqld.cc:3043

The 0x37 pointer comes from net->query_cache_query in query_cache_abort(). Here's
a complete printout of *net in that stack frame:

(gdb) p *net
$2 = {vio = 0x0, buff = 0x61f11e07 <Address 0x61f11e07 out of bounds>, 
  buff_end = 0x32383661 <Address 0x32383661 out of bounds>, 
  write_pos = 0x3636 <Address 0x3636 out of bounds>, read_pos = 0x0, fd = 0, 
  max_packet = 1425670144, max_packet_size = 495381, pkt_nr = 520568840, 
  compress_pkt_nr = 912351503, write_timeout = 909521464, read_timeout = 0, 
  retry_count = 0, fcntl = 16777216, compress = 2 '\002', 
  remain_in_buf = 1208484871, length = 1630347015, buf_length = 842544737, 
  where_b = 13878, return_status = 0x0, reading_or_writing = 0 '\0', 
  save_char = 0 '\0', no_send_ok = 0 '\0', no_send_eof = 0 '\0', 
  last_error = "\000\001\023¤\025\217\a\000\bP\a Ðaa68266", '\0' <repeats 11 times>, "\001#\216\024\016\006\000\bX\a\037haA6827\200\000\0223\bøX\230\000\000\000\000ÅV\025\017\a\000\b`\a\037\206aa68281\200\000\0223\bøX\230\000\000\000\000Û\000\025\217\a\000\bh\a\037¤aa68296", '\0' <repeats 12 times>, "à\205\025\217\a\000\bp\a\000taa68296", '\0' <repeats 12 times>, "ðR\025\217\a\000\bx\a!*aa67967", '\0' <repeats 11 times>, "\001E5\025\217\a\000\b\200\a!Haa68038", '\0' <repeats 11 times>..., sqlstate = "\030\025\217\a\000\b", 
  last_errno = 1633784354, error = 54 '6', 
  query_cache_query = 0x37 <Address 0x37 out of bounds>, 
  report_error = 1 '\001', return_errno = 0 '\0'}

On a subsequent run, upon entering query_cache_abort(), net->query_cache_query == 0x36.
Here's the complete printout:

(gdb) p *net
$3 = {vio = 0x0, buff = 0xf21e0738 <Address 0xf21e0738 out of bounds>, 
  buff_end = 0x38326161 <Address 0x38326161 out of bounds>, 
  write_pos = 0x313634 <Address 0x313634 out of bounds>, read_pos = 0x0, 
  fd = 0, max_packet = 3892314112, max_packet_size = 126817609, 
  pkt_nr = 121636864, compress_pkt_nr = 1633751071, 
  write_timeout = 909391922, read_timeout = 49, retry_count = 0, fcntl = 0, 
  compress = 0 '\0', remain_in_buf = 134481807, length = 773785416, 
  buf_length = 942825825, where_b = 3225140, return_status = 0x0, 
  reading_or_writing = 0 '\0', save_char = 0 '\0', no_send_ok = 0 '\0', 
  no_send_eof = 0 '\0', 
  last_error = "\000\000\000ÿ¹\025\217\a\000\bP\a\037Laa28461", '\0' <repeats 11 times>, "\001\a\211\025\217\a\000\bX\a!Þaa28461", '\0' <repeats 11 times>, "\001\016å\024\016\006\000\b`\a\037\207aA2847\200\000\0223\bøX\230\000\000\000\000½\220\025\017\a\000\bh\a\037¥aa28476\200\000\0223\bøX\230\000\000\000\000Ö^\024\016\006\000\bp\a\000taA2848\200\000\0223\bøX\230\000\000\000\000½\221\025\217\a\000\bx\a\"°aa28334", '\0' <repeats 11 times>, "\001\0230\025\217\a\000\b\200\a\"Îaa27781", '\0' <repeats 11 times>..., sqlstate = "þ\025\217\a\004\b", 
  last_errno = 1633780259, error = 50 '2', 
  query_cache_query = 0x36 <Address 0x36 out of bounds>, 
  report_error = 1 '\001', return_errno = 0 '\0'}

On this occasion, thd == 0x89b17f0 in my_message_sql().

Marko,

I think in the past days there have been also other bug reports involving timezones.

I was not able to repeat this crash in hundin with ./BUILD/compile-pentium-debug

My /mysql system tables are from an about 3 months old 'make test' run. Maybe the crash depends on the system tables.

Regards,

Heikki

Hi!

I am assigning this nug report to Sanja. Maybe he can cast light on why my_message_sql() calls query_cache_abort(). Maybe a query result should not be stored in the query cache if there is an error in a query execution? But in this case this is a general error that is printed to the .err log, and there is really no query involved, I presume.

Regards,

Heikki

mysqld.cc:
   /*
      thd->lex->current_select == 0 if lex structure is not inited
      (not query command (COM_QUERY))
    */
    if (thd->lex->current_select &&
        thd->lex->current_select->no_error && !thd->is_fatal_error)
    {
      DBUG_PRINT("error", ("Error converted to warning: current_select: no_erro\
r %d  fatal_error: %d",
                           (thd->lex->current_select ?
                            thd->lex->current_select->no_error : 0),
                           (int) thd->is_fatal_error));

      push_warning(thd, MYSQL_ERROR::WARN_LEVEL_ERROR, error, str);
    }
    else
    {
      NET *net= &thd->net;
      net->report_error= 1;
#ifndef EMBEDDED_LIBRARY  /* TODO query cache in embedded library*/
      query_cache_abort(net);
#endif
      if (!net->last_error[0])                  // Return only first message
      {
        strmake(net->last_error, str, sizeof(net->last_error)-1);
        net->last_errno= error ? error : ER_UNKNOWN_ERROR;
      }
    }
  }
  if (!thd || MyFlags & ME_NOREFRESH)
    sql_print_error("%s: %s",my_progname,str); /* purecov: inspected */
  DBUG_RETURN(0);
}

Any news regarding this?
My temporary fix is to remove the frm_error() call from openfrm() in table.cc. (I could perhaps also update the mysql.* system tables, but I want to use the same set of system tables when  testing 4.0, 4.1 and 5.0.)

I'll say something as soon as I fetch this bug for processing, I can't say something aster fast look on it.

Marko,

Where can I get "ibtest7a" for testing?

Victoria,
This bug can manifest itself also without any SQL being executed. It was just a coincidence that it first occurred after InnoDB crash recovery.

To reproduce the bug, start MySQL 5.0.3-bk on system tables created with 4.0 or 4.1.
Some 5.0-specific tables will be missing then. As you can see from the stack trace, mysqld is trying to complain "Table 'mysql.time_zone_leap_second' doesn't exist" when it crashes. 

I don't know why mysqld is not always crashing at start-up when the new system tables are missing, but I'm fully sure that all recent start-up crashes of 5.0.3-bk I have seen have been caused by this bug. Perhaps it is accessing uninitialized memory, or maybe it fails to protect a critical section with a mutex.

Marko,

I wasn't able to repeat the crash either with 4.0 system tables or
4.1 system tables. Just I got the messages for missed tables.

Could you please try again with latest BK source.

Thanks

Miguel,
it doesn't crash every time. There have been days with no crash, but it did crash again this week, with fresh BK sources. As a workaround, I disabled the frm_error() call in openfrm(). I'm not sure, but it could be that mysqld has to be compiled with the certain options, or the data segments have to be placed in a certain way in memory in order for the crash to be possible. If mysqld crashes on startup due to this bug, the crash seems to be repeatable.

Marko,

Sorry I tried again and I wasn't able to repeat the behavior reported.
Could you please test again on your side.