MySQL Bugs: #42784: archive tables cause 100% cpu usage and hang in show table status

Bug #42784	archive tables cause 100% cpu usage and hang in show table status
Submitted:	12 Feb 2009 10:05	Modified:	6 Dec 2011 0:51
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Archive storage engine	Severity:	S3 (Non-critical)
Version:	5.0.88-debug, 5.1.31-debug, 5.1.42-debug	OS:	Any (Windows, Mac OS X)
Assigned to:	Sergey Vojtovich	CPU Architecture:	Any

Description:
hanged thread stack

mysqld-debug.exe!my_pread(int Filedes=-1  Line 60
mysqld-debug.exe!azflush  Line 627
mysqld-debug.exe!ha_archive::info(unsigned int flag=84)  Line 1487
mysqld-debug.exe!get_schema_tables_record
mysqld-debug.exe!get_all_tables
mysqld-debug.exe!get_schema_tables_result
mysqld-debug.exe!JOIN::exec
mysqld-debug.exe!mysql_select
mysqld-debug.exe!handle_select
mysqld-debug.exe!execute_sqlcom_select
mysqld-debug.exe!mysql_execute_command
mysqld-debug.exe!mysql_parse
mysqld-debug.exe!dispatch_command
mysqld-debug.exe!do_command
mysqld-debug.exe!handle_one_connection
mysqld-debug.exe!pthread_start
mysqld-debug.exe!_callthreadstart
mysqld-debug.exe!_threadstart

How to repeat:
no certain testcase yet.
have some archive tables.
run combination of show table status and flush tables?

the bug is here:

pthread_mutex_lock(&my_file_info[Filedes].mutex);

Filedes is -1 so we have pthread function waiting (or crashing?) due
to invalid parameter...

Thank you for the report.

I tried with generic test, but could not repeat lock, although CPU usage was about 90% (not 100% anyway). Please try to generate test case or at least provide more information about statements.

RQG test case which I used:

$cat bug42784.zz
$tables = {
        rows => [10000],
        engines => ['Archive'],
};

$fields = {
        types => ['int', 'varchar'],
        indexes => [undef],
        null => ['not null'],
        sign => ['unsigned'],
};

$data = {
        numbers => ['digit'],
        strings => ['english'],
};

$cat bug42784.yy 
query:
         select | hselect | flush | status ;

select:
          SELECT  * from _table where _field=digit;

hselect:
          SELECT HIGH_PRIORITY COUNT(*) FROM _table;

flush:
         FLUSH TABLES ;

status:
          SHOW TABLE STATUS ;

sveta, add some inserts to the testcase.  btw, in 5.1.38 archive tables are totally broken anyway, so you should be getting crashes/hangs quickly

the testcase in bug #47012 (bug47012.c) can repeat this hangup too.

Verified just as described with recent 5.1.42-debug on Mac OS X.

5.0.88 is also affected. I see the following in the error log:

091114 18:48:48 [Note] /Users/openxs/dbs/5.0/libexec/mysqld: ready for connections.
Version: '5.0.88-debug'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution
mysqld(11040,0xb0441000) malloc: *** mmap(size=2779099136) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: Out of memory at line 51, 'sql_string.cc'
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: needed 2779096488 byte (2713962k), memory in use: 9172385 bytes (8958k)
mysqld(11040,0xb04d7000) malloc: *** mmap(size=3876610048) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: Out of memory at line 51, 'sql_string.cc'
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: needed 3876608928 byte (3785751k), memory in use: 2074027253 bytes (2025418k)
mysqld(11040,0xb04d7000) malloc: *** mmap(size=2064842752) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: Out of memory at line 51, 'sql_string.cc'
091114 18:48:49 [ERROR] /Users/openxs/dbs/5.0/libexec/mysqld: needed 2064839056 byte (2016445k), memory in use: 2074256967 bytes (2025642k)
Warning: Found locks from different threads in write: after waking up waiters
Warning: Found locks from different threads in write: after waking up waiters
...

Non-debug binaries just show 100+% CPU usage.

I was able to reproduce the Out of memory  error and also other crashes when running the testcase bug47102.c but I couldn't get the stack trace in the description section.

Can you please tell me how to get the stack trace(in description)? 

mysqld-debug.exe!my_pread(int Filedes=-1  Line 60
mysqld-debug.exe!azflush  Line 627
mysqld-debug.exe!ha_archive::info(unsigned int flag=84)  Line 1487
mysqld-debug.exe!get_schema_tables_record
mysqld-debug.exe!get_all_tables
mysqld-debug.exe!get_schema_tables_result
mysqld-debug.exe!JOIN::exec
mysqld-debug.exe!mysql_select
mysqld-debug.exe!handle_select
mysqld-debug.exe!execute_sqlcom_select
mysqld-debug.exe!mysql_execute_command
mysqld-debug.exe!mysql_parse
mysqld-debug.exe!dispatch_command
mysqld-debug.exe!do_command
mysqld-debug.exe!handle_one_connection
mysqld-debug.exe!pthread_start
mysqld-debug.exe!_callthreadstart
mysqld-debug.exe!_threadstart

Satya, just opening a archive table in 5.1.41 is causing corruptions of memory. what happens after that (assuming crash doesn't happen) is more or less random.
I'll upload you a single .sql file.  Use a debug build on windows, or valgrind on linux.  Step 1 to fixing archive tables will be to get this stack trace, and fix it:

mysqld-debug.exe!_free_dbg_nolock()[dbgheap.c:1367]
mysqld-debug.exe!_free_dbg()[dbgheap.c:1220]
mysqld-debug.exe!free()[dbgheap.c:1179]
mysqld-debug.exe!my_no_flags_free()[my_malloc.c:60]
mysqld-debug.exe!ha_archive::free_share()[ha_archive.cc:413]
mysqld-debug.exe!ha_archive::close()[ha_archive.cc:561]
mysqld-debug.exe!closefrm()[table.cc:1971]
mysqld-debug.exe!intern_close_table()[sql_base.cc:783]
mysqld-debug.exe!free_cache_entry()[sql_base.cc:805]
mysqld-debug.exe!my_hash_delete()[hash.c:547]
mysqld-debug.exe!close_open_tables()[sql_base.cc:1204]
mysqld-debug.exe!close_thread_tables()[sql_base.cc:1353]
mysqld-debug.exe!close_tables_for_reopen()[sql_base.cc:5456]
mysqld-debug.exe!get_all_tables()[sql_show.cc:3490]
mysqld-debug.exe!get_schema_tables_result()[sql_show.cc:6135]
mysqld-debug.exe!JOIN::exec()[sql_select.cc:1766]
mysqld-debug.exe!mysql_select()[sql_select.cc:2437]
mysqld-debug.exe!handle_select()[sql_select.cc:269]
mysqld-debug.exe!execute_sqlcom_select()[sql_parse.cc:5051]
mysqld-debug.exe!mysql_execute_command()[sql_parse.cc:2246]
mysqld-debug.exe!mysql_parse()[sql_parse.cc:5974]
mysqld-debug.exe!dispatch_command()[sql_parse.cc:1233]
mysqld-debug.exe!do_command()[sql_parse.cc:872]
mysqld-debug.exe!handle_one_connection()[sql_connect.cc:1127]
mysqld-debug.exe!pthread_start()[my_winthread.c:85]
mysqld-debug.exe!_callthreadstart()[thread.c:295]
mysqld-debug.exe!_threadstart()[thread.c:277]
kernel32.dll!BaseThreadStart()
ariables.
 invalid and cause the dump to abort...
000444B1E8=select * from information_schema.tables

load tables into db. run "select * from information_schema.tables" on a debug build. observe memory corruptions

Attachment: tables.zip (application/zip, text), 102.67 KiB.

unzip the tables.sql and import using mysql -uroot --force test <tables.sql
some tables will error out. this is okay.  i think there's bug when many tables
need to be opened and closed. so you could try a tiny table_open_cache, or hundreds of archive tables.  valgrind will detect system calls such as pread getting -1 fd for example.

Shane,

I have tried your query(after loading tables using tables.sql) in following combinations on ubuntu 9.04.

5.0-bugteam, 5.1-bugteam, 5.1 (main) ( valgrind and non valgrind builds) and I tried to run the server with valgrind and without valgrind and I couldn't see any errors or crash under valgrind
I don't have windows setup ready and couldn't try on it :(

I am using these scripts ./BUILD/compile-pentium-valgrind-max-no-ndb and ./BUILD/compile-pentium-debug-max-no-ndb.

BTW, I have set table_open_cache to 4 in all scenarios. Can you please try on any linux machine and let me know the mysql branch used to reproduce?

more problems/details about original bug47012.c testcase.

Attachment: bug47012_5.1.43-debug_multiple_problems.txt (text/plain), 11.56 KiB.

Shane,

I was able to reproduce the bug finally :). Thanks for all the help.

I was not running the bug47102 testcase when running the "show table status" query. So when I ran bug47102 testcase along with the 'show table status' query I was able to reproduce.

Problems found.

1. crashes at ha_archive::check , ha_archive::real_write_row, ha_archive::optimize

2. Some of the tables are corrupted after starting the server from crash.

3. Incorrect key file for table 'table0003'; try to repair it) (HY000) (same 2 ?)

4. Out of memory errors.

5. this bug, fd being negative (Assertion `fd >= 0' failed.)

s/bug47102/bug47012/ in the above comment

The problem seems to be in ha_archive::optimize() where the locks are not taken when we re-open the file.

The draft patch below solves the fd issue but the deadlock problem still remains. The locking is not yet proper in the below draft patch.

For 5.1
=======

=== modified file 'storage/archive/ha_archive.cc'
--- storage/archive/ha_archive.cc	2009-11-11 08:03:29 +0000
+++ storage/archive/ha_archive.cc	2010-01-04 04:57:47 +0000
@@ -1293,6 +1293,7 @@
 
   init_archive_reader();
 
+  pthread_mutex_lock(&share->mutex);
   // now we close both our writer and our reader for the rename
   if (share->archive_write_open)
   {
@@ -1305,7 +1306,10 @@
             MY_REPLACE_EXT | MY_UNPACK_FILENAME);
 
   if (!(azopen(&writer, writer_filename, O_CREAT|O_RDWR|O_BINARY)))
+  {
+    pthread_mutex_unlock(&share->mutex);
     DBUG_RETURN(HA_ERR_CRASHED_ON_USAGE); 
+  }
 
   /* 
     An extended rebuild is a lot more effort. We open up each row and re-record it. 
@@ -1377,11 +1381,13 @@
   rc = my_rename(writer_filename,share->data_file_name,MYF(0));
 
 
+  pthread_mutex_unlock(&share->mutex);
   DBUG_RETURN(rc);
 error:
   DBUG_PRINT("ha_archive", ("Failed to recover, error was %d", rc));
   azclose(&writer);
 
+  pthread_mutex_unlock(&share->mutex);
   DBUG_RETURN(rc); 
 }

Note:The above suggested patch might be a risky one.
Looking for reviewers for this bug.

here is a stress test. still causes valgrind errors on trunk and 2GB mallocs on 5.5.17

Attachment: bug47012.c (text/plain), 146.64 KiB.

Noted in 5.1.61, 5.5.20, 5.6.4 changelogs.

Concurrent access to ARCHIVE tables could cause corruption.