Bug #34043 Server loops excessively in _checkchunk() when safemalloc is enabled
Submitted: 24 Jan 2008 20:47 Modified: 21 Aug 2010 1:25
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: General Severity:S1 (Critical)
Version:5.1-BK OS:Any
Assigned to: Davi Arnaut CPU Architecture:Any

[24 Jan 2008 20:47] Philip Stoev
Description:
When running the iuds1 systems test with 200 concurrent users, mysqld hangs and refuses further connections. CPU usage remains at 95-102% (1 core occupied out of 8 cores physically present).

I will post more information as it becomes available.

How to repeat:
Run the iuds1 test scenario as described at:

https://inside.mysql.com/wiki/SystemQATestPlan
[24 Jan 2008 21:01] Philip Stoev
Thread stacks for bug #34043 with various levels of verbosity

Attachment: bug34043-stacks.txt (text/plain), 282.34 KiB.

[24 Jan 2008 21:08] Philip Stoev
The thread appears to be cycling in that region:

515       for (irem= sf_malloc_root; irem != NULL && count-- ; irem= irem->next)
516         flag+= _checkchunk (irem, filename, lineno);

Plenty of free memory is available. Error log contains:

Warning: found too many locks at write_wait: enter write_lock
repeated about a dozen times.
[25 Jan 2008 14:46] Philip Stoev
I am sorry, but the the core and the datadir were not preserved. Please let me know if they are needed and I will rerun the test case until I get new ones.
[28 Jan 2008 21:07] Guilhem Bichot
The stack traces mention MERGE code and server (sql/) code, this does not look Maria-specific.
Serg also think this does not look Maria-specific.
I suggest re-testing with MyISAM instead.
[30 Jan 2008 16:56] Philip Stoev
This issue is present when using MyISAM against the latest 5.1 BitKeeper treee. 

The core and the data directory from a previous run are available at:

dl360-g5-a:/data1/maria/systest_vardir-bug34043

The directory includes the binary that was used, along with a dump of all threads (bug34043.threads).

Here is the thread that is cycling from the latest run:

[Switching to thread 31 (Thread 1167137088 (LWP 23396))]#0  0x000000000096c678 in _sanity (filename=0xb4a970 "array.c", lineno=281) at safemalloc.c:490
490       for (irem= sf_malloc_root; irem != NULL && count-- ; irem= irem->next)
(gdb) bt
#0  0x000000000096c678 in _sanity (filename=0xb4a970 "array.c", lineno=281) at safemalloc.c:490
#1  0x000000000096be7c in _myfree (ptr=0x1d8ee598, filename=0xb4a970 "array.c", lineno=281, myflags=16) at safemalloc.c:264
#2  0x0000000000978de9 in delete_dynamic (array=0x1d9d19a0) at array.c:281
#3  0x00000000007f85f5 in sp_pcontext::destroy (this=0x1d9d1918) at sp_pcontext.cc:129
#4  0x00000000007f56d0 in sp_head::destroy (this=0x1d9d1558) at sp_head.cc:756
#5  0x00000000007f5830 in ~sp_head (this=0x1d9d1558) at sp_head.cc:738
#6  0x0000000000801b22 in ~Table_triggers_list (this=0x1dd3df18) at sql_trigger.cc:978
#7  0x000000000068f877 in intern_close_table (table=0x1caca9d8) at sql_base.cc:852
#8  0x000000000068f924 in free_cache_entry (table=0x1caca9d8) at sql_base.cc:876
#9  0x0000000000977ff9 in hash_delete (hash=0xf58ca0, record=0x1caca9d8 "xЖё\033") at hash.c:529
#10 0x000000000068e6a2 in close_open_tables (thd=0x2aaaac29b388) at sql_base.cc:1254
#11 0x000000000068eab9 in close_thread_tables (thd=0x2aaaac29b388) at sql_base.cc:1398
#12 0x00000000007f11ff in sp_lex_keeper::reset_lex_and_exec_core (this=0x1e2e22f0, thd=0x2aaaac29b388, nextp=0x4590f398, open_tables=false, instr=0x1e2e22b0)
    at sp_head.cc:2703
#13 0x00000000007f1862 in sp_instr_stmt::execute (this=0x1e2e22b0, thd=0x2aaaac29b388, nextp=0x4590f398) at sp_head.cc:2809
#14 0x00000000007f38e4 in sp_head::execute (this=0x1e2d3698, thd=0x2aaaac29b388) at sp_head.cc:1221
#15 0x00000000007f45d3 in sp_head::execute_procedure (this=0x1e2d3698, thd=0x2aaaac29b388, args=0x2aaaac29d488) at sp_head.cc:1945
#16 0x000000000064c1ff in mysql_execute_command (thd=0x2aaaac29b388) at sql_parse.cc:4090
#17 0x000000000064e27f in mysql_parse (thd=0x2aaaac29b388, inBuf=0x1e1bac08 "CALL ins_tb0_eng1 (@tmp_num, @tmp_word)", length=39, found_semicolon=0x45910f00)
    at sql_parse.cc:5623
#18 0x000000000064f052 in dispatch_command (command=COM_QUERY, thd=0x2aaaac29b388, packet=0x2aaaac2cc7e9 "CALL ins_tb0_eng1 (@tmp_num, @tmp_word)",
    packet_length=39) at sql_parse.cc:1121
#19 0x00000000006503ee in do_command (thd=0x2aaaac29b388) at sql_parse.cc:781
#20 0x000000000063e08f in handle_one_connection (arg=0x2aaaac29b388) at sql_connect.cc:1120
#21 0x0000003ba88062f7 in start_thread () from /lib64/libpthread.so.0
#22 0x0000003ba80ce85d in clone () from /lib64/libc.so.6
[30 Jan 2008 20:18] Konstantin Osipov
Philip, I would greatly appreciate if you reduce the test case to a simple SQL script (possibly mysql-test language script).
That would save time to many people (myself, engineer, reviewer).
Thanks.
[31 Jan 2008 10:27] Philip Stoev
Test case for bug #34043

Attachment: bug34043.zip (application/x-zip-compressed, text), 65.31 KiB.

[31 Jan 2008 10:31] Philip Stoev
It turns out the issue is that mysqld spends all of its time in _checkchunk() (safemalloc is enabled due to a debug build) and thus all tables remain perpetually locked. Only a single CPU core is used, the rest are idle. While the manual does say that safemalloc is very slow, I still think that a completely hanging server is a bug of some sort.

How to repeat:

1. Compile a debug binary;

2. Download the attached archive, placing the .txt files in mysql-test and the .test files in mysql-test/t.

3. Run test as follows:

perl ./mysql-test-run.pl --stress --stress-init-file=bug_34043_init.txt \
--stress-test-file=bug_34043_run.txt --stress-test-duration=600 \
--stress-threads=200 --mysqld=--max_connections=500

You will observe some initial progress, afterwards the test output will cease. logging into the server, running show processlist or show table status or shutdown will become extremely slow operations.

If you run the test with --stress-threads=50, you will observe nice progress all the way.
[31 Jan 2008 17:06] Sergei Golubchik
Just a couple of thoughts:

It's, of course, normal for safemalloc to spend time checking memory links, and depending on the allocation pattern it can waste quite a lot of time doing it. All the effects, described in the bugreport, are, basically, expected, safemalloc using a lot of CPU time in _checkchunk() is certainly not a bug.

On the other hand, _checkchunk() would need a lot of time only when there are a lot (and I mean, A LOT - a really huge number) of small memory blocks allocated. This could often mean a problem in a code - for example, mem_root with badly chosen default parameters (too small block size, resulting in a huge number of blocks). This needs to be investigated (mysqladmin debug should help, probably)
[31 Jan 2008 17:16] Philip Stoev
Whatever it is, the performance hit is pretty severe -- only 1 core is ever used. If it is impossible to remedy this, then we should ship the debug binary with safemalloc disabled, otherwise we will run into trouble with customers who run debug binaries for any reason.

I am also concerned that the data structure being walked by _checkchunk() gets corrupt when more than a certain number of threads are accessing it.
[23 Apr 2008 20:05] Philip Stoev
6.0 is also affected.

Can we somehow increase the priority of this. If this situation occurs, complete transaction serialization results -- only a single thread, which holds a lock, loops in _checkchunk() and no other threads are able to move forward.
[1 Jul 2010 15:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112691

3085 Davi Arnaut	2010-07-01
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Essentially, the problem is that safemalloc is excruciatingly
      slow as it checks all allocated blocks for overrun at each
      memory management primitive, yielding a almost exponential
      slowdown for the memory management functions (malloc, realloc,
      free). The overrun check basically consists of verifying some
      bytes of a block for certain magic keys, which catches some
      simple forms of overrun. Another minor problem is violation
      of aliasing rules and that its own internal list of blocks
      is prone to corruption.
      
      Given the magnitude of memory debuggers available nowadays,
      especially those that are provided with the platform malloc
      implementation, maintenance of a in-house and largely obsolete
      memory debugger becomes a burden that is not worth the effort
      due to its slowness and lack of support for detecting more
      common forms of heap corruption.
      
      Since there are third-party tools (such as valgrind, malloc
      debug, etc) that can provide the same functionality at a
      lower or comparable performance cost, the solution is to
      simply remove safemalloc.
      
      The removal of safemalloc also allows a simplification of the
      malloc wrappers, removing quite a bit of kludge: redefinition
      of my_malloc, my_free and the removal of the unused second
      argument of my_free. Since my_free always check whether the
      supplied pointer is null, redudant checks are also removed.
     @ client/mysqldump.c
        Pass my_free directly as its signature is compatible with the
        callback type -- which wasn't the case for free_table_ent.
[8 Jul 2010 21:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/113178

3102 Davi Arnaut	2010-07-08
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Essentially, the problem is that safemalloc is excruciatingly
      slow as it checks all allocated blocks for overrun at each
      memory management primitive, yielding a almost exponential
      slowdown for the memory management functions (malloc, realloc,
      free). The overrun check basically consists of verifying some
      bytes of a block for certain magic keys, which catches some
      simple forms of overrun. Another minor problem is violation
      of aliasing rules and that its own internal list of blocks
      is prone to corruption.
      
      Another issue with safemalloc is rather the maintenance cost
      as the tool has a significant impact on the server code.
      Given the magnitude of memory debuggers available nowadays,
      especially those that are provided with the platform malloc
      implementation, maintenance of a in-house and largely obsolete
      memory debugger becomes a burden that is not worth the effort
      due to its slowness and lack of support for detecting more
      common forms of heap corruption.
      
      Since there are third-party tools that can provide the same
      functionality at a lower or comparable performance cost, the
      solution is to simply remove safemalloc. Third-party tools
      can provide the same functionality at a lower or comparable
      performance cost. 
      
      The removal of safemalloc also allows a simplification of the
      malloc wrappers, removing quite a bit of kludge: redefinition
      of my_malloc, my_free and the removal of the unused second
      argument of my_free. Since free() always check whether the
      supplied pointer is null, redudant checks are also removed.
      
      Also, this patch adds unit testing for my_malloc and moves
      my_realloc implementation into the same file as the other
      memory allocation primitives.
     @ client/mysqldump.c
        Pass my_free directly as its signature is compatible with the
        callback type -- which wasn't the case for free_table_ent.
[8 Jul 2010 21:45] Davi Arnaut
Queued to mysql-trunk
[8 Jul 2010 21:48] Davi Arnaut
Alternatives to safemalloc include:

* mtrace - catches memory leaks and is very fast
* mudflap - catches memory leaks and is a bit slower.
* Valgrind
* malloc debug

A summary of memory debuggers are found here  http://en.wikipedia.org/wiki/Memory_debugger
http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/malloc...
http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man3/libgma...
http://developer.apple.com/mac/library/technotes/tn2004/tn2124.html
[9 Jul 2010 1:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/113192

3103 Davi Arnaut	2010-07-08
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Post-merge fix: cast argument and correct type in assignment.
[9 Jul 2010 11:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/113230

3105 Davi Arnaut	2010-07-09
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Post-merge fix: remove reference to file that is now gone.
[17 Jul 2010 14:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/113812

3127 Davi Arnaut	2010-07-17
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Post-merge fix: remove leftovers from safemalloc removal.
[23 Jul 2010 12:25] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:32] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[1 Aug 2010 22:36] Paul DuBois
Noted in 5.5.6 changelog.

safemalloc was excessively slow under certain conditions and has been
removed.
[2 Aug 2010 0:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/114814

3161 Davi Arnaut	2010-08-01
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Post-merge fix: remove --with-debug=full, it was only used for safemalloc.
[9 Aug 2010 13:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115315

3167 Davi Arnaut	2010-08-09
      Bug#34043: Server loops excessively in _checkchunk() when safemalloc is enabled
      
      Post-merge fix: add missing comma.
[16 Aug 2010 6:29] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alik@sun.com-20100816062701-qo9dpnk5tkt1pksb) (version source revid:alik@sun.com-20100816062603-xc16eftmv7rmktyq) (merge vers: 5.6.1-m4) (pib:20)
[16 Aug 2010 6:37] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100816062819-bluwgdq8q4xysmlg) (version source revid:alik@sun.com-20100816062612-enatdwnv809iw3s9) (pib:20)
[16 Aug 2010 15:29] Paul DuBois
Noted in 5.6.1 changelog.