Bug #25966 2MB per second endless memory consumption after LOCK TABLE ... WRITE
Submitted: 31 Jan 2007 7:03 Modified: 16 Mar 2007 17:52
Reporter: Shane Bester (Platinum Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Locking Severity:S1 (Critical)
Version:<=5.0.36, 5.1 OS:Microsoft Windows (windows, linux)
Assigned to: Dmitry Lenev
Tags: bfsm_2007_02_15, bfsm_2007_03_01, lock table write, memory leak

[31 Jan 2007 7:03] Shane Bester
Description:
there is a serious memory hog when a connection waits in
"Locked" state after issuing LOCK TABLE <table> WRITE;

On my PC, memory increased at a rate of 2MB/s and here is the processlist
during that time.  The other queries began to fail with out of memory errors:

mysql> show processlist;
+---------+------+------+------+---------+------+----------------+-----------------------------
| Id      | User | Host | db   | Command | Time | State          | Info
+---------+------+------+------+---------+------+----------------+-----------------------------
| 1291852 | root |      | test | Sleep   |    8 |                | NULL
| 1299724 | root |      | test | Query   |  228 | Locked         | lock table table136 write
| 1300134 | root |      | test | Query   |  414 | System lock    | select * from table136 order
| 1300159 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300192 | root |      | test | Query   |  377 | Locked         | INSERT INTO `table136` SET `
| 1300198 | root |      | test | Query   |  366 | Locked         | select * from table136 order
| 1300202 | root |      | test | Query   |  409 | Locked         | select * from table136 order
| 1300204 | root |      | test | Query   |  260 | NULL           | select * from table136 order
| 1300219 | root |      | test | Query   |  330 | Locked         | INSERT INTO `table136` SET `
| 1300220 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300222 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300223 | root |      | test | Query   |  265 | Locked         | select * from table136 order
| 1300226 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300257 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300288 | root |      | test | Query   |  415 | updating       | delete from table136 limit 1
| 1300299 | root |      | test | Query   |  328 | Locked         | select * from table136 order
| 1300303 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300312 | root |      | test | Query   |  415 | Locked         | INSERT INTO `table136` SET `
| 1300316 | root |      | test | Query   |  414 | Locked         | INSERT INTO `table136` SET `
| 1300317 | root |      | test | Query   |  414 | Locked         | INSERT INTO `table136` SET `
| 1300318 | root |      | test | Query   |  330 | Opening tables | INSERT INTO `table136` SET `
| 1300323 | root |      | test | Query   |  309 | Locked         | select * from table136 order
| 1300324 | root |      | test | Query   |    0 | NULL           | show processlist
+---------+------+------+------+---------+------+----------------+-----------------------------
23 rows in set (4.25 sec)

 query failed (5) - Out of memory (Needed 1240928 bytes): select * from table136 order
 query failed (5) - Out of memory (Needed 1224600 bytes): select * from table136 order
 query failed (5) - Out of memory (Needed 1224600 bytes): INSERT INTO `table136` SET `c

How to repeat:
Simply have a few INSERTS and SELECT to a table t1.
Then open another connection and run LOCK TABLE t1 WRITE.
Now watch memory.

I'll upload a testcase soon, after I create one.

Suggested fix:
..
[31 Jan 2007 8:29] Shane Bester
now I find it harder to repeat each time. However it is a bug, as I've encountered it a few times in the past.

A clue about the memory consumption:

When the problem happened, in the processlist there was 1 SELECT query whose state was continuously changing between values of 'Locked', 'NULL', and 'Opening tables'.

So I think it was that query waiting for release of write-lock, allocating memory in a loop probably.
[11 Feb 2007 18:14] Shane Bester
verified testcase for 5.0BK on windows + linux to cause 1 - 3MB/s memory leak
-----------------------------------------------------------------------------

Note: the merge isn't needed, as we've seen this without merge. However, I found it easier to reproduce with the merge in place.

#connection 1
drop table if exists t1;
drop table if exists m1;
create table t1(id int) engine=myisam;
create table m1(id int) engine=merge union=(t1) insert_method=last;

#connection 2
lock table t1 write;

#connection 1
insert into m1 values (1);

#connection 2
repair table t1;

#now watch memory using task manager or top command.
also, check processlist of mysqld, the state of the insert query will be changing from Opening tables to NULL to Locked....
[11 Feb 2007 21:15] Shane Bester
Additional information:

I reproduced this bug on the following versions (didn't have them all installed locally so couldn't test every one)

at least 5.1.11, 5.0.36, 5.0.27, 5.0.26, 5.0.24a, 5.0.22, 5.0.13-rc are affected by this memory leak bug. 4.1.21 didn't leak memory, but the cpu was at 100%
[13 Feb 2007 6:43] Shane Bester
5.0.6 doesn't run out of memory, instead it acts like 4.1.21 and just floors the CPU.
[4 Mar 2007 22:50] Shane Bester
some debug info. repeated on small tables, using debug build

Attachment: small_repeated_20070304.txt (text/plain), 154.04 KiB.

[4 Mar 2007 22:58] Shane Bester
dmitri,

i caught debug build in debugger.  but it didn't leak memory. only floored the cpu in the same code as release binary leaked memory. maybe it's a safemalloc thing - but the problem still shows.

the code ends up looping in that for(;;) loop in open_and_lock_tables()
for the one thread.  See attached as much debug info I could get.

I saved a 'core file' using visual studio - hopefully you have access to a windows machine to study it.  I repeated this once in 4 hours with release build, and once in 2 two hours with debug build.  i will upload the corefile which contains all the local variables for each thread.
[5 Mar 2007 17:14] Dmitry Lenev
Further investigation shows that the problem which was originally reported occurs in situations when thread which serviced some connection to which KILL command was applied (or its internal equivalent) is reused for servicing other connection (via thread cache) and this new connection has to wait for some table lock.

One possible scenario (which probably happens in customer's case) is when thread which provides binlog dump to replication slave is implicitly/automatically killed when the same slave reconnects and starts pulling data through different thread/connection.

Possible work around for this problem is to disable thread cache.
[6 Mar 2007 10:38] Dmitry Lenev
Since the problem that was originally reported (one that involves explicit or implicit KILL + thread cache) and the problem demonstrated by the test case involving REPAIR, LOCK TABLES and merge table, allbeit show different symptoms, have different root causes, I've reported the latter separately as bug #26867.

Please use http://bugs.mysql.com/bug.php?id=26867 for its further discussion and to track its progress.
[7 Mar 2007 7:04] Shane Bester
we have a confirmation that disabling thread_cache_size makes the testcase no longer reproducible on one server.
[13 Mar 2007 21:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21850

ChangeSet@1.2479, 2007-03-14 00:35:18+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
   TABLE ... WRITE".
  
  When thread which serviced some connection to which KILL command was applied
  (or its internal equivalent) was reused for servicing some other connection
  (via thread cache), this new connection started to hog CPU and memory once
  it had to wait for some table lock. One possible scenario which exposed this
  problem was when thread which provided binlog dump to replication slave was
  implicitly/automatically killed when the same slave reconnected and started
  pulling data through different thread/connection. This problem also occured
  if one simply killed particular query instead of whole connection and when
  connection in which query was killed had to wait for some table lock.
  
  This problem was caused by the fact that THD::mysys_var::abort member, which
  is actually is not member of THD but is a sort of thread-specific variable 
  and which indicates that waiting operations on mysys layer should be aborted
  (this includes waiting for table locks), was set by kill operation but was
  never reset back. So this value was "inherited" by the following statements
  or even other connections (which reused the same physical thread). Indeed
  having this flag set without actual kill operation (and thus without setting
  THD::killed flag) broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.
[15 Mar 2007 7:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21975

ChangeSet@1.2479, 2007-03-15 10:55:03+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  Memory and CPU hogging occured when connection which had to wait for table
  lock was serviced by thread which previously serviced connection that was
  killed (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  The problem also occured when one killed particular query in connection
  (using KILL QUERY) and later this connection had to wait for some table
  lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.
[15 Mar 2007 8:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21977

ChangeSet@1.2618, 2007-03-15 11:30:17+03:00, dlenev@mockturtle.local +1 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  CPU hogging occured when connection which had to wait for table lock was
  serviced by thread which previously serviced connection that was killed
  (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  In 5.* versions memory hogging was added to CPU hogging. Moreover in
  those versions the problem also occured when one killed particular query
  in connection (using KILL QUERY) and later this connection had to wait for
  some table lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.
[15 Mar 2007 8:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21978

ChangeSet@1.2479, 2007-03-15 11:51:35+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  Memory and CPU hogging occured when connection which had to wait for table
  lock was serviced by thread which previously serviced connection that was
  killed (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  The problem also occured when one killed particular query in connection
  (using KILL QUERY) and later this connection had to wait for some table
  lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.
[15 Mar 2007 21:17] Dmitry Lenev
Fix pushed into trees marked as 5.0.38, 5.1.17.

It was not pushed into 4.1 yet.
[16 Mar 2007 16:54] Paul Dubois
Noted in 5.0.38, 5.1.17 changelogs.

If a thread previously serviced a connection that was killed,
excessive memory and CPU use by the thread occurred if it later
serviced a connection that had to wait for a table lock.

Resetting report to Patch queued pending the push to 4.1.x.
[16 Mar 2007 17:06] Dmitry Lenev
Fix was pushed into tree tagged as 4.1.23
[16 Mar 2007 17:52] Paul Dubois
Noted in 4.1.23 changelog.