MySQL Bugs: #25966: 2MB per second endless memory consumption after LOCK TABLE ... WRITE

Bug #25966	2MB per second endless memory consumption after LOCK TABLE ... WRITE
Submitted:	31 Jan 2007 7:03	Modified:	16 Mar 2007 17:52
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Locking	Severity:	S1 (Critical)
Version:	<=5.0.36, 5.1	OS:	Windows (windows, linux)
Assigned to:	Dmitry Lenev	CPU Architecture:	Any
Tags:	bfsm_2007_02_15, bfsm_2007_03_01, lock table write, memory leak

Description:
there is a serious memory hog when a connection waits in
"Locked" state after issuing LOCK TABLE <table> WRITE;

On my PC, memory increased at a rate of 2MB/s and here is the processlist
during that time.  The other queries began to fail with out of memory errors:

mysql> show processlist;
+---------+------+------+------+---------+------+----------------+-----------------------------
| Id      | User | Host | db   | Command | Time | State          | Info
+---------+------+------+------+---------+------+----------------+-----------------------------
| 1291852 | root |      | test | Sleep   |    8 |                | NULL
| 1299724 | root |      | test | Query   |  228 | Locked         | lock table table136 write
| 1300134 | root |      | test | Query   |  414 | System lock    | select * from table136 order
| 1300159 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300192 | root |      | test | Query   |  377 | Locked         | INSERT INTO `table136` SET `
| 1300198 | root |      | test | Query   |  366 | Locked         | select * from table136 order
| 1300202 | root |      | test | Query   |  409 | Locked         | select * from table136 order
| 1300204 | root |      | test | Query   |  260 | NULL           | select * from table136 order
| 1300219 | root |      | test | Query   |  330 | Locked         | INSERT INTO `table136` SET `
| 1300220 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300222 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300223 | root |      | test | Query   |  265 | Locked         | select * from table136 order
| 1300226 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300257 | root |      | test | Query   |    0 | NULL           | set global thread_cache_size
| 1300288 | root |      | test | Query   |  415 | updating       | delete from table136 limit 1
| 1300299 | root |      | test | Query   |  328 | Locked         | select * from table136 order
| 1300303 | root |      | test | Query   |  330 | Locked         | select * from table136 order
| 1300312 | root |      | test | Query   |  415 | Locked         | INSERT INTO `table136` SET `
| 1300316 | root |      | test | Query   |  414 | Locked         | INSERT INTO `table136` SET `
| 1300317 | root |      | test | Query   |  414 | Locked         | INSERT INTO `table136` SET `
| 1300318 | root |      | test | Query   |  330 | Opening tables | INSERT INTO `table136` SET `
| 1300323 | root |      | test | Query   |  309 | Locked         | select * from table136 order
| 1300324 | root |      | test | Query   |    0 | NULL           | show processlist
+---------+------+------+------+---------+------+----------------+-----------------------------
23 rows in set (4.25 sec)

 query failed (5) - Out of memory (Needed 1240928 bytes): select * from table136 order
 query failed (5) - Out of memory (Needed 1224600 bytes): select * from table136 order
 query failed (5) - Out of memory (Needed 1224600 bytes): INSERT INTO `table136` SET `c

How to repeat:
Simply have a few INSERTS and SELECT to a table t1.
Then open another connection and run LOCK TABLE t1 WRITE.
Now watch memory.

I'll upload a testcase soon, after I create one.

Suggested fix:
..

now I find it harder to repeat each time. However it is a bug, as I've encountered it a few times in the past.

A clue about the memory consumption:

When the problem happened, in the processlist there was 1 SELECT query whose state was continuously changing between values of 'Locked', 'NULL', and 'Opening tables'.

So I think it was that query waiting for release of write-lock, allocating memory in a loop probably.

verified testcase for 5.0BK on windows + linux to cause 1 - 3MB/s memory leak
-----------------------------------------------------------------------------

Note: the merge isn't needed, as we've seen this without merge. However, I found it easier to reproduce with the merge in place.

#connection 1
drop table if exists t1;
drop table if exists m1;
create table t1(id int) engine=myisam;
create table m1(id int) engine=merge union=(t1) insert_method=last;

#connection 2
lock table t1 write;

#connection 1
insert into m1 values (1);

#connection 2
repair table t1;

#now watch memory using task manager or top command.
also, check processlist of mysqld, the state of the insert query will be changing from Opening tables to NULL to Locked....

Additional information:

I reproduced this bug on the following versions (didn't have them all installed locally so couldn't test every one)

at least 5.1.11, 5.0.36, 5.0.27, 5.0.26, 5.0.24a, 5.0.22, 5.0.13-rc are affected by this memory leak bug. 4.1.21 didn't leak memory, but the cpu was at 100%

5.0.6 doesn't run out of memory, instead it acts like 4.1.21 and just floors the CPU.

some debug info. repeated on small tables, using debug build

Attachment: small_repeated_20070304.txt (text/plain), 154.04 KiB.

dmitri,

i caught debug build in debugger.  but it didn't leak memory. only floored the cpu in the same code as release binary leaked memory. maybe it's a safemalloc thing - but the problem still shows.

the code ends up looping in that for(;;) loop in open_and_lock_tables()
for the one thread.  See attached as much debug info I could get.

I saved a 'core file' using visual studio - hopefully you have access to a windows machine to study it.  I repeated this once in 4 hours with release build, and once in 2 two hours with debug build.  i will upload the corefile which contains all the local variables for each thread.

Further investigation shows that the problem which was originally reported occurs in situations when thread which serviced some connection to which KILL command was applied (or its internal equivalent) is reused for servicing other connection (via thread cache) and this new connection has to wait for some table lock.

One possible scenario (which probably happens in customer's case) is when thread which provides binlog dump to replication slave is implicitly/automatically killed when the same slave reconnects and starts pulling data through different thread/connection.

Possible work around for this problem is to disable thread cache.

Since the problem that was originally reported (one that involves explicit or implicit KILL + thread cache) and the problem demonstrated by the test case involving REPAIR, LOCK TABLES and merge table, allbeit show different symptoms, have different root causes, I've reported the latter separately as bug #26867.

Please use http://bugs.mysql.com/bug.php?id=26867 for its further discussion and to track its progress.

we have a confirmation that disabling thread_cache_size makes the testcase no longer reproducible on one server.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21850

ChangeSet@1.2479, 2007-03-14 00:35:18+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
   TABLE ... WRITE".
  
  When thread which serviced some connection to which KILL command was applied
  (or its internal equivalent) was reused for servicing some other connection
  (via thread cache), this new connection started to hog CPU and memory once
  it had to wait for some table lock. One possible scenario which exposed this
  problem was when thread which provided binlog dump to replication slave was
  implicitly/automatically killed when the same slave reconnected and started
  pulling data through different thread/connection. This problem also occured
  if one simply killed particular query instead of whole connection and when
  connection in which query was killed had to wait for some table lock.
  
  This problem was caused by the fact that THD::mysys_var::abort member, which
  is actually is not member of THD but is a sort of thread-specific variable 
  and which indicates that waiting operations on mysys layer should be aborted
  (this includes waiting for table locks), was set by kill operation but was
  never reset back. So this value was "inherited" by the following statements
  or even other connections (which reused the same physical thread). Indeed
  having this flag set without actual kill operation (and thus without setting
  THD::killed flag) broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21975

ChangeSet@1.2479, 2007-03-15 10:55:03+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  Memory and CPU hogging occured when connection which had to wait for table
  lock was serviced by thread which previously serviced connection that was
  killed (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  The problem also occured when one killed particular query in connection
  (using KILL QUERY) and later this connection had to wait for some table
  lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21977

ChangeSet@1.2618, 2007-03-15 11:30:17+03:00, dlenev@mockturtle.local +1 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  CPU hogging occured when connection which had to wait for table lock was
  serviced by thread which previously serviced connection that was killed
  (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  In 5.* versions memory hogging was added to CPU hogging. Moreover in
  those versions the problem also occured when one killed particular query
  in connection (using KILL QUERY) and later this connection had to wait for
  some table lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21978

ChangeSet@1.2479, 2007-03-15 11:51:35+03:00, dlenev@mockturtle.local +3 -0
  Fix for bug #25966 "2MB per second endless memory consumption after LOCK
  TABLE ... WRITE".
  
  Memory and CPU hogging occured when connection which had to wait for table
  lock was serviced by thread which previously serviced connection that was
  killed (note that connections can reuse threads if thread cache is enabled).
  One possible scenario which exposed this problem was when thread which
  provided binlog dump to replication slave was implicitly/automatically
  killed when the same slave reconnected and started pulling data through
  different thread/connection.
  The problem also occured when one killed particular query in connection
  (using KILL QUERY) and later this connection had to wait for some table
  lock.
  
  This problem was caused by the fact that thread-specific mysys_var::abort
  variable, which indicates that waiting operations on mysys layer should
  be aborted (this includes waiting for table locks), was set by kill
  operation but was never reset back. So this value was "inherited" by the
  following statements or even other connections (which reused the same
  physical thread). Such discrepancy between this variable and THD::killed
  flag broke logic on SQL-layer and caused CPU and memory hogging.
  
  This patch tries to fix this problem by properly resetting this member.
  
  There is no test-case associated with this patch since it is hard to test
  for memory/CPU hogging conditions in our test-suite.

Fix pushed into trees marked as 5.0.38, 5.1.17.

It was not pushed into 4.1 yet.

Noted in 5.0.38, 5.1.17 changelogs.

If a thread previously serviced a connection that was killed,
excessive memory and CPU use by the thread occurred if it later
serviced a connection that had to wait for a table lock.

Resetting report to Patch queued pending the push to 4.1.x.

Fix was pushed into tree tagged as 4.1.23

Noted in 4.1.23 changelog.