MySQL Bugs: #12231: chronic 'error 12's after long uptime (memory leak)

Bug #12231	chronic 'error 12's after long uptime (memory leak)
Submitted:	27 Jul 2005 21:50	Modified:	7 Feb 2008 12:55
Reporter:	Nathan Cheng	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	4.1.9,4.1.13	OS:	Linux (RedHat release 3)
Assigned to:		CPU Architecture:	Any

Description:
This is a bug that causes us to have to restart our production instance of mysqld about once a week due to the following situation: after some number of days of uptime (it started at 14 days, but has now shrunk to 5 days), "error 12" error messages of two types begin to appear in the mysqld.log file; the two types are as follows:

050726 12:44:52 [ERROR] Got error 12 when reading table './newco/io_export_message_blob'

050726 14:32:59 [ERROR] /usr/local/mysql/bin/mysqld: Got error 12 from storage engine
050726 14:32:59 [ERROR] /usr/local/mysql/bin/mysqld: Sort aborted

It may be important to note that the several tables named in the first type of error are all MyISAM tables with one BLOB column.

At first they occur only once every couple minutes, but within a day the situation worsens to the point that we get this error many times per second. When we stop the application that is connecting to the database, verify that all connections to the database have been dropped, and restart the application, the problem goes away for a couple hours, but then returns.

The error has never appeared in our dev, qa, or staging environments. Our staging environment database is a recent copy of our production database. The only explanation is that the error is directly related to volume of usage and not any single query, though perhaps it is the repetition of a single query or type of query that eventually causes the error.

The only logical conclusion is that there is a memory leak.

Here are some clues to help narrow down the source of the memory leak:

We have relatively high usage rates, e.g. right now our status is:

Uptime: 479381  Threads: 30  Questions: 409695664  Slow queries: 10777  Opens: 3446327  Flush tables: 3  Open tables: 456  Queries per second avg: 854.635

We have been using MySQL in production for 29 months; our upgrade path since October 2004 has been:

10/26/2004 4.0.20-standard-log
10/30/2004 4.1.7-standard-log
02/04/2005 4.1.9-standard-log
07/22/2005 4.1.13-standard-log

Usage of the application has been gradually increasing over this entire period, and updates to the application have occurred about weekly. The very first "error 12" ever occurred on 07/06/2005; at that point the db server had been running since 06/22/2005. Prior to the restart on 06/22/2005, the db server had been running continuously since 04/13/2005. We did not restart the server immediately after the "error 12" began occurring on 07/06, but waited until the situation became unbearable on 07/11/2005. After restarting on 07/11, and again on 07/13 (for a different reason), the errors began again on 07/18. The usage rate of the application from 04/13 to 06/22 was not significantly less than the usage rate from 06/22 to 07/06; and the usage rate from 06/22 to 07/06 was not significantly less than from 07/13 to 07/18; it follows, therefore, that this memory leak was most likely triggered by a change in the application on or about 07/01. However, looking back at our history of nearly weekly releases, the last time that there was a period of continuous running of the application sufficiently long enough to be sure that whatever query/queries is causing this problem did not exist at that time was the period from 06/10 to 06/19. So the problematic query/queries must have entered the production code base on or after 06/19, and by sometime several days before 07/06, they were definitely already in the production code base. Unfortunately the diff on our database code for the two releases that match these date constraints is 6947 lines long, with 4638 lines of new/modified code. 

However, I have gone through all of this code and found something of interest: In the new code we use two things for the first time:

1) UNION
2) nested SELECTs

Not only that, but we use UNION only once, and it is part of a sub SELECT--a sub SELECT that uses STRAIGHT_JOIN DISTINCT in the parent and STRAIGHT_JOIN in the child, no less. Also, the total number of queries that use nested SELECTs is very minimal--perhaps two or three (out of about 1500 different queries in the application).

I can provide examples of all of our nested queries if that would be helpful.

How to repeat:
Unknown.

I flushed tables and threads, and that seems to have bought me some time. The SZ being reported by ps for mysqld is now 604571. Before flushing it was 639780, and yesterday before the errors started, it was 636074. So it would follow that the error threshold is somewhere between 636074 and 639780. Are there any STATUS or VARIABLES entries that would be helpful for you to know? I have been pasting them into a log file periodically.

Hi,

Operating System Error 12 is Cannot allocate memory.  You need to adjust your configuration
settings as I'm guessing you are using 32 bit hardware and you are hitting the addressable limit.

If you are not a support customer you can get help from these community avenues:
mysql.com/IRC
forums.mysql.com
lists.mysql.com

Best Regards

Hi,

I apologize, I missed the memory leak portion of your report.

This is an important subject but we do need a repeatable test
case to proceed using this avenue.  Have you been able to 
narrow it down to a set of steps that you can see leaking 
X bytes?

Best Regards

I will see if I can do a massive repetition of the nested queries in our staging environment and recreate the problem there. The difficulty with this is that I have only seen the error begin after about 300 million db questions, so it may take a while.

I have thought that perhaps I just have bad memory settings, but that explanation doesn't make sense for the following reason: let's say the set of all queries run by my application is {q}; a "bug free" db system must then behave in the following way: as long as the memory settings are such that all queries in {q} can run once successfully without any kind of memory error, then all queries in {q} can run an infinite number of times successfully without any kind of memory error. Let's call this "the inferior memory management axiom of a bug-free db system".

Now allow me to formulate "the superior memory management axiom of a bug-free db system": if {q}_S is the set of all queries that can run successfully at least once with memory settings S, and {p}_S is the set of all queries that cannot run successfully even once with memory settings S, then a bug-free db system must be able to run successfully any query from {q}_S an infinite number of times, no matter how many and at what times a query from {p}_S is attempted.

I cannot rule out the possibility that what we are dealing with is a breach of the "superior" axiom, rather than the "inferior" axiom. In other words, it is a possibility that my application attempts a query that is not possible given my memory settings, and in so doing, something breaks and causes, for example, that one thread to not be able to do anything any more.

Here is an interesting fact that _may_ support that hypothesis: After restarting the db a little after midnight 7/22, at noon 7/26 the errors started. They continued to worsen throughout the day, and tapered off a little during the night (expected, since people use the application mainly during business hours). The last error occurred at 7:06 AM 7/27, because right at about 7:07 I ran the following two commands:

mysqladmin flush-tables
mysqladmin flush-threads

And here's the interesting thing: it is now 9:44 AM 7/28, and the error has not yet re-occurred. In other words, flushing tables and threads seems to have cleared up whatever was causing the error. It will be interesting to see if the errors start up again in a few days, which would mean flushing tables and threads accomplishes the same thing as restarting the db. It will be baffling if the error never occurs again as long as we do not restart the db. That would be absolutely flabergasting, and defy all logic.

Hi!

Errno. 12 means "Not enough space".

Can you confirm or not if you have enough space in your server ?

I have been running 4.1.X for extended amount of time, and I have never received those kinds of errors.

THanks!

If "space" means disk space, then yes, I definitely have enough disk space:

[root@localhost root]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             69408144  25362948  40519332  39% /
/dev/sda1               101089     14941     80929  16% /boot
/dev/sdb1            141106132  94874920  39063416  71% /mnt/drive2
none                   1934772         0   1934772   0% /dev/shm
[root@localhost root]# 

(The data directory resides under /mnt/drive2.)

If "space" means memory, here is the present state of my memory:

 11:49:27  up 260 days,  8:55,  2 users,  load average: 2.17, 1.98, 1.84
111 processes: 110 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   51.0%    0.0%    3.2%   0.2%     2.2%    0.5%   42.5%
           cpu00   37.8%    0.0%    4.6%   0.0%     1.6%    0.4%   55.6%
           cpu01   53.2%    0.0%    2.6%   1.2%     4.4%    0.6%   38.0%
           cpu02   38.3%    0.0%    3.5%   0.0%     2.9%    0.7%   54.2%
           cpu03   74.8%    0.0%    2.3%   0.0%     0.1%    0.3%   22.1%
Mem:  3869544k av, 3844552k used,   24992k free,       0k shrd,  196264k buff
                   2955108k actv,  735780k in_d,   18908k in_c
Swap: 1052248k av,  770384k used,  281864k free                 1725612k cached

Of course, right now I am not getting the errors.

However, I think this particular line of inquiry is beside the point, because what happens is that if I do not restart mysqld, eventually ALL queries fail--even simple ones that would require only a few bytes of memory to run.

Historical discussion of this problem can be found at:

http://forums.mysql.com/read.php?21,33887,33887#msg-33887

Hi!

I made a mistake in my previous comment.  I stated error 12 was 'not enough disk' and that was wrong.  Under Linux, error 12 means not enough memory:

jorge-/home/jorge> perror 12
OS error code  12:  Cannot allocate memory
jorge-/home/jorge>

I don't agree that this is not a bug. If my memory configuration is such that a query can run once, it should be able to run that same query an infinite number of times without running out of memory. Otherwise, you should display the following disclaimer in a prominent place in the documentation: 

MYSQLD IN A PRODUCTION ENVIRONMENT MUST BE REBOOTED EVERY NIGHT TO AVOID THE THREAT OF RUNNING OUT OF MEMORY AND BEING UNABLE TO RESPOND TO ANY QUERY

I am glad I am not the only one getting the "Got error 12 when reading table" ERROR. 

Nathan,I agree with you - this is a highly critical BUG as its affecting our production system adversely. 

We are running the std-4.1.7 version on HPUX11.0 

The first few months of testing on DEV were fine. We never encounterd an issue with "Got error 12 when reading table" 

One month into production and we started noticing the ERROR 12. It started will less frequency and now has started occussing every 3-4 days. 

The only way I could get rid of it was a stop/ start of MySqld 
I was using the my-medium.cnf as my configuration file. 

I thought that some internal variables might be running out of memory and increased the parameter values using my-huge.cnf. With those settings, I start getting mysql_query() failed / mysql_store_result() failed "Got Error 12 from Storage Engine" 

How can I check how much memory is being used by Mysql ? 

The ps -le | grep mysqld -- the SZ is very low -- its hovering around 8000 as of now. 

Any advice as to how to anticipate the memory running out OR a command to check how much memory is getting used up by Mysql will be highly appreciated. 

The flush-tables and flush-threads did not help me with the my-huge.cnf scenario. 

Thanks 
Puneesh

Ok, here is one of the queries that periodically generates the "error 12" message. It usually only happens after calling this line several tens of thousands of times, and even then it is only sporadic:

SELECT a.queue, a.destination, b.binary_data, a.tries, a.send_after, a.status, a.created, a.created_by, a.modified, a.modified_by, a.sequence_id FROM io_export_message AS a LEFT JOIN io_export_message_blob AS b USING( export_message_id ) WHERE a.export_message_id = ? 

Here are the two tables involved in this query:

CREATE TABLE `io_export_message` (
  `export_message_id` int(10) unsigned NOT NULL auto_increment,
  `queue` varchar(128) NOT NULL default '',
  `destination` varchar(128) NOT NULL default '',
  `tries` int(10) unsigned NOT NULL default '0',
  `send_after` int(10) unsigned NOT NULL default '0',
  `status` enum('creating','queued','sending','succeeded','failed','deleted') NOT NULL default 'creating',
  `created` int(10) unsigned NOT NULL default '0',
  `created_by` varchar(128) NOT NULL default '',
  `modified` int(10) unsigned NOT NULL default '0',
  `modified_by` varchar(128) NOT NULL default '',
  `sequence_id` int(11) NOT NULL default '1',
  PRIMARY KEY  (`export_message_id`),
  KEY `idx_1` (`queue`),
  KEY `idx_2` (`created`),
  KEY `idx_3` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `io_export_message_blob` (
  `export_message_id` int(10) unsigned NOT NULL default '0',
  `binary_data` longblob,
  UNIQUE KEY `idx_1` (`export_message_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 MAX_ROWS=100000000 AVG_ROW_LENGTH=13000

Note that a.export_message_id is the PRIMARY KEY. I do not believe that anyone out there could possibly claim with a straight face that this query is legitimately taking up too much memory.

THIS A BUG!!!!!!!!!!!!!!!!!!!!!!!!!!!!

OK. This is a first step towards a repeatable test case. But still we need some data to fill into the tables before running the SELECT command, right?

Other questions:
- Did you run this yourself on a test machine and got the error?
- Is it sufficient to run it in a single session or is it required that multiple sessions run it in parrallel?
- Did you ever try to run it on a "debug" server, which might note not freed memory in the error log?
- Can you please wait a minute between "mysqladmin flush-tables" and "mysqladmin flush-threads" and save the SZ value from "ps -le" right before both commands and a miniute after the secons one?

---------------------
- Did you run this yourself on a test machine and got the error?
---------------------

No, this has never happened on a test machine--dev, qa, staging, or otherwise-- and our staging environment is the *exact* same configuration (hardware, software, network, and otherwise) as our production environment. The only difference between production and staging is traffic and diversity of usage: production has several orders of magnitude more of both.

---------------------
- Is it sufficient to run it in a single session or is it required that multiple
sessions run it in parrallel?
---------------------

It seems that eventually the error occurs no matter what you do--whether single or multiple sessions, whether all at once or one at a time, whether complex query or trivial query.

---------------------
- Did you ever try to run it on a "debug" server, which might note not freed
memory in the error log?
---------------------

I lied in the answer to the first question, because as of last night our staging environment is running the debug version of mysql 4.1.13. How do I ensure that I have the correct level of logging turned on?

---------------------
- Can you please wait a minute between "mysqladmin flush-tables" and "mysqladmin
flush-threads" and save the SZ value from "ps -le" right before both commands
and a miniute after the secons one?
---------------------

I'll attach this as a file.

ps before and after flush-tables and flush-threads

Attachment: ps.log (application/octet-stream, text), 19.84 KiB.

As a work-around, we had been running flush-tables and flush-threads twice per day. But recently even this stopped working, so now we use crond to flush-tables and flush-threads every five minutes in our production environment. So far, this has contained the problem.

shows how closing all connections does nothing to free db memory

Attachment: debug-mysql.log (application/octet-stream, text), 70.27 KiB.

Hi!

To rule out a memory leak in InnoDB, you can monitor SHOW INNODB STATUS. It prints the total memory allocated by InnoDB.

Regards,

Heikki

Hi,

So, according to the error log provided you've got:

050829  8:40:00 [ERROR] /usr/local/mysql/bin/mysqld: Got error 12 from storage engine
050829  8:40:00 [ERROR] /usr/local/mysql/bin/mysqld: Sort aborted
050829  8:40:19 [ERROR] Got error 12 when reading table './newco/io_export_message_blob'

error 12 when reading MyISAM table. Have you noted similar errors related to any of InnoDB tables (io_export_message)? If yes, please, send the results of
SHOW INNODB STATUS command just after that, as Heikki asked you. We need to figure out is this possible memory leak related to InnoDB stogare engine.

The results of ulimit -a command executed as a user who runs mysqld may be also useful. As well, as the server variables values (I saw no my.cnf file posted yet).

You provided tables structure and a problem query. To make a test case we need to know the number of rows in each of these 2 tables. The results of 

EXPLAIN SELECT a.queue, a.destination, b.binary_data, a.tries, a.send_after, a.status, a.created, a.created_by, a.modified, a.modified_by, a.sequence_id FROM io_export_message AS a LEFT JOIN io_export_message_blob AS b USING(export_message_id ) WHERE a.export_message_id = ? 

with ? substituted by some reasonable value can also help to pinpoint the possible reason.

Have you tried to work with the latest 4.1.14 version? I will be used for verification in any case.

Thank you for your patience and commitment to MySQL.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

We're having problems with this too. How do I re-open this issue?

Chris,

Please, provide all the information you have on similar issue.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".