MySQL Bugs: #16041: Out of Memory when replicating constantly updated tables on slave

Bug #16041	Out of Memory when replicating constantly updated tables on slave
Submitted:	28 Dec 2005 11:43	Modified:	9 Feb 2006 8:52
Reporter:	Ricky Chan	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	5.0.15	OS:	FreeBSD (FreeBSD)
Assigned to:		CPU Architecture:	Any

Description:
Happens on 5.0.15 from FreeBSD ports on FreeBSD 4.7,5.3, 5.4 and 6.0.
Also happens when using mysql 5.0.16 pre-build binaries from mysql website.

When replicating from a master database version 3.23.49. The slave handles fine and has no problems. When the master database is constantly updated (2 tables, 2 updates a second per table approx). The memory usage on the slave exceeds 500M and crashes with:

key_buffer_size=8388600
read_buffer_size=131072
max_used_connections=0
max_connections=100
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 225791 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Naturally I believe these values are fine, but for a check I reduced all these variables but when I re-start within 3 seconds the database spirals out of control and uses in excess of 500M and crashes due to out of memory with:

key_buffer_size=4194304
read_buffer_size=131072
max_used_connections=0
max_connections=10
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 46335 K
bytes of memory

The box the slave is running on has 2G free but the point is the database seems to be in a race condition and uses up all the memory.

When I used Mysql 3.x or Mysql 4.x as a slave the problem no longer persists. However I really need to use Mysql 5.x as I have written triggers to act on the slave data (they are not enabled because of the problem I am having).

Further due the database crashes and restarting I now have > 9000 relay-bin logs in my database directory but relay.info shows it has only gotten to the 16th file due to problems.

How to repeat:
Set up mysql 5.0.15 from FreeBSD ports and set up replication as show above, the slave will run out of memory quickly and crash.

Mysql 4.x/3.x does not have this problem.

I am aware 512M process limit (by default in FreeBSD) but I do not want to remove this as I don't believe that the slave should get anywhere near it and if I am right is a bug then removing will affect the rest of the box.

Thank you for a problem report. Please, send the table definitions (both from master AND slave, with triggers) and typical updates you perform on master. Have you tried to work on 5.0.x, but without triggers on slave?

On both master/slave the schema is the same and no triggers are used (I was going to but couldn't because of the problem I received).

Here they are (about 4 tables), you can see I am using them to store radius accounting packets, which are replicated to a slave so that internally they can be queried.  This is currently work fine the slave running mysql 4.1.x (using less than 10 Megs of memory), when I try to move the slave to mysql 5.x (I naturally adjusted the master.info according so that we get clean relay-bin data as I know format not the same) the issue escalates out of control.

CREATE TABLE ACCOUNTING_2005_12 (
  count int(11) NOT NULL auto_increment,
  user char(32) NOT NULL default '',
  realm char(32) NOT NULL default '',
  time_stamp datetime default NULL,
  status char(10) default NULL,
  delay int(11) default NULL,
  input int(10) unsigned default NULL,
  input_wrap int(10) unsigned default NULL,
  output int(10) unsigned default NULL,
  output_wrap int(10) unsigned default NULL,
  sess_id char(30) default NULL,
  sess_time int(11) default NULL,
  term_reason int(11) default NULL,
  nas_id char(50) default NULL,
  nas_port int(11) default NULL,
  nas_identifier char(50) default NULL,
  ip char(22) default NULL,
  called_station char(22) default NULL,
  calling_station char(22) default NULL,
  time_weekend int(11) NOT NULL default '0',
  time_peak int(11) NOT NULL default '0',
  time_offpeak int(11) NOT NULL default '0',
  PRIMARY KEY  (count),
  KEY user (user),
  KEY realm (realm),
  KEY ip (ip),
  KEY time_stamp (time_stamp),
  KEY calling_station (calling_station)
) TYPE=MyISAM;

CREATE TABLE Usage_2005_12 (
  user varchar(32) NOT NULL default '',
  realm varchar(32) NOT NULL default '',
  day smallint(5) unsigned NOT NULL default '0',
  totalseconds int(10) unsigned default NULL,
  totalupload bigint(20) unsigned default NULL,
  totaldownload bigint(20) unsigned default NULL,
  cursessid varchar(41) default NULL,
  curseconds int(10) unsigned default NULL,
  curupload bigint(20) unsigned default NULL,
  curdownload bigint(20) unsigned default NULL,
  lastupdate datetime default NULL,
  PRIMARY KEY  (user,realm,day)
) TYPE=MyISAM;

CREATE TABLE RADPOOL (
  STATE int(11) NOT NULL default '0',
  TIME_STAMP int(11) default NULL,
  EXPIRY int(11) default NULL,
  USERNAME char(50) default NULL,
  POOL char(50) NOT NULL default '',
  YIADDR char(50) NOT NULL default '',
  SUBNETMASK char(50) NOT NULL default '',
  DNSSERVER char(50) default NULL,
  UNIQUE KEY RADPOOL_I (YIADDR),
  KEY RADPOOL_I2 (POOL)
) TYPE=MyISAM;

CREATE TABLE RADONLINE (
  USERNAME char(50) NOT NULL default '',
  REALM char(48) default NULL,
  NASIDENTIFIER char(50) NOT NULL default '',
  NASPORT int(11) NOT NULL default '0',
  ACCTSESSIONID char(30) NOT NULL default '',
  TIME_STAMP int(11) default NULL,
  FRAMEDIPADDRESS char(22) default NULL,
  NASPORTTYPE char(10) default NULL,
  SERVICETYPE char(20) default NULL,
  called_station char(22) default NULL,
  calling_station char(22) default NULL,
  SESSION_KEY char(30) NOT NULL default '',
  KEY RADONLINE_I2 (USERNAME)
) TYPE=MyISAM;

I underestimated the inserts, in 1 second sample I took from the master logs there is 60 changes to the tables. I have modified the output to hide sensitive information and have attched as a file.

Forgot to ask, sorry... Can you, please, send the my.cnf files from both master and slave? You may upload them as private files.

Please, try to repeat with newer version of MySQL (5.0.18 is already available) on the slave. Send the SHOW PROCESSLIST statement and top command results while you observe that memory usage growth.

As instructed I grabbed a copy of 5.0.18 off the server.  No change problem still persists.

Here is the details you wants:

>show processlist;
+----+-------------+-----------+--------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User        | Host      | db     | Command | Time   | State                                  | Info                                                                                                 |
+----+-------------+-----------+--------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+
|  1 | system user |           | NULL   | Connect |    229 | Queueing master event to the relay log | NULL                                                                                                 |
|  2 | system user |           | radius | Connect | 688571 | init                                   | update RADPOOL set state=0 where YIADDR not in ('x.x.x.x','x.x.x.x','x.x.x.x',' |
|  4 | root        | localhost | NULL   | Query   |      0 | NULL                                   | show processlist                                                                                     |
+----+-------------+-----------+--------+---------+--------+----------------------------------------+------------------------------------------------------------------------------------------------------+

Top shows:

last pid:  2855;  load averages:  1.88,  0.93,  0.40                                                                               up 140+01:23:15 15:41:27
60 processes:  4 running, 56 sleeping
CPU states: 47.3% user,  0.0% nice, 46.9% system,  5.8% interrupt,  0.0% idle
Mem: 334M Active, 37M Inact, 110M Wired, 15M Cache, 60M Buf, 992K Free
Swap: 1100M Total, 149M Used, 951M Free, 13% Inuse, 8K In, 8848K Out

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 2777 mysql     20    0   496M   365M kserel   3:35 88.09% 88.09% mysqld

(As soon as it hits > 512 (the limit we have allowed process to grow), the program dies and mysqld_safe restarts mysql for the same thing to happen again.

Entry in error log is the expect out of memory error message. Note no errors noted from start of replication.

Version: '5.0.18-standard'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Edition - Standard (GPL)
060125 15:37:23 [Note] Slave SQL thread initialized, starting replication in log 'x.264' at position 36442135, relay log './x.000031' position: 721304081
060125 15:37:23 [Note] Slave I/O thread: connected to master 'x@x:x',  replication started in log 'x.264' at position 36442135
060125 15:41:32 [ERROR] Out of memory; check if mysqld or some other process uses all available memory; if not, you may have to use 'ulimit' to allow mysqld to use more memory or you can add more swap space
mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

I had noticed NOT IN in your last SHOW PROCESSLIST:

update RADPOOL set state=0 where YIADDR not in ('x.x.x.x','x.x.x.x','x.x.x.x','

How long is that list of values in typical [NOT] IN clause? 10 elements, or hundreds of them? (I know similar bugs with this kind of clauses in MySQL optimizer, hence the question.)

The update in can contain up to 15,000 entries.  On the actually (master) database itself this action does not seem to take any noticeable time.  

In addition, this is done inside a WRITE LOCK, UNLOCK for the table RADONLINE and RADPOOL as the process is used to sync these DB's so other data can't be written while this goes on.

Just a bit strange that replicating to a 3.x/4.x mysql database I don't have a problem only on 5.0.x where replication seems to be causing problems.

Looks like a duplicate of bug #15872 (and several other similar reports) then. It is a 5.0.x-specific optimizer-related bug. Please, check.

It sounds feasible.

I will change that script to use several smaller IN/NOT IN rather than 1 large one, then bring the slave database to match master, then switch replication on.

I see if I can see the problem still exists after theses changes.

Update to follow.

Please, do that and inform about the results. I am almost sure that NOT IN optimizer bug of 5.0.x is the real reason!

This is looking promising, I removed code which used "NOT IN", created a master.info  on a spare box on 5.0.18 to start replication after this point.  Usually with in a few minutes I got an issue.  I have had it running for 90 minutes now at 16M using 1.9% CPU.  It that time period the relay log has gotten to ~60M in size.

It looks almost certain this is related to the other bug.  I want to leave it running for a bit longer (~24hrs), after that I will be more than happy to close the ticket.

Thanks.

Duplicate of bug #15872.