MySQL Bugs: #414: mysqld eats almost all CPUs and does little work

Bug #414	mysqld eats almost all CPUs and does little work
Submitted:	10 May 2003 18:22	Modified:	2 Jun 2003 3:28
Reporter:	shilling lee	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: MyISAM storage engine	Severity:	S3 (Non-critical)
Version:	4.0.4-beta	OS:	FreeBSD (FreeBSD 4.8-RC)
Assigned to:		CPU Architecture:	Any

Description:

1. As below, produced by top
=== cut from top ===
last pid: 19124;  load averages:  1.00,  1.00,  1.00                                                       up 61+22:44:04  09:09:48
27 processes:  3 running, 24 sleeping
CPU states: 15.7% user,  0.0% nice, 84.3% system,  0.0% interrupt,  0.0% idle
Mem: 462M Active, 355M Inact, 152M Wired, 32M Cache, 112M Buf, 2324K Free
Swap: 2032M Total, 48K Used, 2032M Free

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
 9520 data      64   0   449M   425M RUN    167.2H 98.10% 98.10% mysqld
=== cut from top end ===

2. As below, produced by mysqladmin status
=== cut from status ===
mysql  Ver 12.14 Distrib 4.0.4-beta, for unknown-freebsdelf4.7 (i386)

Connection id:          444002
Current database:       sms
Current user:           root@localhost
SSL:                    Not in use
Current pager:          more
Using outfile:          ''
Server version:         4.0.4-beta-log
Protocol version:       10
Connection:             Localhost via UNIX socket
Client characterset:    gb2312
Server characterset:    gb2312
UNIX socket:            /tmp/mysql.sock
Uptime:                 18 days 15 hours 11 min 37 sec

Threads: 74  Questions: 19760547  Slow queries: 817  Opens: 430  Flush tables: 1  Open tables: 305  Queries per second avg: 12.274
=== cut from status ends ===

3. After I did "flush-hosts, flush-tables .... flush everything and refresh"
the mysqladmin status produced as below, but CPU loads as high as before.

=== cut from status ===
mysql  Ver 12.14 Distrib 4.0.4-beta, for unknown-freebsdelf4.7 (i386)

Connection id:          444115
Current database:       sms
Current user:           root@localhost
SSL:                    Not in use
Current pager:          more
Using outfile:          ''
Server version:         4.0.4-beta-log
Protocol version:       10
Connection:             Localhost via UNIX socket
Client characterset:    gb2312
Server characterset:    gb2312
UNIX socket:            /tmp/mysql.sock
Uptime:                 18 days 15 hours 37 min 57 sec

Threads: 70  Questions: 19775946  Slow queries: 0  Opens: 26  Flush tables: 3  Open tables: 21  Queries per second avg: 12.272
=== cut from status ends ===

4. This is part of my.cnf that different from the example
[mysqld]
set-variable  = key_buffer=384M
#set-variable = max_allowed_packet=1M
set-variable  = table_cache=512
set-variable  = sort_buffer=2M
set-variable  = record_buffer=2M
set-variable  = thread_cache=8
set-variable  = thread_concurrency=2
set-variable  = myisam_sort_buffer_size=64M
server-id = 1
max_heap_table_size=384M
interactive_timeout=5400
wait_timeout=5400
max_connections=512
log=/usr/home/mysql/data/mysqld.log
bulk_insert_buffer_size=128M
max_connect_errors=100
max_allowed_packet=32M
tmpdir=/usr/home/mysql/tmp
tmp_table_size=384M

How to repeat:
just run mysqld for half or 1 month and do not restart

Suggested fix:
restart?

Not enough information was provided for us to be able
to handle this bug. Please re-read the instructions at
http://bugs.mysql.com/how-to-report.php

If you can provide more information, feel free to add it
to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

Sorry but as we don't have any possibilities to ourselves repeat this problem, it doesn't fall into the category of bugs that should be handled by this bugs system.

We know that this problem is not something that is typical as we have many customers who have had MySQL running on FreeBSD for longer times, without any problems.

If you can find out any more information of what MySQL is doing that may give a solution for this problem, we are willing to help you solve this.

In the mean time, you can try the following:
- Try the latest MySQL 4.0.x version;  I think we have fixed some FreeBSD specific issues since 4.0.4.  There is a small change that that may fix your problem, but I can't promise that.
- Try to compile the MySQL source with linuxthreads on FreeBSD.  For some users this have given them a faster MySQL version and I haven't heard of anyone running into new problems by doing that.

As our previous answers stated, we can't repeat this and your are on our own to solve this.  As this is something that doesn't happen (at least we have got very few similar reports for other system) I suspect this is a OS or library bug.

Some suggestions:

- Update to the latest MySQL version. This will probably not solve this problem but will at least enusure that if you find a bug we should be able to easily verify / test it.
- Compile mysqld with debugging (configure --debug) and when this happens again, connect to it with gdb to see what mysqld is really doing
- Examine with truss, strace what system calls mysqld is doing. This may give a clue what is happening.
- Compile MySQL with the other FreeBSD thread library. Linuxthreads is reported to work good on FreeBSD.

This bug was solved by upgrading my MySQL server to 4.0.12, thank u :)

Now the bug appears again 4 monthes later, mysql version is 4.0.12.

It seems that mysql only eats the "idle" cpu and limit the system load to 1.00
and the load will not grows up than 1.00. 
It seems that something wrong with a certain thread, but I close all the connections to mysql, the load is also 1.00, and nothing strange in error file and processlist. I try to flush the open tables, but it wasn't affected.

=== cut ===

last pid: 30167;  load averages:  1.00,  1.00,  1.00                                                       up 115+15:17:50 09:40:22
42 processes:  2 running, 40 sleeping
CPU states: 14.1% user,  0.0% nice, 84.5% system,  1.4% interrupt,  0.0% idle
Mem: 475M Active, 219M Inact, 192M Wired, 15M Cache, 112M Buf, 103M Free
Swap: 2032M Total, 140K Used, 2032M Free

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
48359 gateway   61   0   456M   433M RUN     78.9H 97.66% 97.66% mysqld
30167 root      29   0  1908K  1092K RUN      0:00  6.30%  0.88% top

=== cut ===

*** It will share the cpu with other process ***

=== cut ===
last pid: 30266;  load averages:  1.28,  1.06,  1.02                                                       up 115+15:22:20 09:44:52
44 processes:  4 running, 40 sleeping
CPU states: 57.3% user,  0.0% nice, 42.7% system,  0.0% interrupt,  0.0% idle
Mem: 471M Active, 294M Inact, 177M Wired, 15M Cache, 112M Buf, 46M Free
Swap: 2032M Total, 140K Used, 2032M Free

  PID USERNAME PRI NICE  SIZE    RES STATE    TIME   WCPU    CPU COMMAND
48359 gateway   53   0   456M   433M RUN     79.0H 65.04% 65.04% mysqld
30265 root      50   0   576K   408K RUN      0:10 45.57% 30.42% gzip
30264 root      -6   0   536K   364K pipdwt   0:00  0.51%  0.34% tar
30266 root      28   0  1908K  1084K RUN      0:00  0.33%  0.20% top
=== cut ===

Using mysqld-nt on windows XP SP2.  I'm having the same problem, CPU usage is about 98%.  Rebooting didn't work.  Also, my computer stops about every 10 to 30 seconds, for about 1 or two  seconds.  I can still move the mouse, but nothing else.  I'm not using an up-to-date version,  I'll see what that gives me.

I'm running MySQL 4.0.24 with FreeBSD 5.2.1 and have the same issue:

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
84666 mysql    129    0   166M 84860K RUN    0 1837.4 76.61% 76.61% mysqld

it happens after 2-3 weeks after startup. Restarting mysql resolves the problem for 2-3 weeks.

I've been doing further investigation and noticed a few things:

1) strace -c -p <pid>

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.98    0.326158          25     12932           poll
 48.69    0.311519          24     12935           gettimeofday
  0.20    0.001308         327         4           write
  0.07    0.000422          21        20         8 read
  0.04    0.000278          17        16           fcntl
  0.02    0.000128          16         8           lseek
  0.00    0.000015          15         1           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.639828                 25916         8 total

we can note that poll and gettimeofday are probably the reason for taking all that cpu time.

2) strace -tt -p <pid>

16:33:28.100282 gettimeofday({1121870008, 100334}, NULL) = 0
16:33:28.100777 poll([{fd=3, events=POLLRDNORM, revents=POLLRDNORM}, {fd=42, events=POLLRDNORM}, {fd=50, events=POLLRDNORM}, {fd=5, events=POLLRDNORM}, {fd=6, events=POLLRDNORM}, {fd=31, events=POLLRDNORM}, {fd=27, events=POLLRDNORM}, {fd=155, events=POLLRDNORM}, {fd=154, events=POLLRDNORM}, {fd=153, events=POLLRDNORM}, {fd=104, events=POLLRDNORM}, {fd=102, events=POLLRDNORM}, {fd=11, events=POLLRDNORM}], 13, 456) = 1
16:33:28.101175 gettimeofday({1121870008, 101282}, NULL) = 0
16:33:28.101620 poll([{fd=3, events=POLLRDNORM, revents=POLLRDNORM}, {fd=42, events=POLLRDNORM}, {fd=50, events=POLLRDNORM}, {fd=5, events=POLLRDNORM}, {fd=6, events=POLLRDNORM}, {fd=31, events=POLLRDNORM}, {fd=27, events=POLLRDNORM}, {fd=155, events=POLLRDNORM}, {fd=154, events=POLLRDNORM}, {fd=153, events=POLLRDNORM}, {fd=104, events=POLLRDNORM}, {fd=102, events=POLLRDNORM}, {fd=11, events=POLLRDNORM}], 13, 455) = 1
16:33:28.102007 gettimeofday({1121870008, 102035}, NULL) = 0
16:33:28.102298 poll([{fd=3, events=POLLRDNORM, revents=POLLRDNORM}, {fd=42, events=POLLRDNORM}, {fd=50, events=POLLRDNORM}, {fd=5, events=POLLRDNORM}, {fd=6, events=POLLRDNORM}, {fd=31, events=POLLRDNORM}, {fd=27, events=POLLRDNORM}, {fd=155, events=POLLRDNORM}, {fd=154, events=POLLRDNORM}, {fd=153, events=POLLRDNORM}, {fd=104, events=POLLRDNORM}, {fd=102, events=POLLRDNORM}, {fd=11, events=POLLRDNORM}], 13, 454) = 1

woahh... nice poll()ing in a loop. note the very short time interval between the functions execution. this is probable the reason of the high cpu usage. its checking if it can read from a certain amount of FDs. poll returns 1, which means there's on FD which can be read. we see its FD 3. who is FD 3?

3) fstat -v -p <pid>

USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
...
mysql    mysqld     84666    3* pipe ccbfc204 <-> cf7c6408  16384 rw
mysql    mysqld     84666    4* pipe cf7c6408 <-> ccbfc204      0 rw
...

ok, FD 3 is a pipe and FD 4 is the other end. it seems to have some data waiting to be read on the FD 3 side.

I'm wondering what for this pipe is..

4) searching the libc pipe() function in the source

I found it only once, in mysqlmanager.c it seems to be used to communicate with a child after fork()+exec().

in short, there's a pipe open, there's something to read but none is reading it. the effect is that poll() is always returning >=1 and takes all the CPU.

Has this been fixed yet?

It seems we're running into this issue as well. We have multiple machines running FreeBSD 5.4 and mySQL 4.0.22, with dual 3GHz Xeons (with Hyperthreading(tm) ) and 2GB RAM each.

We can't put our fingers on anything specific besides the fact that system CPU % shown by 'top' will jump up to over 20% (it's usually under 5%), and stays there, slowing down the machine. Restarting mySQL fixes it.

I don't have logs to attach, nor do I have a lot of details besides these. Let me know what logs, traces, etc. you need, and I'll try to grab them next time it  happens. I can capture during or after the event, but once I notice it, I really need to get things back on track as soon as possible.