MySQL Bugs: #1758: Mysql crashed and had to be manually restarted

Bug #1758	Mysql crashed and had to be manually restarted
Submitted:	5 Nov 2003 9:36	Modified:	13 Jan 2004 11:37
Reporter:	Richard Thomas	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	4.0.16-0	OS:	Linux (Redhat 9)
Assigned to:		CPU Architecture:	Any

Description:
Crashed after 3 days of uptime. Server is a new Quad Zeon with 8 gigs of memory.

mysqlbug

>C compiler:    2.95.3
>C++ compiler:  2.95.3
>Environment:
        <machine, os, target, libraries (multiple lines)>
System: Linux localhost.localdomain 2.4.20-20.9bigmem #1 SMP Mon Aug 18 11:18:32 EDT 2003 i686 i686 i386 GNU/Linux
Architecture: i686
                                                                                                                                                             
Some paths:  /usr/bin/perl /usr/bin/make /usr/bin/gmake /usr/bin/gcc /usr/bin/cc
GCC: Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --wi
Thread model: posix
gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
Compilation info: CC='gcc'  CFLAGS='-O2 -mcpu=i486 -fno-strength-reduce'  CXX='g++'  CXXFLAGS='-O2 -mcpu=i486 -fno-strength-reduce                -felide-co
LIBC:
lrwxrwxrwx    1 root     root           13 Oct 23 17:28 /lib/libc.so.6 -> libc-2.3.2.so
-rwxr-xr-x    1 root     root      1557296 Apr  7  2003 /lib/libc-2.3.2.so
-rw-r--r--    1 root     root      2331360 Apr  7  2003 /usr/lib/libc.a
-rw-r--r--    1 root     root          204 Apr  7  2003 /usr/lib/libc.so
Configure command: ./configure '--disable-shared' '--with-mysqld-ldflags=-all-static' '--with-client-ldflags=-all-static' '--with-server-suffix=-standard' '
                                                                                                                                                             
                                                                                                                                                             
my.cnf settings I have adjusted from default HUGE file included with system.
                                                                                                                                                             
[client]
#password       = your_password
port            = 3306
socket          = /var/lib/mysql/mysql.sock
                                                                                                                                                             
# Here follows entries for some specific programs
                                                                                                                                                             
# The MySQL server
[mysqld]
port            = 3306
socket          = /var/lib/mysql/mysql.sock
query_cache_type = 1
query_cache_limit = 32M
query_cache_size = 500M
ft_min_word_len = 3
                                                                                                                                                             
skip-locking
key_buffer = 2048M
max_allowed_packet = 2M
table_cache = 4096
sort_buffer_size = 32M
read_buffer_size = 1M
myisam_sort_buffer_size = 64M
thread_cache = 1024
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 16
max_connections=500
skip-innodb
                                                                                                                                                             

Error log included.

Version: '4.0.16-standard-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306
mysqld got signal 6;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
                                                                                                                                                             
key_buffer_size=2147483648
read_buffer_size=1044480
max_used_connections=53
max_connections=500
threads_connected=54
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 2213932 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
                                                                                                                                                             
thd=0xbf443958
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xbf9fdbb8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x8071fa4
0x8292178
0x82a3081
0x8292509
0x82a34a0
0x8309c98
0x8309cbd
0x830a776
0x830a945
0x8309aac
0x80c1499
0x80be417
0x80975c9
0x8097fb7
0x8096169
0x8095916
0x807cb4a
0x8080775
0x807bcc3
0x807b70e
0x807aef8
0x828f92c
0x82c303a
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://www.mysql.com/doc/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x860bbf8 = SELECT count(*) as parts FROM broker_parts.partall WHERE comp_id = 4099
thd->thread_id=17277
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.
                                                                                                                                                             
Number of processes running now: 0

How to repeat:
Could not repeat

You have to resolve the stacktrace and post the resolved version at least. For example, this might help you:

#!/bin/sh
echo "This scripts expects stack trace from your mysql error log file"
echo "placed in the current directory as ./stack"

if [ ! -r ./stack ]; then
   echo "./stack file not present or readable"
   exit 255
fi

if [ -r /usr/local/mysql/bin/mysqld.sym.gz ]; then
   gzip -dc /usr/local/mysql/bin/mysqld.sym.gz > ./symbols
else
   echo "nm -n /usr/local/mysql/bin/mysqld > ./symbols"
   nm -n /usr/local/mysql/bin/mysqld > ./symbols
fi

echo "/usr/local/mysql/bin/resolve_stack_dump -s ./symbols -n ./stack"
/usr/local/mysql/bin/resolve_stack_dump -s ./symbols -n ./stack

You should find uout which table caused the crashed and what was the SQL command used.

Results from provided script

0x8071fa4  + 134684580
0x8292178  + 136913272
0x82a3081  + 136982657
0x8292509  + 136914185
0x82a34a0  + 136983712
0x8309c98  + 137403544
0x8309cbd  + 137403581
0x830a776  + 137406326
0x830a945  + 137406789
0x8309aac  + 137403052
0x80c1499  + 135009433
0x80be417  + 134997015
0x80975c9  + 134837705
0x8097fb7  + 134840247
0x8096169  + 134832489
0x8095916  + 134830358
0x807cb4a  + 134728522
0x8080775  + 134743925
0x807bcc3  + 134724803
0x807b70e  + 134723342
0x807aef8  + 134721272
0x828f92c  + 136902956
0x82c303a  + 137113658

Suspect query from logs 
SELECT count(*) as parts FROM
broker_parts.partall WHERE comp_id = 4099

Server has been up 6 days since reboot without problems now.

Another crash today.. Same exact stack, diffrent query.
query at 0x870f120 = SELECT COUNT(*) AS cnt FROM broker_users.companytable WHERE (approval=1 OR approval=2)  AND signup_rep=16

Stacks match,query is diffrent.

Running RH9/Apache 1.3.29/php 4.3.4/mysql 4.0.16
MySQL is running on a separate machine, WWW has its own, separate machine.

I experience very same problems with queries using WHERE clause, running on frequently updated (by INSERT, UPDATE & DELETE) tables.

Example of query:
SELECT AVG( re.rating_value ) AS avgscore, r . * , r.date_added AS date_review_added, i . * , v . *
FROM rating AS r, items AS i, vendors AS v, rating_entries AS re
WHERE r.item_id = i.item_id AND v.vendor_id = i.vendor_id AND re.rating_id = r.rating_id
GROUP BY rating_id
ORDER BY r.date_added DESC
LIMIT 5

This query requires extremely healthy indexes, as due to WHERE is used together with GROUP BY and ORDER BY. When index is not refreshed at the execution of DELETE query, problems with capability of handling huge temporary tables & files occurs immediately.

Size vs. index
--------------
The query is operating on 4 tables, which are not impressive by its size on standalone basis:
- rating: approx. 27000 records
- rating_entries: approx. 31000 records
- items: approx. 4700 items
- vendor: approx. 400 items

But especially rating and rating_entries are updated approx. 5-10 times per minute (including issuing DELETE commands to these tables). This is having a tremendous impact on indexes, which for some reason are not refreshed after deleting rows from a database.

Let's have a look on EXPLAIN of above mentioned query:

table type possible_keys key key_len ref rows Extra
i ALL PRIMARY,product_id,vendor_id,item_vendor_category NULL NULL NULL 4818 Using temporary; Using filesort
v ref PRIMARY,vendor_id PRIMARY 4 i.vendor_id 1
r ref PRIMARY,product_id,rating_id product_id 4 i.item_id 6
re ref rating_id rating_id 4 r.rating_id 1 Using where

This means we have a temporary table of 4818 x 1 x 6 x 1, total 28908 rows. Not small, right? But those 4 tables are equipped with right indexes and execution time goes below 1 second.

As soon as DELETE command removes one row from tables rating and rating_entries, which are relating to each other, we have problem with indexes of tables rating & rating_entries. For some (???) reason those indexes cannot be accessed anymore and mysqld builds a sort file on all the rows from these two tables. Accordingly:

This means we have a temporary table of 4818 x 27000 x 31000 x 1, total 4032666000000 rows. Nice...

This hits hard disk and fills it immediately upto limits. At the same time CPU (1 x P4 2,4MHz) stays at 30-50%, RAM is still enough, no memory leaks at all. But simply it cannot work like that. One would need huge 40-60 GB RAM drive to serve it correctly.

Of course there are ways to overcome it, for example by declaring cron job which constantly optimizes these tables. As I did. But then it becomes the question of luck, which process will be first right after deleting a row from these tables - SELECT query or OPTIMIZE query?

Therefore my machine surrvives sometimes 8 days, but with huge load (like Google spidering my site) even each 5 minuts server has to be restarted.

Hope this will lead in the right direction.

I think your right, a fragmented index can cause higher memory usage and issues.

I was already optimizing every night, the problems did not start till I moved to a new server dedicated to mysql ( the previous was shared with apache and sendmail ).

I fine tuned the memory really close to using the max ram. After the crashes I scaled back those settings slightly and have had a uptime of 21 days now.. the longest yet. If your having issues try reducing your memory settings in my.cnf some and see if that helps.

Well. I am even more surprised, as I have also recently moved to a dedicated mysql machine. 

For me the real issue is fragmentation of indexes. As I have 2GB of RAM only, exceeding some Linux system limites values is not a problem, I hope.

Good way to repeat this bug is:
- simple delete one row from table, which you use in your WHERE clause, 
- point your file manager (mc) to mysql temporary catalogue
- then run the query
- observe temporary file size

If your machine is strong enough and table is not locked you can try to obtain EXPLAIN of your SELECT statement. Would be interesting for you to see how big Kartesian table is.

In my case temporary files go immediately over 4GB of HDD space. This is not actually stopping server, it simply extremely slows down execution of mysql threads.  MyISAM tables get locked and all the other threads (including OPTIMIZE) are waiting untill fat boys will be pushed through HDD.

I am just wondering, should I report a separated bug related to index fragmentation?

Sorry, I have forgotten to mention, that optimization (run from apache' cron job) goes each 20 minutes. As my mysql serves approx. 5000 queries per minute:

- I can't rung cron more frequently, due to accesibility issued & performance of the server itself

- Indexes got fragmented anyhow :-(

This bug report contains two different, independent issues. I try to handle both here. Future discussion about 'framgendted indexes' should however be moded to a separte bug report.

The crash:
The stack trace doesn't help us as this is not symbolic (we need the names of the called functions to get an idea of the problem). Sorry but I don't know why the script supplied by Martin Mokrejs doesn't work :(

Please read the link http://www.mysql.com/doc/en/Using_stack_trace.html for how to do the stack trace.

If you have more crashes, I would recommend you to download the debug version of the MySQL server and start this with the --core-file option. If you now get a crash, then the generated core file should contain enough information to solve the problem. You can upload the core file, the used mysqld binary and a short description of the bug to ftp://support.mysql.com/pub/mysql/secret/ so that we can analyse it.

About index fragementation (should be moved to another bug report)

There are a few separate, independent problems with indexes when one
does a lot of deletes on a table

- The cardinality of the index changes (not likely but possible) and MySQL uses the wrong index in the query.

In this case one could try to update the cardinality with ANALYZE
TABLE (faster than OPTIMIZE TABLE) or use IGNORE INDEX / USE INDEX in
the query to get MySQL to use the right indexes.

- The indexes gets 'unbalanced' and the range optimizer estimates
wrongly the wrong number of matching keys for expressions like
'key_column op constant'. In this case one should update the index
with OPTIMIZE TABLE or use IGNORE INDEX / USE INDEX.

- The index files could be fragmented on disk, causing more seeks than
are optimal, but I don't think this is the problem here.

Neither of the above problems are frequent and only happens to some
specific tables and specific usage. One common case where the index
gets unbalanced is when you have an indexed timestamp columns and you
insert a lot of new rows and delete a lot of old rows with an old
timestamp. In this case the optimizer will estimate wrongly the
number of rows for queries like

WHERE timestamp_column BETWEEN date and data2

In most cases the estimate is still 'good enough' but we know of cases
when the estimate is so wrong that MySQL will instead choose another
index to solve the query.

Fragmentation of indexes should not happen for just a single delete of
one row. If this is the case then there is some unknown factor
involved and we would need to get a test case to be able to find this
out.

To know what is happening in this particular case I would need the
explain from tables before the delete is done (when the query was fast)
and the explain after the delete was done (and the query is slow).

You can the the information what exactly happened during a query by doing:

mysql> FLUSH STATUS;
mysql> SELECT ...;
<A short version of the output from SELECT,
including the time taken to run the query>
mysql> SHOW STATUS;
<output from SHOW STATUS>

(The above assumes you are the only one using the system as the STATUS
variables are global).

The best way to know what is happening is of course if we could get a
copy of the problem tables when the query is slow so that we can test
this yourself. (Feel free to reset any columns that are not part of
the query with NULL or ""). Upload the tables, with a description of
the problem, to ftp://support.mysql.com/pub/mysql/secret.

We presently found out that 4.0.16 has a bug (4.0.15 was ok) that in
some cases makes table scans when it doesn't have to. This will be
fixed in 4.0.17 that will be out in a few days.

Regads,
Monty

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".