Bug #31459 Random crashes (stack trace inside)
Submitted: 8 Oct 2007 17:19 Modified: 13 Sep 2008 7:52
Reporter: Samuel Vogel Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: General Severity:S1 (Critical)
Version:5.0.45 OS:Linux (2.6.18-4-686-bigmem)
Assigned to: CPU Architecture:Any

[8 Oct 2007 17:19] Samuel Vogel
Description:
Hey guys,

I have noticed that MySQL is crashing on me rather randomly. We are running the biggest German Ad-Free Freehosting service, so this problem is really bad for us.

I have started a Forum thread as well (this was before I was able to get debug output): http://forums.mysql.com/read.php?11,176500,176500

I will provide you with some infos about our Setup:
We have 2 almost identical Servers, one of them is a replication slave for a differen server. The problem only occurs on the replication slave, so it could be related to the replication.
The connections between our servers are SSL protected and so is the replication connection.
We updated from 5.0.32 to 5.0.24, but the problem is still there.

I have started mysqld like this:
/usr/sbin/mysqld --basedir=/usr --datadir=/data/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-external-locking --port=3306 --socket=/var/run/mysqld/mysqld.sock --debug=d,info,error,query,general,where:O,/data/test/mysqld.trace --log=/data/test/mysql.log --myisam-recover &> mysqlss &

This is the output of mysqld to std*:
071008 18:41:21 [Warning] The syntax for replication startup options is deprecated and will be removed in MySQL 5.2. Please use 'CHANGE MASTER' instead.
/usr/sbin/mysqld: File '/data/test/mysql.log' not found (Errcode: 13)
071008 18:41:22 [ERROR] Could not use /data/test/mysql.log for logging (error 13). Turning logging off for the whole duration of the MySQL server process. To turn it on again: fix the cause, shutdown the MySQL server and restart it.
071008 18:41:22  InnoDB: Started; log sequence number 0 1360922872
071008 18:41:22 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=/var/run/mysqld/mysqld-relay-bin' to avoid this problem.
071008 18:41:22 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.0.45-Debian_1~bpo.1-debug-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Debian etch distribution
071008 18:41:22 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000020' at position 101970, relay log '/var/run/mysqld/mysqld-relay-bin.000003' position: 2538
mysqld: mf_iocache.c:1252: _my_b_seq_read: Assertion `pos_in_file == info->end_of_file' failed.
071008 18:41:22 - mysqld got signal 6;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=536870912
read_buffer_size=131072
max_used_connections=1
max_connections=1000
threads_connected=1
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 2700280 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x94beea0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x9ee76d18, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x81df50d
0xb7f4d410
0xb7ca6fb9
0xb7c9efbf
0x8533a05
0x82827b1
0x8328b82
0x8329481
0x832a0d3
0xb7f0d240
0xb7d484ae
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do 
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at (nil)  is invalid pointer
thd->thread_id=3
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

The resolved stack trace:
# resolve_stack_dump -s mysqld.sym -n mysqld.stack 
0x81df50d handle_segfault + 797
0xb7f4d410 _end + -1351227584
0xb7ca6fb9 _end + -1354005783
0xb7c9efbf _end + -1354038545
0x8533a05 _my_b_seq_read + 1045
0x82827b1 _ZN9Log_event14read_log_eventEP11st_io_cacheP15st_safe_mutex_tPK28Format_description_log_event + 401
0x8328b82 _Z10next_eventP17st_relay_log_info + 514
0x8329481 _Z20exec_relay_log_eventP3THDP17st_relay_log_info + 81
0x832a0d3 handle_slave_sql + 1475
0xb7f0d240 _end + -1351490192
0xb7d484ae _end + -1353345058

The trace that mysqld with --debug reported can be found here:
http://mysql.pastebin.com/m484ca24d

I do apologize if this is not enough information and that I do not have a reproductable test case.
If anything else is needed, please tell me so.

Regards,
Samy

How to repeat:
Sorry, it happens randomly!
[8 Oct 2007 20:25] Samuel Vogel
There is a typo in my submission. We did not upgrade to 5.0.24, but to 5.0.45 of course.

Something that I do notice, is that the crashed do not occur at all when I have mysqld run with "--debug".
Also it seems to me, that the slave startup on mysqld start does not work reliably. One time I start mysqld the slave starts fine, the other time it complains about not finding the relay_log. Strange.
[10 Oct 2007 9:55] Samuel Vogel
I tried something different:
This Server was acting as a slave before, so I removed all the Slave stuff from it. Also mydns DNS Server was running, but i relocated it to a different server.
Unfortunately I still get the same behavior:

# grep ' mysqld' /var/log/syslog
Oct 10 07:06:29 h1314631 mysqld[29908]: 071010  7:06:29 [ERROR] /usr/sbin/mysqld: Table './blackcrawer@1-db/nuke_nsnst_config' is marked as crashed and last (automatic?) repair failed
Oct 10 07:06:29 h1314631 mysqld[29908]: 071010  7:06:29 [ERROR] /usr/sbin/mysqld: Table './blackcrawer@1-db/nuke_nsnst_config' is marked as crashed and last (automatic?) repair failed
Oct 10 07:06:34 h1314631 mysqld[29908]: 071010  7:06:34 [ERROR] /usr/sbin/mysqld: Table './blackcrawer@1-db/nuke_nsnst_config' is marked as crashed and last (automatic?) repair failed
Oct 10 07:06:34 h1314631 mysqld[29908]: 071010  7:06:34 [ERROR] /usr/sbin/mysqld: Table './blackcrawer@1-db/nuke_nsnst_config' is marked as crashed and last (automatic?) repair failed
Oct 10 11:06:51 h1314631 mysqld_safe[13279]: Number of processes running now: 0
Oct 10 11:06:51 h1314631 mysqld_safe[13281]: restarted
Oct 10 11:06:51 h1314631 mysqld[13284]: 071010 11:06:51 [Warning] The syntax for replication startup options is deprecated and will be removed in MySQL 5.2. Please use 'CHANGE MASTER' instead.
Oct 10 11:06:51 h1314631 mysqld[13284]: 071010 11:06:51  InnoDB: Database was not shut down normally!
Oct 10 11:06:51 h1314631 mysqld[13284]: InnoDB: Starting crash recovery.
Oct 10 11:06:51 h1314631 mysqld[13284]: InnoDB: Reading tablespace information from the .ibd files...
Oct 10 11:16:13 h1314631 mysqld[13284]: InnoDB: Restoring possible half-written data pages from the doublewrite
Oct 10 11:16:13 h1314631 mysqld[13284]: InnoDB: buffer...
Oct 10 11:16:14 h1314631 mysqld[13284]: 071010 11:16:14  InnoDB: Starting log scan based on checkpoint at
Oct 10 11:16:14 h1314631 mysqld[13284]: InnoDB: log sequence number 0 1371471155.
Oct 10 11:16:14 h1314631 mysqld[13284]: InnoDB: Doing recovery: scanned up to log sequence number 0 1371471155
Oct 10 11:16:15 h1314631 mysqld[13284]: 071010 11:16:15  InnoDB: Started; log sequence number 0 1371471155
Oct 10 11:16:16 h1314631 mysqld[13284]: 071010 11:16:16 [Note] /usr/sbin/mysqld: ready for connections.
Oct 10 11:16:16 h1314631 mysqld[13284]: Version: '5.0.45-Debian_1~bpo.1-debug'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Debian etch distribution

The issue doesn't appear at all, when I run with --debug.
What am I supposed to do?
[10 Oct 2007 19:13] Samuel Vogel
My assumption about the problem not appearing with "--debug" was wrong. It just crashed on me two times with debug. I gzipped the debug trace from the last output and it is available here:
http://alpha.kilu.de/mysqld.trace.gz

This is how MySQLd was started:
/usr/sbin/mysqld --basedir=/usr --datadir=/data/mysql --user=mysql --pid-file=/var/run/mysqld/mysqld.pid --skip-external-locking --port=3306 --socket=/var/run/mysqld/mysqld.sock --debug=d,info,error,query,general,where:O,/data/mysqld.trace --myisam-recover

I really hope somebody will acknowledge this problem!
[10 Oct 2007 19:14] Samuel Vogel
Changed the name of the Bug.
[10 Oct 2007 20:26] Samuel Vogel
Now I got a Signal 6 again:

Crash info:
http://alpha.kilu.de/mysqlss.gz

Resolved Stack Trace:
0x81df50d handle_segfault + 797
0xb7fb6410 _end + -1350797504
0xb7d0ffb9 _end + -1353575703
0x854c807 safe_mutex_lock + 343
0x8271db0 _Z18mysql_print_statusv + 512
0x81dcf63 signal_hand + 595
0xb7f76240 _end + -1351060112
0xb7db14ae _end + -1352914978

The debug output:
http://alpha.kilu.de/mysqld.trace.gz

If any more infos are needed, please say so.

Regards,
Samy
[11 Oct 2007 11:07] Samuel Vogel
I did compile MySQL Enterprise Ediditon 5.0.46. But the crashes do still appear, but I do notice that the InnoDB Recovery is much faster with the Enterprise edidition.
As you can see in my logs it used to take about 10 minutes and the enterprise edition does it in approx. 3 minutes.

Does anybody here have any suggestions for me?
[12 Oct 2007 23:44] Samuel Vogel
I can actually somewhat reproduce the crashes now.
When I run mysqldump, it crashes about one minute after mysqldump is started. It does this every single time.

So it seems, that the problem may somehow be load related, even thou we have 2am right now, and not very man people are online at this time!

I would really appreciate it, if some dev would show up here ;)
[28 Oct 2007 13:29] Valeriy Kravchuk
Thank you for a detailed problem report. Please, send the results of:

SHOW GLOBAL STATUS;
SHOW GLOBAL VARIABLES;

from your server, and the results of the following Linux commands:

uname -a
free
[28 Oct 2007 16:36] Samuel Vogel
Hey,

Thanks for the reply, I will submit the requested information.
I have to admin thou, that the problem does not appear hardly ad all anymore, after I checked the slow-query-log and eliminated the user, that was causing almost all of the slow querys by not using indexes and thereby causing MySQL to go up to almost 100% CPU usage. He was using all MyISAM tables, if that helps.

The output of the MySQL commands can be found in the files.

# uname -a
Linux h1314631 2.6.18-4-686-bigmem #1 SMP Thu May 10 00:23:00 UTC 2007 i686 GNU/Linux

# free -m
             total       used       free     shared    buffers     cached
Mem:          4058       3832        225          0          5       1979
-/+ buffers/cache:       1847       2210
Swap:         2055        155       1900

The Server does not ever go under the line of about 130M of free Ram. Not even during the peak times. It seems like this is some kind of magic barrier.

I guess this problem won't be an easy one to solve, because I do not have a reproductible test case.

Regards,
Samy
[28 Oct 2007 16:37] Samuel Vogel
output of the requested MySQL commands

Attachment: mysql output.txt (text/plain), 30.98 KiB.

[24 Jan 2008 12:26] Sergey Kostyliov
Samuel, could you please say - did you use `replicate-rewrite-db' option in your first "slave" setup?

To me your symptoms for a first setup looks like our http://bugs.mysql.com/bug.php?id=30056
[24 Jan 2008 15:31] Samuel Vogel
No, I did not use "replicate-rewrite-db" in my setup.
[17 Feb 2008 15:45] Valeriy Kravchuk
Please, try to repeat crashes with a newer version, 5.0.51a at least, an inform about the results.
[21 Feb 2008 18:17] Samuel Vogel
Unfortunately I do not work for the same project anymore.
We did trace down the problem thou: One of our users was running an ancestor research program which was doing exstensive un-indexed joins. We shut him down and the problem dissapeared almost.
Against the remaining crashes (once a day or so) we could not find a solution.
[11 Aug 2008 17:03] David Carr
Not sure whether this helps but the error starts to occur when the text gets to 2056 characters long (or thereabouts)

I don't want to confuse the issue but there's definitely something very strange occurring because I can't recreate this behaviour on my old schema exactly. Its hard to say exactly when it works and doesn't but it seems to depend on which client I create the procedure in and the character encoding of the schema, database and client. Sorry that I can't pin it down any more - let me know if you need anything else. The previous example still holds for me though.
[13 Aug 2008 7:52] Sveta Smirnova
David, thank you for the feedback.

> Not sure whether this helps but the error starts to occur when the text gets
> to 2056 characters long (or thereabouts)

Do you have table definition and query error happens with? You say "t seems to depend on which client I create the procedure in and the character encoding of the schema, database and client." Do you have statistic which clients (versions in first turn) and character encoding do  you use? Also please try current version 5.0.67 and check if problem still exists.
[13 Sep 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".