Bug #11354 Crashing 5.0 replication slaves
Submitted: 15 Jun 2005 14:38   Modified: 15 Sep 2005 1:08
Reporter: Beat Vontobel (Silver Quality Contributor) (OCA)
Status: Duplicate
Category: MySQL Server: Replication   Severity: S1 (Critical)
Version: 5.0.11 Beta   OS: Linux (SuSE Linux Professional (2.4.21))
Assigned to: Assigned Account   CPU Architecture: Any

[15 Jun 2005 14:38] Beat Vontobel
Description:
To do some real-world testing of MySQL 5.0, I'd like to hook up a 5.0.6 slave to our production environment (a 4.1.12 master right now). But what works fine in a controlled testing environment fails in an instant in the production system: the 5.0.6 slave never survives for more than a few seconds with our average of 30 replicated queries per second.

See error log with resolved stack trace:

mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=134217728
read_buffer_size=4190208
max_used_connections=2
max_connections=200
threads_connected=1
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 1768670 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x8a000c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xbfe3e7a8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x8099672 handle_segfault + 430
0x83409c8 pthread_sighandler + 184
0x835cbd5 chunk_alloc + 373
0x835c911 malloc + 209
0x830b66e my_malloc + 30
0x830bc90 alloc_root + 136
0x8077ce4 sql_alloc__FUi + 36
0x80c6b1b yyparse__FPv + 80275
0x80b0444 mysql_parse__FP3THDPcUi + 68
0x8101076 exec_event__15Query_log_eventP17st_relay_log_infoPCcUi + 834
0x8100d2f exec_event__15Query_log_eventP17st_relay_log_info + 27
0x815989a exec_relay_log_event__FP3THDP17st_relay_log_info + 682
0x815a65b handle_slave_sql + 787
0x833e17c pthread_start_thread + 220
0x8367b8a thread_start + 4
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved stack trace is much more helpful in diagnosing the problem, so please do resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x4ce3fa12  is invalid pointer
thd->thread_id=9
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

How to repeat:
As the crashing slave server doesn't leave many clues about which query actually made it crash (the master.info file doesn't seem to be updated when the crash happens), it was pretty difficult to find the suspects. Although I think I got them, even after two days of testing I was _not_ able to reproduce this with just a single query and a data set that would be convenient to attach to this bug report (our database is about 9 GB). I definitely need your help on how to make a neat test case out of this.
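One way to narrow down the suspect statements is to dump the region of the master's binary log around the slave's last recorded position with mysqlbinlog. This is a sketch only; the log file name and the byte positions below are placeholders, not values taken from this report:

```shell
# Dump binary log events between two byte offsets so the statements
# around the crash point can be inspected as plain SQL.
# File name and positions are hypothetical placeholders.
mysqlbinlog --start-position=4 --stop-position=500000 \
    master-bin.000042 > suspect-events.sql
```

The resulting file can then be replayed piecewise against a test slave to bisect toward the crashing statement.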

Suggested fix:
none
[16 Jun 2005 8:06] Beat Vontobel
Just tried the brand new 5.0.7 build this morning: same effect, the slave crashes as soon as it starts replicating from the production server.
[21 Jul 2005 1:54] MySQL Verification Team
Could you please provide the my.cnf files for master and slave so I can set up an initial test on my side, and, if possible, the database schema with some limited data that permits running the queries which provoke the slave crash.

Thanks in advance.
[21 Jul 2005 19:23] Beat Vontobel
Thanks Miguel, I'll attach the complete schema.sql and the my.cnf for the master. On the slave I tried the same file as on the master as well as the default my-medium.cnf, with the same crashing result (I just changed server-id for the slave, of course).

But these files probably won't be of any use to you as long as I cannot tell you which single queries crash the slave. And that's my actual problem! I ran hundreds of test queries in my testing environment and never got the slave to crash. But as soon as I hook it up to the production master server (4.1.x) under heavy load, the 5.0.x slave crashes after a few seconds to minutes.

What I do: set up a clean 5.0.x server and install our latest database backup (~9 GB). The 5.0.x server runs perfectly with this data. Then I set it up as a slave (CHANGE MASTER TO the position of the backup) and start replication (at this point there are probably already some 500 MB of binary log for it to catch up on from the master). While catching up to the master, the slave crashes.
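The setup described above can be sketched as follows. All coordinates are placeholders (host, user, password, log file, and position would come from the backup), not values from this report:

```sql
-- On the freshly restored 5.0.x slave; every value below is a
-- hypothetical placeholder for the coordinates saved with the backup.
CHANGE MASTER TO
    MASTER_HOST     = 'master.example.com',
    MASTER_USER     = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'master-bin.000042',
    MASTER_LOG_POS  = 4;
START SLAVE;
-- Then monitor how far the slave gets before crashing:
SHOW SLAVE STATUS;
```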

My problem is: all the data that's left after the crash doesn't give me enough clues to determine the exact query in the ~500 MB binary log that made the slave crash (I hope there actually is one query, and it's not a combination of queries that is the reason).

As you can see, the crash report above says "thd->query at 0x4ce3fa12 is invalid pointer", and the info in the master.info file doesn't seem to be accurate after the crash (I still tried all the queries around the latest position and none of them crashed the server).

How can I find out where exactly the slave crashed? How can I reproduce this in a small testing environment? I just need a hint from you, and I'd love to provide you with a really neat and small test case, as I did with my latest bug reports...
[21 Jul 2005 19:37] MySQL Verification Team
Sorry,

We need a fully repeatable test case.

In order to catch the query that produces the crash, start the general query log on the slave.
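For reference, in MySQL 5.0 the general query log is enabled at server startup via my.cnf. This is a config fragment only; the log path is an arbitrary example:

```ini
# In the slave's my.cnf; the path is a hypothetical example.
[mysqld]
log = /var/log/mysql/slave-general.log
```

With this set, every statement the SQL thread executes is written to the log before execution, so the last logged statement points at the crashing query.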
[17 Aug 2005 7:28] Beat Vontobel
Listen, I really spent hours on this during the last weeks: I just can't find a small test case. Of course I activated binary logging on the slave with log-slave-updates set. Then the last query written to the binary log of the slave should be the one right before the crashing query, that's what I thought, at least... But actually the log stops every time just in front of a different query. The only common thing among all these crashing queries that I can see is that they're all of the type "REPLACE INTO ... SELECT ..." with some JOINs in the SELECT part.
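The suspect statements all share this general shape. This is a purely hypothetical illustration; the table and column names are invented and do not come from the database in this report:

```sql
-- Hypothetical shape of the suspect statements:
-- REPLACE INTO ... SELECT with a JOIN in the SELECT part.
REPLACE INTO order_totals (order_id, total)
SELECT o.id, SUM(i.amount)
FROM orders AS o
JOIN items  AS i ON i.order_id = o.id
GROUP BY o.id;
```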

So I executed these queries in a clean testing environment (same server version, same configuration, but no other data-manipulating queries on master and slave, and only the tables involved in the suspect queries plus the mysql system database) - and they just pass without crashing!

I must be missing something that's different between the production and testing environments, but I just can't find it (it must be on the 4.1.x master, as I use the exact same slave for testing against both the production and the testing master). Is it possible that the queries before and after the suspect queries somehow interact?

But the bug is definitely repeatable: I checked out _every_ build since 5.0.6 (5.0.11 this morning). The 5.0.x slaves _always_ crash after working through a few dozen MB of our binary log (generated by 4.1.x masters, always the latest build, right now 4.1.13).

I'm really desperate because I can't reproduce this in a small testing environment convenient for upload. It's a real time eater as well: it takes an hour every time to set up a testing environment with all our production data (backup, copy, and load the data into testing master and slave). What can I do? I'm sure you want to fix all the crashing bugs before you go from beta to production with 5.0.x. And I'm really keen to switch from 4.1.x to 5.0.x as soon as you declare it stable (and this bug is fixed)...
[11 Sep 2005 7:46] Beat Vontobel
This issue seems to be solved in 5.0.12! I just hooked up a 5.0.12 build as a slave against our 4.1.14 master and it has been up and running flawlessly for about 36 hours, while builds 5.0.11 and earlier still crash after a few seconds. Was this a duplicate of one of the other fixed crashing bugs, or was the problem solved "by accident"?
[11 Sep 2005 9:49] Valeriy Kravchuk
It may be related to http://bugs.mysql.com/bug.php?id=12532, which is fixed in 5.0.12, as mentioned in http://dev.mysql.com/doc/mysql/en/news-5-0-12.html.

Can we mark it as "not a bug", starting with version 5.0.12?
[11 Sep 2005 15:02] Beat Vontobel
#12532 is probably a candidate for a duplicate of this one, but I was just contacted directly by one of your developers (John David Duncan) to investigate this bug. He is checking with someone on the replication team to see if this bug really is a solved duplicate. I'll wait for their answer, and if we're sure that this one is really solved and shouldn't pop up again (and didn't just disappear in 5.0.12 by coincidence), I'll mark it as closed. Is this okay?
[14 Sep 2005 13:43] Lars Thalmann
Hi Beat,

Thank you very much for all your efforts in keeping us informed 
about this failure.

I think that this indeed is a duplicate of BUG#12532 and 
suggest we close this report as "Duplicate".
[14 Sep 2005 20:50] Beat Vontobel
Hi Lars,

I think that's fine, even if BUG#12532 was actually a duplicate of this one ;-) and not the other way round. Thanks for all your efforts at MySQL! I'm really happy to finally have a beta build that I can hook up to our production master, so I can really start checking out all the new features.

Beat