Bug #18505 Replication slave crashes with signal 11 using subqueries
Submitted: 25 Mar 2006 1:40 Modified: 28 Apr 2006 11:04
Reporter: Brian Wright
Status: No Feedback
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:4.1.12 OS:Linux (Mandrakelinux release 10.1)
Assigned to: CPU Architecture:Any

[25 Mar 2006 1:40] Brian Wright
Description:
Here is everything from the binlog that leads up to the crash:

#060324 12:28:24 server id 1  log_pos 1046500738        Query   thread_id=246912723     exec_time=0     error_code=0
use database;
SET TIMESTAMP=1143232104;
SET ONE_SHOT CHARACTER_SET_CLIENT=33,COLLATION_CONNECTION=33,COLLATION_DATABASE=8,COLLATION_SERVER=8;
# at 1046500975
#060324 12:28:24 server id 1  log_pos 1046500887        Query   thread_id=246912723     exec_time=0     error_code=0
SET TIMESTAMP=1143232104;
update jr_00100713 j,
(select _receiver_email
from (
select count(_roster_ordinal) as c,
_receiver_email
from jr_00100713 jr
group by _receiver_email
) s1
where s1.c>1) s2
set j._email_status="Duplicate"
where j._receiver_email=s2._receiver_email;

I believe the query above is what causes the slave to crash. I didn't write this query and so don't really know whether it's appropriately written. Signal 11 (segmentation violation) usually indicates a buffer overflow error, but the query isn't overly long. It doesn't crash the master when run; it only crashes the slave, which reads it over the network from the binlog.
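For comparison, the same update can be expressed with a single derived table and a HAVING clause instead of two nested derived tables. This is only a sketch of an equivalent formulation using the table and column names from the binlog excerpt; whether it avoids the crash on 4.1.12 has not been tested here.

```sql
-- Mark every row whose _receiver_email occurs more than once.
-- One derived table (s2) replaces the nested s1/s2 pair; the
-- HAVING clause does the work of the outer "where s1.c>1".
UPDATE jr_00100713 j,
       (SELECT _receiver_email
          FROM jr_00100713
         GROUP BY _receiver_email
        HAVING COUNT(_roster_ordinal) > 1) s2
   SET j._email_status = 'Duplicate'
 WHERE j._receiver_email = s2._receiver_email;
```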

See the stack dump below:

> resolve_stack_dump -s /tmp/mysqld.sym -n mysqld.stack
0x8132ad0 handle_segfault + 592
0xffffe420 _end + -138829652
(nil)
0x81f5eb4 _ZN18st_select_lex_unit7prepareEP3THDP13select_resultmPKc + 1636
0x81f7a50 _Z20mysql_handle_derivedP6st_lex + 208
0x818bf08 _Z23mysql_multi_update_lockP3THDP13st_table_listP4ListI4ItemEP13st_select_lex + 216
0x8146c98 _Z21mysql_execute_commandP3THD + 1400
0x814e331 _Z11mysql_parseP3THDPcj + 305
0x81968d9 _ZN15Query_log_event10exec_eventEP17st_relay_log_info + 649
0x81f240f handle_slave_sql + 1375
0x4004eb3c _end + 935241672
0x4025b93a _end + 937391558

Every time I restart the replicator, it crashes at exactly the same place in the binlog. I suspect this query because it's the first query that appears after the start position set in relay-log.info and master.info, and the slave crashes almost immediately after running it.

How to repeat:
I can give you the relevant portions of the replication binlog, master.info and relay-log.info to test against... or possibly, you can create the same query in a master/slave environment and see if the slave dies after the query is sent via replication.

We apparently need this query going forward, and we can't have replication die. Right now this is simply a test query, and I can skip it and get replication moving again, but we will need to use subqueries such as this in the future.
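Skipping the statement, as mentioned above, is normally done on the slave with the SQL_SLAVE_SKIP_COUNTER variable. A minimal sketch, assuming the slave SQL thread is positioned at the offending event; the counter value of 1 is an assumption and may need to be larger if other events belonging to the same statement have to be skipped with it.

```sql
-- Run on the slave only.
STOP SLAVE;
-- Skip the next event from the master (value 1 is an assumption).
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
-- Check that both replication threads are running again.
SHOW SLAVE STATUS;
```

Note that the skipped update is then never applied on the slave, so the affected rows in jr_00100713 will differ from the master until they are corrected by hand.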

Suggested fix:
No idea...
[27 Mar 2006 16:35] Valeriy Kravchuk
Thank you for the problem report. Please try to repeat it with a newer version of MySQL server, 4.1.18, as the slave. Does it still crash?
[28 Mar 2006 0:14] Brian Wright
I will try version 4.1.18 of MySQL on the replicator. I've always tended to have problems when the slave is a different version from the master. Will there be any problems with mismatched versions?
[28 Mar 2006 11:04] Valeriy Kravchuk
According to the manual (http://dev.mysql.com/doc/refman/4.1/en/replication-compatibility.html, for example) and experience, there should be no problems when master and slave are of the same major version (4.1 in your case) and the slave is newer than the master. So please try it and let us know the results.
[28 Apr 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".