MySQL Bugs: #10442: Server crashing bug on replication slaves

Bug #10442	Server crashing bug on replication slaves
Submitted:	7 May 2005 23:22	Modified:	28 Aug 2006 20:16
Reporter:	Beat Vontobel (Silver Quality Contributor) (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	mysql-standard-4.1.11-pc-linux-gnu-i686	OS:	Linux (Linux 2.4.21 (SuSE Professional))
Assigned to:	Elliot Murphy	CPU Architecture:	Any

Description:
I know this is probably gonna be a hard one to reproduce and fix, but I still thought I'd post a bug report after losing both my replication slaves tonight at the same time due to the same segfault. The slaves run on different hardware and different linux kernels, which probably rules out hardware/os bugs. As this is a production system, I was mainly concerned with getting it up and running again as fast as possible, but still tried to gather as much information as possible for the bug report.

Resolved stack on both of the servers (mysql2 and mysql3) was exactly the same:
0x808ae43 handle_segfault + 423
0x82def78 pthread_sighandler + 184
0x80b6a03 setup_tables__FP13st_table_list + 35
0x80bae1f prepare__4JOINPPP4ItemP13st_table_listUiP4ItemUiP8st_orderT6T4T6P13st_select_lexP18st_select_lex_unit + 127
0x812964f prepare__18st_select_lex_unitP3THDP13select_resultUlPCc + 1063
0x812abc5 mysql_derived__FP3THDP6st_lexP18st_select_lex_unitP13st_table_list + 121
0x812ab0b mysql_handle_derived__FP6st_lex + 75
0x80d6172 mysql_multi_update_lock__FP3THDP13st_table_listPt4List1Z4ItemP13st_select_lex + 174
0x80a0f01 check_multi_update_lock__FP3THDP13st_table_listPt4List1Z4ItemP13st_select_lex + 181
0x809a8c9 mysql_execute_command__FP3THD + 181
0x809f3ff mysql_parse__FP3THDPcUi + 207
0x80dea5d exec_event__15Query_log_eventP17st_relay_log_info + 401
0x81246b9 exec_relay_log_event__FP3THDP17st_relay_log_info + 625
0x812541d handle_slave_sql + 785
0x82dc72c pthread_start_thread + 220
0x83060ba thread_start + 4

After the crash the servers both got in an endless loop, being restarted and crashing again all the time. I will attach this part of the mysqld logs.

Some facts:
* All machines (master and slaves) run mysql-standard-4.1.11-pc-linux-gnu-i686
* The master was not affected and is still running perfectly
* Uptime of mysqld on both slaves was about one to three weeks before the crash (not exactly the same uptime on both of the slaves)
* The exact same query that probably crashed the slaves has been run many times before (executed on a daily basis since last year)
* This query from the relay log executed fine when fed manually via mysql command line client to the servers after deleting master.info and restarting the servers
* I could get both slaves running again after deleting master.info and a new CHANGE MASTER TO one query after the one that seemed to crash the slaves
* I did a table check on all the tables after the crash and didn't find any errors
* mysql2 (runs on dedicated hardware): Linux version 2.4.21-231-default (root@i386.suse.de) (gcc version 3.2.2) #1 Mon Jun 28 15:00:49 UTC 2004
* mysql3 (runs on VMWare): Linux version 2.4.21-273-smp4G (root@i386.suse.de) (gcc version 3.3.1 (SuSE Linux)) #1 SMP Mon Jan 17 13:19:07 UTC 2005

The query that seemed to crash the servers (last to be executed from the relay logs):
# at 260193316
#050507 19:00:01 server id 101  log_pos 260193316       Query   thread_id=2662841       exec_time=0     error_code=0
use MSSQL;
SET TIMESTAMP=1115492401;
/*
 *  immissionen2mssql.sql
 *  MeteoNews GmbH, Beat Vontobel, 2004-11-08
 *
 *  Kopiert die aktuellsten Immissions-Daten von met_data.nabel
 *  nach MSSQL.immissionen, wird täglich um 19 UTC aufgerufen.
 *
 */

UPDATE  MSSQL.immissionen,
        (       SELECT          nabel.nabel_id AS id,
                                ROUND(IFNULL(nabel.O3, -99)) AS o3,
                                ROUND(IFNULL(nabel.NO2, -99)) AS no2,
                                ROUND(IFNULL(nabel.SO2, -99)) AS so2,
                                NOW()+0 AS bearbeitet
                FROM            met_data.nabel
                WHERE           nabel.date = ( SELECT MAX(date) FROM met_data.nabel)
        ) AS t
SET     immissionen.o3 = t.o3,
        immissionen.no2 = t.no2,
        immissionen.so2 = t.so2,
        immissionen.bearbeitet = t.bearbeitet
WHERE   immissionen.id = t.id;

I will run this query again on the master tomorrow morning after the slaves will have catched up and see what happens.

How to repeat:
Difficult, as this happened in a production system with some tens to hundreds of queries per second on the master and a little bit less on the slaves. The query that possibly crashed the slaves normally runs without any problems. So I don't really have a neat test case. If it should happen again I will further investigate on that and probably post all the data from the affected tables.

Suggested fix:
none

mysqld.log extract from one of the servers (mysql3)

Attachment: mysql3d.log.txt (text/plain), 12.98 KiB.

Okay, it happened again. On exactly the same query (see above) both my slaves crashed tonight at exactly the same time.

What I verified: It's really the query above crashing the servers (I set the slaves again to the same query using CHANGE MASTER TO and they crashed again, but not if I skipped that query.). The query only crashes the servers if fed to them via replication, NOT if executed directly:

mysql> UPDATE  MSSQL.immissionen,
    ->         (       SELECT          nabel.nabel_id AS id,
    ->                                 ROUND(IFNULL(nabel.O3, -99)) AS o3,
    ->                                 ROUND(IFNULL(nabel.NO2, -99)) AS no2,
    ->                                 ROUND(IFNULL(nabel.SO2, -99)) AS so2,
    ->                                 NOW()+0 AS bearbeitet
    ->                 FROM            met_data.nabel
    ->                 WHERE           nabel.date = ( SELECT MAX(date) FROM met_data.nabel)
    ->         ) AS t
    -> SET     immissionen.o3 = t.o3,
    ->         immissionen.no2 = t.no2,
    ->         immissionen.so2 = t.so2,
    ->         immissionen.bearbeitet = t.bearbeitet
    -> WHERE   immissionen.id = t.id;
Query OK, 16 rows affected (0.06 sec)
Rows matched: 16  Changed: 16  Warnings: 0

Both tables involved seem to be fine (just to prove):

mysql3:/var/lib/mysql # myisamchk -eT MSSQL/immissionen 
Checking MyISAM file: MSSQL/immissionen
Data records:      16   Deleted blocks:       0
- check file-size
- check record delete-chain
- check key delete-chain
- check index reference
- check data record references index: 1
- check records and index references
mysql3:/var/lib/mysql # myisamchk -eT met_data/nabel
Checking MyISAM file: met_data/nabel
Data records:    4896   Deleted blocks:       0
- check file-size
- check record delete-chain
- check key delete-chain
- check index reference
- check data record references index: 1
- check records and index references

I'll have to set up a dedicated test environment with one master and one slave tomorrow with just the two tables involved and only this single query running to see if I can reproduce this.

I finally found some time to reproduce this in a testing environment: I just set up two clean new machines with mysql-standard-4.1.11 and used the default tables (with database 'test') and my-small.cnf for these tests (just uncommented 'log-bin' on the master and set 'server-id = 2' on the slave, no more changes).

On master:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%';
Query OK, 0 rows affected (0.01 sec)

On slave:

mysql> CHANGE MASTER TO MASTER_HOST='master', MASTER_USER='root', MASTER_LOG_FILE='mysql4-bin.000001';
Query OK, 0 rows affected (0.00 sec)

mysql> START SLAVE;          
Query OK, 0 rows affected (0.00 sec)

On master:
mysql> use test 
Database changed
mysql> source ~/crash_dump.sql
Query OK, 16 rows affected (0.01 sec)
Records: 16  Duplicates: 0  Warnings: 0

Query OK, 17 rows affected (0.01 sec)
Records: 17  Duplicates: 0  Warnings: 0

mysql> source ~/crash_query.sql
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

At this point the slave will crash, verify on slave:

mysql> SHOW SLAVE STATUS\G
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2013 (HY000): Lost connection to MySQL server during query
ERROR: 
Can't connect to the server

You can find the sample data set and the crashing query used in this example as attachments.

Tables used for test case

Attachment: crash_dump.sql (application/octet-stream, text), 4.34 KiB.

Query used to crash all slaves

Attachment: crash_query.sql (application/octet-stream, text), 640 bytes.

Hi,

Thank you for the bug report!
I was able reproduce it with latest 4.1 bk tree.

(gdb) bt
#0  setup_tables (tables=0x8d46458) at mysql_priv.h:1211
#1  0x08197a3b in JOIN::prepare (this=0x8d4c330, rref_pointer_array=0x8d459c4, tables_init=0x8d46458, wild_num=0, conds_init=0x8d469b0, og_num=0,
    order_init=0x0, group_init=0x0, having_init=0x0, proc_param_init=0x0, select_lex_arg=0x8d458c0, unit_arg=0x8d459f0) at sql_select.cc:275
#2  0x0823cbcf in st_select_lex_unit::prepare (this=0x8d459f0, thd_arg=0x8d14b38, sel_result=0x8d470b8, additional_options=0,
    tmp_table_alias=0x8d46a60 "t") at sql_union.cc:234
#3  0x0823e3b3 in mysql_derived (thd=0x8d14b38, lex=0x8d14b74, unit=0x8d459f0, org_table_list=0x8d46a88) at sql_derived.cc:126
#4  0x0823e2bd in mysql_handle_derived (lex=0x8d14b74) at sql_derived.cc:59
#5  0x081b8e08 in mysql_multi_update_lock (thd=0x8d14b38, table_list=0x8d45860, fields=0x8d14cd0, select_lex=0x8d14c68) at sql_update.cc:510
#6  0x0817abd1 in check_multi_update_lock (thd=0x8d14b38, tables=0x8d45860, fields=0x8d14cd0, select_lex=0x8d14c68) at sql_parse.cc:5174
#7  0x08173cc1 in mysql_execute_command (thd=0x8d14b38) at sql_parse.cc:1952
#8  0x08178f02 in mysql_parse (thd=0x8d14b38,
    inBuf=0x8d0885d "UPDATE  immissionen,\n        (       SELECT          nabel.nabel_id AS id,\n", ' ' <repeats 32 times>, "ROUND(IFNULL(nabel.O3, -99)) AS o3,\n", ' ' <repeats 32 times>, "ROUND(IFNULL(nabel.NO2, -"..., length=639) at sql_parse.cc:4209
#9  0x081c47da in Query_log_event::exec_event (this=0x8d16440, rli=0x8d29cc4) at log_event.cc:1024
#10 0x0823596f in exec_relay_log_event (thd=0x8d14b38, rli=0x8d29cc4) at slave.cc:2960
#11 0x08236db5 in handle_slave_sql (arg=0x8d23a68) at slave.cc:3472
#12 0xb7e41f1b in pthread_start_thread () from /lib/libpthread.so.0
#13 0xb7e41f9f in pthread_start_thread_event () from /lib/libpthread.so.0
#14 0xb7d70fda in clone () from /lib/libc.so.6

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/26123

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/26262

Fix has been pushed into 4.1 tree (tests will be merged to 5.0 soon).

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented in 4.1.13 Change History; marked as Closed.

This bug did not exist in 5.0, but the regression tests for this bug fix have been pushed into the 5.0 tree (will be a part of 5.0.8).

Thanks for the bugfix!

Maybe this exact bug doesn't exist in 5.0.x - but I just posted a very similar one in 5.0.x one week ago (replication slaves crashing on certain queries). See #11354. Unfortunately other than with this bug, I wasn't yet able to find a test case that can easily reproduce that (if I just execute the exact same query that crashed the server under full load again in a clean test environment the problem doesn't appear). Maybe you can give me some hints on how to succeed with that...

Since the fix is already documented for 4.1.x and the fixed bug is not an issue in 5.0, I have closed the bug.

As this is can be used as DoS, it has been assigned a CVE id, please mention it
in the changelog: CVE-2006-4380

Added CVE number to 4.1.13 changelog entry. Thanks.