Bug #35226 RBR event crashes slave
Submitted: 11 Mar 2008 17:25 Modified: 18 Jun 2010 22:52
Reporter: Matthew Lord Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:5.1.23 OS:Any
Assigned to: Vasil Dimov CPU Architecture:Any
Tags: crash, innodb, RBR, READ-COMMITTED, row based replication, slave

[11 Mar 2008 17:25] Matthew Lord
Description:
A row based replication event crashes the slave server every time and requires
that the slave be rebuilt.

When tested with a snapshot of 5.1.24 (taken on 3/7/2008) we also encountered
the crash, here was the stack trace in the error log:
/app/mysql/bin/mysqld(print_stacktrace+0x1e)[0x74baee]
/app/mysql/bin/mysqld(handle_segfault+0x320)[0x615f20]
/lib64/libpthread.so.0[0x3a8e40de60]
/app/mysql/bin/mysqld(dict_str_starts_with_keyword+0x15)[0x7eb155]
/app/mysql/bin/mysqld(row_search_for_mysql+0x4b3)[0x8512b3]
/app/mysql/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x1cd)[0x7c39ad]
/app/mysql/bin/mysqld(_ZN11ha_innobase7rnd_posEPhS0_+0x9b)[0x7c364b]
/app/mysql/bin/mysqld(_ZN14Rows_log_event8find_rowEPK14Relay_log_info+0x2d6)[0x6ce086]
/app/mysql/bin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEPK14Relay_log_info+0x1e)[0x6ce19e]
/app/mysql/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x172)[0x6c9cd2]
/app/mysql/bin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14Relay_log_infob+0x11d)[0x73a95d]
/app/mysql/bin/mysqld(handle_slave_sql+0x687)[0x73fba7]
/lib64/libpthread.so.0[0x3a8e4062e7]
/lib64/libc.so.6(clone+0x6d)[0x3a8d8ce3bd]
 

How to repeat:
tar xzvf csc23899_bug.tar.gz
cd csc23899_bug/mysql-5.1.23-rc-linux-x86_64-glibc23
./bin/mysqld_safe --defaults-file=../my.cnf --basedir=. --datadir=../data &
./bin/mysql -u root --socket=/tmp/mysql-csc23899.sock
show slave status\G
start slave sql_thread;

(so far only tested on x86_64)

Suggested fix:
We should hopefully always be able to process the event or die gracefully.
[14 Mar 2008 16:00] Heikki Tuuri
The problem probably is that replication does not set thd's query string to a sensible value.

The code below is a hack: a better way to do this would be to check the SQL statement type in ha_innodb.cc, from the flags MySQL has.

row0sel.c in 5.1:

        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str && trx->mysql_thd) {

                /* Scan the MySQL query string; check if SELECT is the first
                word there */

                if (dict_str_starts_with_keyword(
                            trx->mysql_thd, *trx->mysql_query_str, "SELECT")) {
                        /* It is a plain locking SELECT and the isolation
                        level is low: do not lock gaps */

                        set_also_gap_locks = FALSE;
                }
        }
[14 Mar 2008 16:11] Heikki Tuuri
/*************************************************************************
Gets the InnoDB transaction handle for a MySQL handler object, creates
an InnoDB transaction struct if the corresponding MySQL thread struct still
lacks one. */
static
trx_t*
check_trx_exists(
/*=============*/
                        /* out: InnoDB transaction handle */
        THD*    thd)    /* in: user thread handle */
{
        trx_t*& trx = thd_to_trx(thd);

        ut_ad(thd == current_thd);

        if (trx == NULL) {
                DBUG_ASSERT(thd != NULL);
                trx = trx_allocate_for_mysql();

                trx->mysql_thd = thd;
                trx->mysql_query_str = thd_query(thd);

Ok, I guess MySQL does not set thd->query to a sensible value.

We know that in replication, SELECT is NOT processed. Thus we know the code in row0sel.c should skip the statement.

Probably the following simple change in row0sel.c fixes the crash:

Replace:
        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str && trx->mysql_thd) {

by:

        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str
            && *trx->mysql_query_str
            && trx->mysql_thd) {

I assume that *trx->mysql_query_str is NULL in the crash.
[26 Mar 2008 20:22] Vasil Dimov
Matthew, can you check whether the last suggestion by Heikki fixes this bug?

I.e. the following patch:

--- patch begins here ---
Index: row/row0sel.c
===================================================================
--- row/row0sel.c	(revision 2385)
+++ row/row0sel.c	(working copy)
@@ -3574,13 +3574,15 @@ shortcut_fails_too_big_rec:
 	}
 
 	trx_start_if_not_started(trx);
 
 	if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
 	    && prebuilt->select_lock_type != LOCK_NONE
-	    && trx->mysql_query_str && trx->mysql_thd) {
+	    && trx->mysql_query_str != NULL
+	    && *trx->mysql_query_str != NULL
+	    && trx->mysql_thd != NULL) {
 
 		/* Scan the MySQL query string; check if SELECT is the first
 		word there */
 
 		if (dict_str_starts_with_keyword(
 			    trx->mysql_thd, *trx->mysql_query_str, "SELECT")) {
--- patch end here ---

I will try to reproduce this tomorrow...
[28 Mar 2008 14:52] Daniel Fischer
Applied a snapshot containing a fix for this bug to 5.1-build.
[31 Mar 2008 14:53] Bugs System
Pushed into 5.1.24-rc
[2 Apr 2008 14:51] Vasil Dimov
I confirm that indeed this is the problem. I managed to crash a debug version of mysqld (without the patch in this bug report applied):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1141680480 (LWP 5752)]
0x000000000079bb13 in dict_accept (cs=0xbe9980, ptr=0x0, string=0x9c1b49 "SELECT", 
    success=0x440c92f0) at dict/dict0dict.c:2226
2226		while (my_isspace(cs, *ptr)) {
Current language:  auto; currently c
(gdb) 
(gdb) 
(gdb) bt
#0  0x000000000079bb13 in dict_accept (cs=0xbe9980, ptr=0x0, string=0x9c1b49 "SELECT", 
    success=0x440c92f0) at dict/dict0dict.c:2226
#1  0x000000000079da3d in dict_str_starts_with_keyword (mysql_thd=0x11a4d90, str=0x0, 
    keyword=0x9c1b49 "SELECT") at dict/dict0dict.c:3289
#2  0x0000000000800836 in row_search_for_mysql (buf=0x11b17d8 "\buri_path", mode=2, 
    prebuilt=0x2aaaaaba28b8, match_mode=1, direction=0) at row/row0sel.c:3566
#3  0x000000000077c2d4 in ha_innobase::index_read (this=0x11b15f0, 
    buf=0x11b17d8 "\buri_path", key_ptr=0x11b1968 "\b", key_len=34, 
    find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:4164
#4  0x000000000077bfc1 in ha_innobase::rnd_pos (this=0x11b15f0, 
    buf=0x11b17d8 "\buri_path", pos=0x11b1968 "\b") at handler/ha_innodb.cc:4578
#5  0x00000000006d66f2 in handler::rnd_pos_by_record (this=0x11b15f0, 
    record=0x11b17d8 "\buri_path") at ../../sql/handler.h:1434
#6  0x0000000000699d85 in Rows_log_event::find_row (this=0x11ae600, rli=0x1179930)
    at log_event.cc:8078
#7  0x000000000069a1f5 in Update_rows_log_event::do_exec_row (this=0x11ae600, 
    rli=0x1179930) at log_event.cc:8470
#8  0x000000000069d8e7 in Rows_log_event::do_apply_event (this=0x11ae600, rli=0x1179930)
    at log_event.cc:6650
#9  0x0000000000719f34 in Log_event::apply_event (this=0x11ae600, rli=0x1179930)
    at log_event.h:1011
#10 0x0000000000713693 in apply_event_and_update_pos (ev=0x11ae600, thd=0x11a4d90, 
    rli=0x1179930, skip=true) at slave.cc:1915
#11 0x0000000000713fb4 in exec_relay_log_event (thd=0x11a4d90, rli=0x1179930)
    at slave.cc:2057
#12 0x0000000000714555 in handle_slave_sql (arg=0x1178670) at slave.cc:2700
#13 0x00002ae636a0ef1a in start_thread () from /lib/libpthread.so.0
#14 0x00002ae6374db6c2 in clone () from /lib/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) f 2
#2  0x0000000000800836 in row_search_for_mysql (buf=0x11b17d8 "\buri_path", mode=2, 
    prebuilt=0x2aaaaaba28b8, match_mode=1, direction=0) at row/row0sel.c:3566
3566			if (dict_str_starts_with_keyword(
(gdb) ins trx->mysql_query_str
$1 = (char **) 0x11a4df0
(gdb) ins *trx->mysql_query_str
$2 = 0x0
(gdb) 

With the patch applied it does not crash but prints this:

080402 17:54:12 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001347' at position 106, relay log './sea1tpmysql04s2-relay-bin.000773' position: 251
[3 Apr 2008 13:01] Bugs System
Pushed into 6.0.5-alpha
[15 Apr 2008 1:46] Paul DuBois
Noted in 5.1.24, 6.0.5 changelogs.

In some cases, mysqld would crash during replication startup because
InnoDB tried to parse a NULL query and failed.
[1 Jul 2008 23:28] Jeremy Cole
Your changelog comment for this bug is grossly insufficient:

"In some cases, mysqld would crash during replication startup because InnoDB tried to parse a NULL query and failed. (Bug#35226)"

It took me a few hours to track down this bug and confirm for a customer that an upgrade really would solve their problems, largely because the changelog entry was so poor.  I would suggest:

"Replication: When using row-based replication, a slave may crash because it received a row-based replication event that InnoDB could not handle due to an incorrect test of the query string provided by MySQL, which was NULL for row-based replication events."
[2 Jul 2008 6:11] Vasil Dimov
Setting back to Documenting given the Jeremy's note.

Jeremy, your version does not mention that the crash can only happen at startup.
[14 Jul 2008 19:08] Paul DuBois
Modified changelog entry:

When using row-based replication, a slave could crash at startup
because it received a row-based replication event that InnoDB could
not handle due to an incorrect test of the query string provided by
MySQL, which was NULL for row-based replication events.
[5 May 2010 15:24] Bugs System
Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[6 May 2010 14:07] Paul DuBois
Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
Re-closing.
[28 May 2010 6:04] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (pib:16)
[28 May 2010 6:33] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[28 May 2010 7:00] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[29 May 2010 15:08] Paul DuBois
Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
Re-closing.
[17 Jun 2010 12:08] Bugs System
Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 12:55] Bugs System
Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 13:36] Bugs System
Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)