Bug #35226 RBR event crashes slave
Submitted: 11 Mar 18:25 Modified: 14 Jul 21:08
Reporter: Matthew Lord
Status: Closed
Category:Server: InnoDB Severity:S1 (Critical)
Version:5.1.23 OS:Any
Assigned to: Vasil Dimov Target Version:5.1.24
Tags: RBR, row based replication, crash, slave, innodb, READ-COMMITTED
Triage: D1 (Critical) / R1 (None/Negligible) / E3 (Medium)

[11 Mar 18:25] Matthew Lord
Description:
A row based replication event crashes the slave server every time and requires
that the slave be rebuilt.

When tested with a snapshot of 5.1.24 (taken on 3/7/2008) we also encountered
the crash, here was the stack trace in the error log:
/app/mysql/bin/mysqld(print_stacktrace+0x1e)[0x74baee]
/app/mysql/bin/mysqld(handle_segfault+0x320)[0x615f20]
/lib64/libpthread.so.0[0x3a8e40de60]
/app/mysql/bin/mysqld(dict_str_starts_with_keyword+0x15)[0x7eb155]
/app/mysql/bin/mysqld(row_search_for_mysql+0x4b3)[0x8512b3]
/app/mysql/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x1cd)[0x7c39a
d]
/app/mysql/bin/mysqld(_ZN11ha_innobase7rnd_posEPhS0_+0x9b)[0x7c364b]
/app/mysql/bin/mysqld(_ZN14Rows_log_event8find_rowEPK14Relay_log_info+0x2d6)[0x6ce086]
/app/mysql/bin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEPK14Relay_log_info+0x1e)[0x6
ce19e]
/app/mysql/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x172)[0x6c9c
d2]
/app/mysql/bin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14Relay_log_infob+0x1
1d)[0x73a95d]
/app/mysql/bin/mysqld(handle_slave_sql+0x687)[0x73fba7]
/lib64/libpthread.so.0[0x3a8e4062e7]
/lib64/libc.so.6(clone+0x6d)[0x3a8d8ce3bd]
 

How to repeat:
tar xzvf csc23899_bug.tar.gz
cd csc23899_bug/mysql-5.1.23-rc-linux-x86_64-glibc23
./bin/mysqld_safe --defaults-file=../my.cnf --basedir=. --datadir=../data &
./bin/mysql -u root --socket=/tmp/mysql-csc23899.sock
show slave status\G
start slave sql_thread;

(so far only tested on x86_64)

Suggested fix:
We should hopefully always be able to process the event or die gracefully.
[14 Mar 17:00] Heikki Tuuri
The problem probably is that replication does not set thd's query string to a sensible
value.

The code below is a hack: a better way to do this would be to check the SQL statement type
in ha_innodb.cc, from the flags MySQL has.

row0sel.c in 5.1:

        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str && trx->mysql_thd) {

                /* Scan the MySQL query string; check if SELECT is the first
                word there */

                if (dict_str_starts_with_keyword(
                            trx->mysql_thd, *trx->mysql_query_str, "SELECT")) {
                        /* It is a plain locking SELECT and the isolation
                        level is low: do not lock gaps */

                        set_also_gap_locks = FALSE;
                }
        }
[14 Mar 17:11] Heikki Tuuri
/*************************************************************************
Gets the InnoDB transaction handle for a MySQL handler object, creates
an InnoDB transaction struct if the corresponding MySQL thread struct still
lacks one. */
static
trx_t*
check_trx_exists(
/*=============*/
                        /* out: InnoDB transaction handle */
        THD*    thd)    /* in: user thread handle */
{
        trx_t*& trx = thd_to_trx(thd);

        ut_ad(thd == current_thd);

        if (trx == NULL) {
                DBUG_ASSERT(thd != NULL);
                trx = trx_allocate_for_mysql();

                trx->mysql_thd = thd;
                trx->mysql_query_str = thd_query(thd);

Ok, I guess MySQL does not set thd->query to a sensible value.

We know that in replication, SELECT is NOT processed. Thus we know the code in row0sel.c
should skip the statement.

Probably the following simple change in row0sel.c fixes the crash:

Replace:
        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str && trx->mysql_thd) {

by:

        if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
            && prebuilt->select_lock_type != LOCK_NONE
            && trx->mysql_query_str
            && *trx->mysql_query_str
            && trx->mysql_thd) {

I assume that *trx->mysql_query_str is NULL in the crash.
[26 Mar 21:22] Vasil Dimov
Matthew, can you check whether the last suggestion by Heikki fixes this bug?

I.e. the following patch:

--- patch begins here ---
Index: row/row0sel.c
===================================================================
--- row/row0sel.c	(revision 2385)
+++ row/row0sel.c	(working copy)
@@ -3574,13 +3574,15 @@ shortcut_fails_too_big_rec:
 	}
 
 	trx_start_if_not_started(trx);
 
 	if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
 	    && prebuilt->select_lock_type != LOCK_NONE
-	    && trx->mysql_query_str && trx->mysql_thd) {
+	    && trx->mysql_query_str != NULL
+	    && *trx->mysql_query_str != NULL
+	    && trx->mysql_thd != NULL) {
 
 		/* Scan the MySQL query string; check if SELECT is the first
 		word there */
 
 		if (dict_str_starts_with_keyword(
 			    trx->mysql_thd, *trx->mysql_query_str, "SELECT")) {
--- patch end here ---

I will try to reproduce this tomorrow...
[28 Mar 15:52] Daniel Fischer
Applied a snapshot containing a fix for this bug to 5.1-build.
[31 Mar 16:53] Bugs System
Pushed into 5.1.24-rc
[2 Apr 16:51] Vasil Dimov
I confirm that indeed this is the problem. I managed to crash a debug version of mysqld
(without the patch in this bug report applied):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1141680480 (LWP 5752)]
0x000000000079bb13 in dict_accept (cs=0xbe9980, ptr=0x0, string=0x9c1b49 "SELECT", 
    success=0x440c92f0) at dict/dict0dict.c:2226
2226		while (my_isspace(cs, *ptr)) {
Current language:  auto; currently c
(gdb) 
(gdb) 
(gdb) bt
#0  0x000000000079bb13 in dict_accept (cs=0xbe9980, ptr=0x0, string=0x9c1b49 "SELECT", 
    success=0x440c92f0) at dict/dict0dict.c:2226
#1  0x000000000079da3d in dict_str_starts_with_keyword (mysql_thd=0x11a4d90, str=0x0, 
    keyword=0x9c1b49 "SELECT") at dict/dict0dict.c:3289
#2  0x0000000000800836 in row_search_for_mysql (buf=0x11b17d8 "\buri_path", mode=2, 
    prebuilt=0x2aaaaaba28b8, match_mode=1, direction=0) at row/row0sel.c:3566
#3  0x000000000077c2d4 in ha_innobase::index_read (this=0x11b15f0, 
    buf=0x11b17d8 "\buri_path", key_ptr=0x11b1968 "\b", key_len=34, 
    find_flag=HA_READ_KEY_EXACT) at handler/ha_innodb.cc:4164
#4  0x000000000077bfc1 in ha_innobase::rnd_pos (this=0x11b15f0, 
    buf=0x11b17d8 "\buri_path", pos=0x11b1968 "\b") at handler/ha_innodb.cc:4578
#5  0x00000000006d66f2 in handler::rnd_pos_by_record (this=0x11b15f0, 
    record=0x11b17d8 "\buri_path") at ../../sql/handler.h:1434
#6  0x0000000000699d85 in Rows_log_event::find_row (this=0x11ae600, rli=0x1179930)
    at log_event.cc:8078
#7  0x000000000069a1f5 in Update_rows_log_event::do_exec_row (this=0x11ae600, 
    rli=0x1179930) at log_event.cc:8470
#8  0x000000000069d8e7 in Rows_log_event::do_apply_event (this=0x11ae600, rli=0x1179930)
    at log_event.cc:6650
#9  0x0000000000719f34 in Log_event::apply_event (this=0x11ae600, rli=0x1179930)
    at log_event.h:1011
#10 0x0000000000713693 in apply_event_and_update_pos (ev=0x11ae600, thd=0x11a4d90, 
    rli=0x1179930, skip=true) at slave.cc:1915
#11 0x0000000000713fb4 in exec_relay_log_event (thd=0x11a4d90, rli=0x1179930)
    at slave.cc:2057
#12 0x0000000000714555 in handle_slave_sql (arg=0x1178670) at slave.cc:2700
#13 0x00002ae636a0ef1a in start_thread () from /lib/libpthread.so.0
#14 0x00002ae6374db6c2 in clone () from /lib/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) f 2
#2  0x0000000000800836 in row_search_for_mysql (buf=0x11b17d8 "\buri_path", mode=2, 
    prebuilt=0x2aaaaaba28b8, match_mode=1, direction=0) at row/row0sel.c:3566
3566			if (dict_str_starts_with_keyword(
(gdb) ins trx->mysql_query_str
$1 = (char **) 0x11a4df0
(gdb) ins *trx->mysql_query_str
$2 = 0x0
(gdb) 

With the patch applied it does not crash but prints this:

080402 17:54:12 [Note] Slave SQL thread initialized, starting replication in log
'mysql-bin.001347' at position 106, relay log './sea1tpmysql04s2-relay-bin.000773'
position: 251
[3 Apr 15:01] Bugs System
Pushed into 6.0.5-alpha
[15 Apr 3:46] Paul DuBois
Noted in 5.1.24, 6.0.5 changelogs.

In some cases, mysqld would crash during replication startup because
InnoDB tried to parse a NULL query and failed.
[2 Jul 1:28] Jeremy Cole
Your changelog comment for this bug is grossly insufficient:

"In some cases, mysqld would crash during replication startup because InnoDB tried to
parse a NULL query and failed. (Bug#35226)"

It took me a few hours to track down this bug and confirm for a customer that an upgrade
really would solve their problems, largely because the changelog entry was so poor.  I
would suggest:

"Replication: When using row-based replication, a slave may crash because it received a
row-based replication event that InnoDB could not handle due to an incorrect test of the
query string provided by MySQL, which was NULL for row-based replication events."
[2 Jul 8:11] Vasil Dimov
Setting back to Documenting given the Jeremy's note.

Jeremy, your version does not mention that the crash can only happen at startup.
[14 Jul 21:08] Paul DuBois
Modified changelog entry:

When using row-based replication, a slave could crash at startup
because it received a row-based replication event that InnoDB could
not handle due to an incorrect test of the query string provided by
MySQL, which was NULL for row-based replication events.