MySQL Bugs: #54286: Server crash at lock timeout with MRR

Bug #54286	Server crash at lock timeout with MRR
Submitted:	7 Jun 2010 11:32	Modified:	23 Nov 2010 3:30
Reporter:	Øystein Grøvlen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	mysql-next-mr-opt-backporting	OS:	Any
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	mrr, optimizer_switch

Description:
The multi-session MTR test given in How-to-repeat section, hits an assert in Protocol::end_statement.  The assert goes away when turning off MRR by adding this statement at the top of the script:

  set global optimizer_switch='mrr=off';

Call stack:

 [1] _lwp_kill(0xe, 0x6, 0x0, 0xfeaed63a), at 0xfeaf4697 
  [2] thr_kill(0xe, 0x6), at 0xfeaed65c 
=>[3] my_write_core(sig = 6), line 326 in "stacktrace.c"
  [4] handle_segfault(sig = 6), line 2786 in "mysqld.cc"
  [5] __sighndlr(0x6, 0x0, 0xfe631f28, 0x8313bf0), at 0xfeaefdb5 
  [6] call_user_handler(0x6), at 0xfeae2d0f 
  [7] sigacthandler(0x6, 0x0, 0xfe631f28, 0xf, 0x0, 0x807007), at 0xfeae2f52 
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [8] _lwp_kill(0xe, 0x6, 0xfe632164, 0xfea9bb6e), at 0xfeaf4697 
  [9] raise(0x6, 0x6, 0xfe63219c, 0xfea72474), at 0xfea9bb7a 
  [10] abort(0x65737341, 0x6f697472, 0x6166206e, 0x64656c69, 0x2c30203a, 0x6c696620, 0x682f2065, 0x2f656d6f, 0x3331676f, 0x32393736, 0x73796d2f, 0x6d2f6c71, 0x6c717379, 0x78656e2d, 0x726d2d74, 0x74706f2d, 0x6361622d, 0x726f706b, 0x676e6974, 0x6c71732f), at 0xfea724dc 
  [11] _assert(0x8dee3f1, 0x8dee3f3, 0x21c), at 0xfea727aa 
  [12] Protocol::end_statement(this = 0xce6be14), line 540 in "protocol.cc"
  [13] dispatch_command(command = COM_QUERY, thd = 0xce6ba48, packet = 0xce4f731 "", packet_length = 39U), line 1446 in "sql_parse.cc"
  [14] do_command(thd = 0xce6ba48), line 771 in "sql_parse.cc"
  [15] do_handle_one_connection(thd_arg = 0xce6ba48), line 1188 in "sql_connect.cc"
  [16] handle_one_connection(arg = 0xce6ba48), line 1127 in "sql_connect.cc"
  [17] pfs_spawn_thread(arg = 0xce06198), line 1011 in "pfs.cc"
  [18] _thrp_setup(0xfec85a00), at 0xfeaefa53 
  [19] _lwp_start(0xe, 0x6, 0x0, 0xfeaed63a, 0xfe631df8, 0x9075954), at 0xfeaefce0 
current thread: t@15
=>[1] __read(0x35, 0xce5b3d0, 0x4, 0xfeadede9), at 0xfeaf4457 
  [2] read(0x35, 0xce5b3d0, 0x4), at 0xfeadeead 
  [3] vio_read(vio = 0xce7e258, buf = 0xce5b3d0 "^G", size = 4U), line 44 in "viosocket.c"
  [4] my_real_read(net = 0xce598c4, complen = 0xfe46ff20), line 835 in "net_serv.cc"
  [5] my_net_read(net = 0xce598c4), line 1025 in "net_serv.cc"
  [6] do_command(thd = 0xce59768), line 717 in "sql_parse.cc"
  [7] do_handle_one_connection(thd_arg = 0xce59768), line 1188 in "sql_connect.cc"
  [8] handle_one_connection(arg = 0xce59768), line 1127 in "sql_connect.cc"
  [9] pfs_spawn_thread(arg = 0xce063d8), line 1011 in "pfs.cc"
  [10] _thrp_setup(0xfec86200), at 0xfeaefa53 
  [11] _lwp_start(0x35, 0xce5b3d0, 0x4, 0xfeadede9, 0xfec86200, 0x9075954), at 0xfeaefce0 

How to repeat:
--source include/have_innodb.inc
#set global optimizer_switch='mrr=off';

connect (con1,localhost,root,,);
connect (con2,localhost,root,,);

connection con1;

SET AUTOCOMMIT=0;

CREATE TABLE t1 (
  dummy INT PRIMARY KEY,
  a INT UNIQUE,
  b INT
) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1,1,1),(4,4,4),(5,5,5);
COMMIT;

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT @@tx_isolation;

connection con2;

SET AUTOCOMMIT=0;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

INSERT INTO t1 VALUES (3,3,3);

connection con1;

--error ER_LOCK_WAIT_TIMEOUT
SELECT * FROM t1 WHERE a > 2 FOR UPDATE;
ROLLBACK;

connection con2;

DROP TABLE t1;
COMMIT;

connection default;
disconnect con1;
disconnect con2;

Thank you for the bug report. Verified as described:

InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100607 17:41:47 InnoDB Plugin 1.0.6 started; log sequence number 0
100607 17:41:47 [Note] Event Scheduler: Loaded 0 events
100607 17:41:47 [Note] /home/miguel/bzr/5.5opt/sql/mysqld: ready for connections.
Version: '5.6.99-m4-debug-log'  socket: '/home/miguel/bzr/5.5opt/mysql-test/var/tmp/mysqld.1.sock'  port: 13000  Source distribution
mysqld: protocol.cc:540: void Protocol::end_statement(): Assertion `0' failed.
100607 17:42:38 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=3
max_threads=151
thread_count=3
connection_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 60702 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x2e64cf8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7ffc46b13e08 thread_stack 0x40000
/home/miguel/bzr/5.5opt/sql/mysqld(my_print_stacktrace+0x35)[0xbab68f]
/home/miguel/bzr/5.5opt/sql/mysqld(handle_segfault+0x2a7)[0x6c07b2]
/lib64/libpthread.so.0[0x35a240f440]
/lib64/libc.so.6(gsignal+0x35)[0x35a20329c5]
/lib64/libc.so.6(abort+0x175)[0x35a20341a5]
/lib64/libc.so.6(__assert_fail+0xf5)[0x35a202b945]
/home/miguel/bzr/5.5opt/sql/mysqld(_ZN8Protocol13end_statementEv+0x265)[0x6b27b5]
/home/miguel/bzr/5.5opt/sql/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1cb9)[0x6d1430]
/home/miguel/bzr/5.5opt/sql/mysqld(_Z10do_commandP3THD+0x27d)[0x6cf571]
/home/miguel/bzr/5.5opt/sql/mysqld(_Z24do_handle_one_connectionP3THD+0x14f)[0x6cd69e]
/home/miguel/bzr/5.5opt/sql/mysqld(handle_one_connection+0x33)[0x6cd548]
/home/miguel/bzr/5.5opt/sql/mysqld(_Z16pfs_spawn_threadPv+0xce)[0xb7e0d9]
/lib64/libpthread.so.0[0x35a2407761]
/lib64/libc.so.6(clone+0x6d)[0x35a20e150d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x7ffc34004b98 is an invalid pointer
thd->thread_id=3
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file

 - saving '/home/miguel/bzr/5.5opt/mysql-test/var/log/main.bug54286/' to '/home/miguel/bzr/5.5opt/mysql-test/var/log/main.bug54286/'
 - found 'core.10700' (0/5)

Trying 'dbx' to get a backtrace

Trying 'gdb' to get a backtrace
Core generated by '/home/miguel/bzr/5.5opt/sql/mysqld'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/9e/ef3c8dabad1ba6746e5fef74878ba17e416e38
[New Thread 10756]
[New Thread 10703]
[New Thread 10704]
[New Thread 10706]
[New Thread 10712]
[New Thread 10707]
[New Thread 10700]
[New Thread 10755]
[New Thread 10757]
[New Thread 10711]
[New Thread 10710]
[New Thread 10708]
[New Thread 10709]
[New Thread 10705]
Core was generated by `/home/miguel/bzr/5.5opt/sql/mysqld --defaults-group-suffix=.1 --defaults-file=/'.
Program terminated with signal 6, Aborted.
#0  0x00000035a240c61c in pthread_kill () from /lib64/libpthread.so.0
#0  0x00000035a240c61c in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000000bab713 in my_write_core (sig=6) at stacktrace.c:326
#2  0x00000000006c09b2 in handle_segfault (sig=6) at mysqld.cc:2786
#3  <signal handler called>
#4  0x00000035a20329c5 in raise () from /lib64/libc.so.6
#5  0x00000035a20341a5 in abort () from /lib64/libc.so.6
#6  0x00000035a202b945 in __assert_fail () from /lib64/libc.so.6
#7  0x00000000006b27b5 in Protocol::end_statement (this=0x2e651a8) at protocol.cc:540
#8  0x00000000006d1430 in dispatch_command (command=COM_QUERY, thd=0x2e64cf8, packet=0x2e67a19 "", packet_length=39)
    at sql_parse.cc:1446
#9  0x00000000006cf571 in do_command (thd=0x2e64cf8) at sql_parse.cc:771
#10 0x00000000006cd69e in do_handle_one_connection (thd_arg=0x2e64cf8) at sql_connect.cc:1188
#11 0x00000000006cd548 in handle_one_connection (arg=0x2e64cf8) at sql_connect.cc:1127
#12 0x0000000000b7e0d9 in pfs_spawn_thread (arg=0x2e42f08) at pfs.cc:1011
#13 0x00000035a2407761 in start_thread () from /lib64/libpthread.so.0
#14 0x00000035a20e150d in clone () from /lib64/libc.so.6

Thread 14 (Thread 10705):
#0  0x00000035a240b3bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000bb8ea7 in safe_cond_wait (cond=0x2d46b88, mp=0x2d46b10, file=0xd8d78d "os/os0sync.c", line=399) at thr_mutex.c:240
#2  0x00000000009ad23b in os_event_wait_low (event=0x2d46b10, reset_sig_count=0) at os/os0sync.c:399
#3  0x00000000009abf14 in os_aio_simulated_handle (global_segment=2, message1=0x7ffc4b361e60, message2=0x7ffc4b361e58, 
    type=0x7ffc4b361e50) at os/os0file.c:4197
<cut>

The crash occurs in the assert in the following code in Protocol::end_statement() (in protocol.cc):

  switch (thd->stmt_da->status()) {
  case Diagnostics_area::DA_ERROR:
  .......
  case Diagnostics_area::DA_EMPTY:
  default:
    DBUG_ASSERT(0);
    error= send_ok(thd->server_status, 0, 0, 0, NULL);
    break;

When this happens the content of the thd->stmt_da looks like:

(dbx) print *thd->stmt_da
*thd->stmt_da = {
    is_sent                = false
    can_overwrite_status   = false
    m_message              = ""
    m_sql_errno            = 0
    m_sqlstate             = ""
    m_server_status        = 0
    m_affected_rows        = 0
    m_last_insert_id       = 0
    m_statement_warn_count = 0
    m_status               = DA_EMPTY
}

This error happens when we try to read the first record from the
database. The following code in sql_select.cc does this:

int join_init_read_record(JOIN_TAB *tab)
{
  if (tab->select && tab->select->quick && tab->select->quick->reset())
    return 1;
  init_read_record(&tab->read_record, tab->join->thd, tab->table,
		   tab->select,1,1, FALSE);
  return (*tab->read_record.read_record)(&tab->read_record);
}

The first line of the above code will both with and without MRR
enabled initialize the reading of data and result in a call to:

1. QUICK_RANGE_SELECT::reset()
   ha_innobase::multi_range_read_init()
   DsMrr_impl::dsmrr_init()

while the last code line will read the first record and call the following:

2. rr_quick()
   QUICK_RANGE_SELECT::get_next()
   ha_innobase::multi_range_read_next()
   DsMrr_impl::dsmrr__next()
   handler::multi_range_read_next()

The difference between the two scenarios with MRR and without MRR is as follows:

Without MRR:
============

1. Initialization: DsMrr_impl::dsmrr_init() will do a minimal
initialization without reading anything from InnoDB.

2. The actual record will be read when executing the last line of
join_init_read_record. handler::multi_range_read_next() will return
error code 146 (due to the lock time out). This error code is
correctly handled in rr_quick() and get propagated back to the client.

With MRR:
=========

1. Initialization: DsMrr_imp::dsmrr_init() is doing a lot more during
initialization. One of the main steps is to try to fill the row id
buffer. This will include accessing the database by calling
handler::multi_range_read_next(). Also in this case
handler::multi_range_read_next() will return 146 (lock time out) error
code. Thus initialization/setup fails. The problem is that this
failure is not propagated correctly back to the client and results in
the assert.

The main difference between these two scenarios is that the lock
timeout occurs during initialization/setup when MRR is used and that
the lack of correct handling of error codes from the database. At
least the following are issues with the existing code:

1. DsMrr_impl::dsmrr_init() does not return the correct error code
when the timeout occurs. Instead it just returns 1 to indicate
failure/error.

2. In the code path from join_init_read_record() there seems to be no
code for handling such errors (opposed to the code handling done in
rr_quick() which calls rr_handle_error() to update diagnostic area).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/111784

3209 Olav Sandstaa	2010-06-22
      Fix for Bug#54286 Server crash at lock timeout with MRR
      
      This error happens when we try to read the first record from the
      database and a lock timeout occurs. This lock timeout error message
      does not get correctly propagated. The assert happens when the reply
      is going to be sent to the client and there is no error code set in
      the diagnostic area. The reason this happens when MRR is used is that
      when using MRR more work is done during initialization of MRR. One of
      the things the MRR code does is to try to fill the row id buffer by
      reading from the database index. This result in a call to
      ha_innobase::multi_range_read() which returns the lock timeout error
      code (error 146). This error code is not propagated up to the server code.
      
      When MRR is not in use there will be no calls to the database during
      initialization. The lock timeout error will only occur when the first call
      to read a record is done. With this code path the error code is
      correctly propagated to the server code and written to the diagnostic
      area before the reply message get sent to the client.
      
      This patch fixes this problem by the following two changes:
      
      1. Fix DsMrr_impl::dsmrr_init() so that it returns the same error code
         as reported by the storage engine instead of just returning one when
         an error occurs.
      
      2. Extend join_init_read_record() (in sql_select.cc) so that it
         handles errors that occurs during initialization (in this case from
         QUICK_RANGE_SELECT::reset()) and ensures the error code gets written
         to the diagnostic area.
     @ mysql-test/include/mrr_innodb_tests.inc
        Added test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_all.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_icp.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_none.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ sql/handler.cc
        Extend DsMrr_impl::dsmrr_init() so that in the case where the storage 
        engine returns an error code this error code is returned to the caller
        instead of just returning 1.
     @ sql/sql_select.cc
        Extend join_init_read_record() to handle the error code that is 
        returned from tab->select->quick->reset() by ensuring this get written
        to the diagnostic area by calling handler::print_error(). This
        fix will ensure that the error code get correctly returned to the client.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112038

3198 Olav Sandstaa	2010-06-24
      Fix for Bug#54286 Server crash at lock timeout with MRR
      
      This error happens when we try to read the first record from the
      database and a lock timeout occurs. This lock timeout error message
      does not get correctly propagated. The assert happens when the reply
      is going to be sent to the client and there is no error code set in
      the diagnostic area. The reason this happens when MRR is used is that
      when using MRR more work is done during initialization of MRR. One of
      the things the MRR code does is to try to fill the row id buffer by
      reading from the database index. This result in a call to
      ha_innobase::multi_range_read() which returns the lock timeout error
      code (error 146). This error code is not propagated up to the server code.
      
      When MRR is not in use there will be no calls to the database during
      initialization. The lock timeout error will only occur when the first call
      to read a record is done. With this code path the error code is
      correctly propagated to the server code and written to the diagnostic
      area before the reply message get sent to the client.
      
      This patch fixes this problem by the following two changes:
      
      1. Fix DsMrr_impl::dsmrr_init() so that it returns the same error code
         as reported by the storage engine instead of just returning one when
         an error occurs.
      
      2. Extend join_init_read_record() (in sql_select.cc) so that it
         handles errors that occurs during initialization (in this case from
         QUICK_RANGE_SELECT::reset()) and ensures the error code gets written
         to the diagnostic area.
     @ mysql-test/include/mrr_innodb_tests.inc
        Added test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_all.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_icp.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ mysql-test/r/innodb_mrr_none.result
        Test case for Bug#54286 "Server crash at lock timeout with MRR".
     @ sql/handler.cc
        Extend DsMrr_impl::dsmrr_init() so that in the case where the storage 
        engine returns an error code this error code is returned to the caller
        instead of just returning 1.
     @ sql/sql_select.cc
        Extend join_init_read_record() to handle the error code that is 
        returned from tab->select->quick->reset() by ensuring this get written
        to the diagnostic area by calling report_error(). This
        fix will ensure that the error code get correctly returned to the client.

Patch pushed to mysql-next-mr-opt-backporting with revision id:
olav@sun.com-20100624092332-amenrmodq0ypprr6 .

Pushed into mysql-next-mr (revid:alik@sun.com-20100816062819-bluwgdq8q4xysmlg) (version source revid:alik@sun.com-20100816062612-enatdwnv809iw3s9) (pib:20)

Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)

Bug does not appear in any released 5.6.x version. No 5.6.1 changelog entry needed.