MySQL Bugs: #62037: assert fails in row_search_for_mysql on TRX_ISO_READ

Bug #62037	assert fails in row_search_for_mysql on TRX_ISO_READ_UNCOMMITTED
Submitted:	30 Jul 2011 14:14	Modified:	13 Jul 2012 18:02
Reporter:	Mark Callaghan	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: InnoDB Plugin storage engine	Severity:	S1 (Critical)
Version:	5.1.52, 5.1.58, 5.5.14	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	innodb, regression

Description:
The SQL is not in the error log.

The assertion is:
110730  2:22:17  InnoDB: Assertion failure in thread 1598527808 in file row/row0sel.c line 4460
InnoDB: Failing assertion: trx->isolation_level == TRX_ISO_READ_UNCOMMITTED

The call stack from resolve_stack_dump is:

0x7d96b5 row_search_for_mysql + 6805
0x77fa67 _ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function + 503
0x6d55b1 _ZN7handler18index_read_idx_mapEPhjPKhm16ha_rkey_function + 113
0x65058b _Z15join_read_constP13st_join_table + 235
0x65f1ee _Z21join_read_const_tableP13st_join_tableP11st_position + 78
0x65fc9c _Z20make_join_statisticsP4JOINP10TABLE_LISTP4ItemP16st_dynamic_array + 2252
0x663e01 _ZN4JOIN8optimizeEv + 1169
/usr/libexec/mysqld(mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned 
0x66d4ed _Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select + 141
0x66df28 _Z13handle_selectP3THDP6st_lexP13select_resultm + 456
0x5f0454 _Z21execute_sqlcom_selectP3THDP10TABLE_LISTPy + 308
0x5f4f41 _Z21mysql_execute_commandP3THDPy + 9969
0x5fa603 _Z11mysql_parseP3THDPcjPPKcPy + 803
0x5fbb3c _Z16dispatch_command19enum_server_commandP3THDPcj + 5276
0x5fc193 _Z10do_commandP3THD + 275
0x5eba8a handle_one_connection + 1994

Other text in the error log:

key_buffer_size=16777216
read_buffer_size=1048576
max_used_connections=2540
max_threads=5000
threads_connected=652

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x2ab985504760 is an invalid pointer
thd->thread_id=131391566
thd->killed=NOT_KILLED

And I think I have seen this before ...

Fatal signal 11 while backtracing
mysqld: misc.cpp:103: int __cxa_pure_virtual(): Assertion `!"Aborted: pure virtual method called."' failed.
mysqld: misc.cpp:103: int __cxa_pure_virtual(): Assertion `!"Aborted: pure virtual method called."' failed.

This is the source code for the assert in row_search_for_mysql

   4441         } else {
   4442                 if (prebuilt->template_type == ROW_MYSQL_DUMMY_TEMPLATE) {
   4443                         memcpy(buf + 4, result_rec
   4444                                - rec_offs_extra_size(offsets),
   4445                                rec_offs_size(offsets));
   4446                         mach_write_to_4(buf,
   4447                                         rec_offs_extra_size(offsets) + 4);
   4448                 } else {
   4449                         if (!row_sel_store_mysql_rec(buf, prebuilt,
   4450                                                      result_rec, offsets)) {
   4451                                 /* Only fresh inserts may contain
   4452                                 incomplete externally stored
   4453                                 columns. Pretend that such records do
   4454                                 not exist. Such records may only be
   4455                                 accessed at the READ UNCOMMITTED
   4456                                 isolation level or when rolling back a
   4457                                 recovered transaction. Rollback
   4458                                 happens at a lower level, not here. */
   4459                                 ut_a(trx->isolation_level
   4460                                      == TRX_ISO_READ_UNCOMMITTED);
   4461                                 goto next_rec;
   4462                         }

How to repeat:
I don't know

Suggested fix:
I don't know

There are many asserts for TRX_ISO_READ_UNCOMMITTED that are fairly new including this assert

3351.14.276 marko.makela@oracle.com 20101019 | /* Returning a row to MySQL */
|
| if (!row_sel_store_mysql_rec(buf, prebuilt, result_rec,
| result_rec != rec,
| offsets)) {
3351.14.170 marko.makela@oracle.com 20100630 | /* Only fresh inserts may contain
| incomplete externally stored
| columns. Pretend that such records do
| not exist. Such records may only be
| accessed at the READ UNCOMMITTED
3351.14.166 marko.makela@oracle.com 20100629 | isolation level or when rolling back a
3351.14.170 marko.makela@oracle.com 20100630 | recovered transaction. Rollback
| happens at a lower level, not here. */
3351.14.166 marko.makela@oracle.com 20100629 | ut_a(trx->isolation_level
| == TRX_ISO_READ_UNCOMMITTED);
| goto next_rec;

revno: 3351.14.166
committer: Marko Mäkelä <marko.makela@oracle.com>
branch nick: 5.1-innodb
timestamp: Tue 2010-06-29 15:55:18 +0300
message:
Bug#54358: READ UNCOMMITTED access failure of off-page DYNAMIC or COMPRESSED
columns

When the server crashes after a record stub has been inserted and
before all its off-page columns have been written, the record will
contain incomplete off-page columns after crash recovery. Such records
may only be accessed at the READ UNCOMMITTED isolation level or when
rolling back a recovered transaction in recv_recovery_rollback_active().
Skip these records at the READ UNCOMMITTED isolation level.

TODO: Add assertions for checking the above assumptions hold when an
incomplete BLOB is encountered.

btr_rec_copy_externally_stored_field(): Return NULL if the field is
incomplete.

row_prebuilt_t::templ_contains_blob: Clarify what "BLOB" means in this
context. Hint: MySQL BLOBs are not the same as InnoDB BLOBs.

row_sel_store_mysql_rec(): Return FALSE if not all columns could be
retrieved. Previously this function always returned TRUE. Assert that
the record is not delete-marked.

row_sel_push_cache_row_for_mysql(): Return FALSE if not all columns
could be retrieved.

row_search_for_mysql(): Skip records containing incomplete off-page
columns. Assert that the transaction isolation level is READ
UNCOMMITTED.

rb://380 approved by Jimmy Yang

And the root cause for the assert is that this function returns NULL

/*******************************************************************//**
Copies an externally stored field of a record to mem heap.
@return the field copied to heap, or NULL if the field is incomplete */
UNIV_INTERN
byte*
btr_rec_copy_externally_stored_field(
/*=================================*/
        const rec_t*    rec,    /*!< in: record in a clustered index;
                                must be protected by a lock or a page latch */
        const ulint*    offsets,/*!< in: array returned by rec_get_offsets() */
        ulint           zip_size,/*!< in: nonzero=compressed BLOB page size,
                                zero for uncompressed BLOBs */
        ulint           no,     /*!< in: field number */
        ulint*          len,    /*!< out: length of the field */
        mem_heap_t*     heap)   /*!< in: mem heap */
{
        ulint           local_len;
        const byte*     data;

        ut_a(rec_offs_nth_extern(offsets, no));

        /* An externally stored field can contain some initial
        data from the field, and in the last 20 bytes it has the
        space id, page number, and offset where the rest of the
        field data is stored, and the data length in addition to
        the data stored locally. We may need to store some data
        locally to get the local record length above the 128 byte
        limit so that field offsets are stored in two bytes, and
        the extern bit is available in those two bytes. */

        data = rec_get_nth_field(rec, offsets, no, &local_len);

        ut_a(local_len >= BTR_EXTERN_FIELD_REF_SIZE);

        if (UNIV_UNLIKELY
            (!memcmp(data + local_len - BTR_EXTERN_FIELD_REF_SIZE,
                     field_ref_zero, BTR_EXTERN_FIELD_REF_SIZE))) {
                /* The externally stored field was not written yet.
                This record should only be seen by
                recv_recovery_rollback_active() or any
                TRX_ISO_READ_UNCOMMITTED transactions. */
                return(NULL);
        }

It would be still nice to find out exact SQL failing or, at least, information about the table accessed while this assertion happens. 

I had tried to run concurrent INSERTs, ROLLBACK WORK and SELECT (both all rows and by index on c2) for the table like this:

mysql> show create table tinno\G
*************************** 1. row ***************************
       Table: tinno
Create Table: CREATE TABLE `tinno` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c2` int(11) DEFAULT NULL,
  `c3` blob,
  PRIMARY KEY (`id`),
  KEY `c2` (`c2`)
) ENGINE=InnoDB AUTO_INCREMENT=145390 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
1 row in set (0.06 sec)

mysql> show table status like 'tinno'\G
*************************** 1. row ***************************
           Name: tinno
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 4098
 Avg_row_length: 32883
    Data_length: 134758400
Max_data_length: 0
   Index_length: 98304
      Data_free: 131072000
 Auto_increment: 145390
    Create_time: 2011-07-30 18:51:24
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: row_format=DYNAMIC
        Comment: 
1 row in set (0.07 sec)

with innodb_file_per_table and Barracuda format set, but no assertions happened yet.

I think mysqldump SELECT statements were repeatedly crashing a server and am trying to reproduce that now

While row_search_for_mysql has good error messages for other errors -- messages that include the table_name, the error message in this case is extremely limited. So this is also a feature request to improve those error messges by including the table/database name or space ID.

Otherwise, we must guess at the reproduction case.

And please, any time new code adds an assertion to InnoDB, please think about what will be in the mysqld error log when that assertion fails. In many cases the log is insufficient unless more detail is included prior to the assert failure. Assume that the stack dump done by mysqld on asserts and segfaults doesn't always dump the SQL statement.

On a clean datadir using 5.1.58 + plugin + compact tables + innodb file per table, I got this with 3 concurrent threads doing blob stuff:

Version: '5.1.58-community'  socket: ''  port: 3306  MySQL Community Server (GPL)
110731 11:36:49  InnoDB: Assertion failure in thread 2036 in file .\row\row0sel.c line 4473
InnoDB: Failing assertion: trx->isolation_level == TRX_ISO_READ_UNCOMMITTED
InnoDB: We intentionally generate a memory trap.

 ha_innodb_plugin.dll!row_search_for_mysql()[row0sel.c:4473]
 ha_innodb_plugin.dll!ha_innodb::index_read()[ha_innodb.cc:5479]
 ha_innodb_plugin.dll!ha_innodb::index_first()[ha_innodb.cc:5783]
 mysqld.exe!handler::read_range_first()[handler.cc:4261]
 mysqld.exe!handler::read_multi_range_first()[handler.cc:4139]
 mysqld.exe!QUICK_RANGE_SELECT::get_next()[opt_range.cc:8594]
 mysqld.exe!rr_quick()[records.cc:331]
 mysqld.exe!join_init_read_record()[sql_select.cc:12189]
 mysqld.exe!sub_select()[sql_select.cc:11459]
 mysqld.exe!do_select()[sql_select.cc:11217]
 mysqld.exe!JOIN::exec()[sql_select.cc:2343]
 mysqld.exe!mysql_select()[sql_select.cc:2555]
 mysqld.exe!handle_select()[sql_select.cc:269]
 mysqld.exe!execute_sqlcom_select()[sql_parse.cc:5171]
 mysqld.exe!mysql_execute_command()[sql_parse.cc:2307]
 mysqld.exe!mysql_parse()[sql_parse.cc:6092]
 mysqld.exe!dispatch_command()[sql_parse.cc:1277]
 mysqld.exe!do_command()[sql_parse.cc:893]
 mysqld.exe!handle_one_connection()[sql_connect.cc:1427]
 mysqld.exe!pthread_start()[my_winthread.c:85]
 mysqld.exe!_callthreadstart()[thread.c:293]

Query (02517400): =select * from `blobtest_1` where `id`<=4495 limit 1

Repeating the query and dumping tables doesn't crash.

I got this repeatable in seconds by running mysqld like this:

./bin/mysqld_safe --no-defaults --skip-gr --skip-na --ignore-builtin-innodb 
--plugin-load=ha_innodb_plugin.so --innodb-flush-log-at-trx-commit=0 --core-file

then running two copies of the attached blob tester php app concurrently.

Same test case crashed 64-bit release binaries of 5.5.14 on Mac OS X:

110731 15:41:46 [Note] /usr/local/mysql-5.5.14-osx10.5-x86_64/bin/mysqld: ready for connections.
Version: '5.5.14'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
110731 15:42:41  InnoDB: Assertion failure in thread 4492226560 in file row0sel.c line 4471
InnoDB: Failing assertion: trx->isolation_level == TRX_ISO_READ_UNCOMMITTED
InnoDB: We intentionally generate a memory trap.

From my testcase results:
5.1.47: no assertion
5.1.48: no assertion
5.1.49: no assertion
5.1.50: hits assertion
5.1.58: hits assertion
5.5.14: hits assertion
5.5.15: no assertion. not sure why.

On a side note, the SQL query was not printed in old versions due to bug #51817 .

Could this be a duplicate of the following ?
BUG 12612184 - RACE CONDITION AFTER BTR_CUR_PESSIMISTIC_UPDATE()

http://lists.mysql.com/commits/138935
http://lists.mysql.com/commits/139001

This looks like a duplicate of Oracle Bug#12612184, which I worked on in June. Since the beginning of InnoDB, there has been a race condition in updates (quoting what I posted in the bug in June):

----start quote----
When a record is updated so that its size grows, non-updated columns can be selected for external (off-page) storage. The bug is that the initially inserted updated record contains an all-zero BLOB pointer to the field that was not updated. Only after the BLOB is allocated, the pointer will be valid. This is the window of opportunity (quoting from row_upd_clust_rec() in MySQL 3.23.53):

err = btr_cur_pessimistic_update(BTR_NO_LOCKING_FLAG, btr_cur, &big_rec, node->update, node->cmpl_info, thr, mtr);
mtr_commit(mtr);
 
if (err == DB_SUCCESS && big_rec) {
	mtr_start(mtr);
	ut_a(btr_pcur_restore_position(BTR_MODIFY_TREE, pcur, mtr));

Between the release of the page latch in mtr_commit(mtr) and the re-latching of the page in btr_pcur_restore_position(), other threads can see the invalid BLOB pointer. I checked that the invalid BLOB pointer would be zeroed in dtuple_convert_big_rec() even in 3.23.53; old versions of InnoDB had the bad habit of not initializing all fields.

The problem is amplified by the ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPRESSED that were introduced in innodb_file_format=barracuda in InnoDB Plugin, but the bug does exist in all InnoDB versions.
----end quote----

Apparently, the 5.1.58 release does not contain the fix, even though the files in the source archive are dated 2011-07-01. I pushed the fix to the mysql-5.1 tree on 2011-06-16.

There is a known bug in my fix: If the server crashes between btr_cur_mtr_commit_and_start() and the final mtr_commit(mtr) after all BLOBs have been written, at crash recovery any information about the non-updated BLOBs (which were moved to off-page storage due to the record size changing in the update) will be permanently lost.

I outlined a fix for that in Bug#12612184 on 2011-06-28, and I will get back to it after I have completed more urgent tasks that have been piling up while I was on vacation.

The recovery bug is being tracked as Oracle Bug #12704861 CORRUPTION AFTER A CRASH DURING BLOB UPDATE.

12704861 describes a race condition between concurrent INSERT (or UPDATE) and SELECT. I can reproduce this with one SELECT statement. Is my table now corrupt with a zero-length blob reference?

Yes Mark, your table could be corrupted so that a BLOB pointer is full of zero bytes (sort of a null pointer). This could be a transient or a permanent condition. The condition should be transient when another transaction is trying to access an off-page column just after an UPDATE transaction has done the first mtr_commit() and is about to x-latch the clustered index page so that it can write the off-page columns.

If you get a crash and crash recovery at an unfortunate moment, you will have permanent corruption.

I am not sure if there is an easy way to detect the corruption. The following is from memory and may be wrong. "CHECK TABLE t" would only count the records, not attempt to dereference all off-page columns. The statements "CHECKSUM TABLE t" or "SELECT * FROM t" would read all columns.

Alas, the InnoDB BLOB code (or code for retrieving off-page columns a.k.a. externally stored fields) as it was originally written by Heikki Tuuri is lacking some error checks or assertions. It could silently ignore null BLOB pointers, pretending that the column contained just what it was able to retrieve.

The BLOB storage should always be an all-or-nothing deal. Either the clustered index leaf page record points to the full BLOB suffix, or it is a null pointer. If you see off-page columns that are 0 or 768 bytes long, alarm bells should ring. Before InnoDB Plugin, InnoDB would store 768 bytes locally in the clustered index page. In the Barracuda format (ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED), the clustered index record does not store any prefix.

Can Shane confirm this doesn't reproduce with 5.1.59 or 5.1 trunk?

I will run on QuadXeon under Linux tomorrow, but with recent pull from mysql-5.1 tree in bzr I am not able to get this assertion on Mac after running for more than 1 hour:

macbook-pro:5.1 openxs$ tail -f var/macbook-pro.err InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
110802 18:41:46 InnoDB Plugin 1.0.17 started; log sequence number 0
110802 18:41:46 [Note] Event Scheduler: Loaded 0 events
110802 18:41:46 [Note] /Users/openxs/dbs/5.1/libexec/mysqld: ready for connections.
Version: '5.1.59'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution
^C
macbook-pro:5.1 openxs$ date
вт  2 сер 2011 19:49:33 EEST

This is the same Mac where I reproduced assertions on both 5.5.14 and 5.1.58 (official 64-bit binaries).

this assertion was reproduced on Percona Server 5.5.18, and ROW_FORMAT=COMPACT.

i don't know if this bug affects all types of ROW_FORMAT.

below is part of error log

120503  9:20:30  InnoDB: Assertion failure in thread 140210779649792 in file row0sel.c line 4612
InnoDB: Failing assertion: trx->isolation_level == TRX_ISO_READ_UNCOMMITTED
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
120503  9:20:30 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=104857600
read_buffer_size=1048576
max_used_connections=48
max_threads=2100
thread_count=47
connection_count=47
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2488475 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x3f64eec0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f855db6ee78 thread_stack 0x40000
/u01/mysql/bin/mysqld(my_print_stacktrace+0x39)[0x7b1b69]
/u01/mysql/bin/mysqld(handle_segfault+0x43c)[0x4fa39c]
/lib64/libpthread.so.0[0x36d0a0f520]
/lib64/libc.so.6(gsignal+0x35)[0x36d0632a45]
/lib64/libc.so.6(abort+0x175)[0x36d0634225]
/u01/mysql/bin/mysqld[0x8479c1]
/u01/mysql/bin/mysqld[0x81f4fe]
/u01/mysql/bin/mysqld[0x5a27d5]
/u01/mysql/bin/mysqld(_Z10sub_selectP4JOINP13st_join_tableb+0x79)[0x59a689]
/u01/mysql/bin/mysqld[0x59f56d]
/u01/mysql/bin/mysqld(_ZN4JOIN4execEv+0xc5e)[0x5b566e]
/u01/mysql/bin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x119)
[0x5b1249]
/u01/mysql/bin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x1c5)[0x5b6e55]
/u01/mysql/bin/mysqld[0x574028]
/u01/mysql/bin/mysqld(_Z21mysql_execute_commandP3THD+0x1102)[0x577f32]
/u01/mysql/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x3a4)[0x57b674]
/u01/mysql/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1448)[0x57cac8]
/u01/mysql/bin/mysqld(_Z24do_handle_one_connectionP3THD+0x10d)[0x61a70d]
/u01/mysql/bin/mysqld(handle_one_connection+0x4c)[0x61a85c]
/lib64/libpthread.so.0[0x36d0a077e1]
/lib64/libc.so.6(clone+0x6d)[0x36d06e68ed]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f8448004bb0): SELECT * FROM `delayed_jobs` ORDER BY `id` LIMIT 0, 1000
Connection ID (thread ID): 28339
Status: NOT_KILLED

Can this get a status update?

This bug was declared a duplicate of internal Bug #13721257, that is fixed in 5.5.22+ (see http://dev.mysql.com/doc/refman/5.5/en/news-5-5-22.html) and 5.1.63+ (Plugin only).

/home/xiyu.lh/work/5.5
$ bzr log | grep 13721257

No 13721257 info in 5.5 launchpad ?

it's not easy to find the related patch, because there isn't any comment in the bzr log....
May be this one ?

http://bazaar.launchpad.net/~mysql/mysql-server/5.5/revision/3723

http://bazaar.launchpad.net/~mysql/mysql-server/5.5/revision/3730

I don't know if it's safe to patch the diff directly to MySQL5.5.18(we are using now..). and it's impossible for us to upgrade to a new version such as MySQL5.5.22(that means long time tests...)