Bug #17992 Partitions: InnoDB, somehow rotten table after UPDATE
Submitted: 6 Mar 2006 20:53 Modified: 24 Apr 2006 16:43
Reporter: Matthias Leich Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Partitions Severity:S1 (Critical)
Version:5.1 OS:
Assigned to: Mikael Ronström CPU Architecture:Any

[6 Mar 2006 20:53] Matthias Leich
Description:
# InnoDB seems to be needed, MyISAM is harmless
SET SESSION storage_engine='InnoDB';
# A table without subpartitioning is harmless.
CREATE TABLE t1 ( f_int1 INTEGER, f_int2 INTEGER,
f_char1 CHAR(10), f_char2 CHAR(10), f_charbig VARCHAR(1000))
PARTITION BY RANGE(f_int1 DIV 2) SUBPARTITION BY HASH(f_int1) SUBPARTITIONS 2 
(PARTITION parta VALUES LESS THAN (0),
PARTITION partb VALUES LESS THAN (5),
PARTITION partc VALUES LESS THAN (10),
PARTITION partd VALUES LESS THAN (2147483646));
INSERT INTO t1 SET f_int1 = 8, f_int2 = 8,
 f_char1 = '8', f_char2 = '8', f_charbig = '===8===';
INSERT INTO t1 SET f_int1 = 7, f_int2 = 7,
 f_char1 = '7', f_char2 = '7', f_charbig = '===7===';
INSERT INTO t1 SET f_int1 = 3, f_int2 = 3,
 f_char1 = '3', f_char2 = '3', f_charbig = '===3===';
INSERT INTO t1 SET f_int1 = 20, f_int2 = 20,
 f_char1 = '20', f_char2 = '20', f_charbig = '===20===';

# The following DELETE + INSERT seems to be needed
DELETE FROM t1 WHERE f_int1 = 3;
INSERT INTO t1 SET f_int1 = 3, f_int2 = 3, f_char1 = '3', f_char2 = '3', f_charbig = '===3===';

# We get here 4 rows.
SELECT * FROM t1 ORDER BY f_int1;
f_int1	f_int2	f_char1	f_char2	f_charbig
3	3	3	3	===3===
7	7	7	7	===7===
8	8	8	8	===8===
20	20	20	20	===20===
# This next update seems to cause the problem in combination 
# with the DELETE + INSERT above. 
UPDATE t1 SET f_int1 = f_int1 + 20 WHERE f_int1 BETWEEN 5 AND 9;
# Surprising fact
#             f_int1 BETWEEN 5 AND 8
#             f_int1 BETWEEN 6 AND 9
# hit the same rows, but are harmless

# Where is the row "27    7       7       7       ===7===" ?
SELECT * FROM t1 ORDER BY f_int1;
f_int1	f_int2	f_char1	f_char2	f_charbig
3	3	3	3	===3===
20	20	20	20	===20===
28	8	8	8	===8===

# Let's make some physical reorganization of the table and try the SELECT again
ALTER TABLE t1 ADD PRIMARY KEY (f_int1);
# Now the result set looks good.
SELECT * FROM t1 ORDER BY f_int1;
f_int1	f_int2	f_char1	f_char2	f_charbig
3	3	3	3	===3===
20	20	20	20	===20===
27	7	7	7	===7===
28	8	8	8	===8===

My environment:
   - Intel PC with Linux(SuSE 9.3)
   - MySQL compiled from source
         Version 5.1 last ChangeSet@1.2145, 2006-03-04

How to repeat:
Please use my attached testscript ml1111.test
  copy it to mysql-test/t
  echo "Dummy" > r/ml1111.result   # Produce a dummy file with 
                                                   # expected results
  ./mysql-test-run ml1111
[6 Mar 2006 20:54] Matthias Leich
testscript

Attachment: ml1111.test (application/test, text), 1.93 KiB.

[20 Mar 2006 18:29] Heikki Tuuri
Mikael,

the reason is that when MySQL uses the handle to insert to the InnoDB table, prebuilt->sql_stat_start == 0, and InnoDB does not know to put the current transaction id to the buffer from where it is written to the inserted row. The result is grabage in inserted rows.

Is the same table handle used earlier for something else?

We see that in ::write_row():
last_query_id == user_thd->query_id

Breakpoint 1, row_search_for_mysql (buf=0x8f8eca0 "à\024", mode=2,
    prebuilt=0x42bc2068, match_mode=1, direction=0) at row0sel.c:3097
3097            dict_index_t*   index           = prebuilt->index;
(gdb)
Continuing.

Breakpoint 7, ha_innobase::write_row (this=0x4a922a0c,
    record=0x8f8eca0 "à\034") at ha_innodb.cc:3371
3371            ibool           auto_inc_used= FALSE;
Current language:  auto; currently c++
(gdb) next
484     {
(gdb)
3367            row_prebuilt_t* prebuilt = (row_prebuilt_t*)innobase_prebuilt;
(gdb)
3366    {
(gdb)
484     {
(gdb)
3397            if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_INSERT)
(gdb)
3400            if ((user_thd->lex->sql_command == SQLCOM_ALTER_TABLE
(gdb)
3468            num_write_row++;
(gdb)
3470            if (last_query_id != user_thd->query_id) {
(gdb)
3468            num_write_row++;
(gdb) c
Continuing.

Breakpoint 4, row_insert_for_mysql (mysql_rec=0x8f8eca0 "à\034",
    prebuilt=0x42bc4468) at row0mysql.c:1070
1070            trx_t*          trx             = prebuilt->trx;
Current language:  auto; currently c
(gdb) print *prebuilt
$21 = {magic_n = 78540783, table = 0x42bbfe68, trx = 0x42bba468,
  sql_stat_start = 0, mysql_has_locked = 1, clust_index_was_generated = 1,
  index = 0x42bc1868, read_just_key = 0, used_in_HANDLER = 0,
  template_type = 0, n_template = 5, null_bitmap_len = 1,
  need_to_access_clustered = 1, templ_contains_blob = 0,
  mysql_template = 0x42d06668, heap = 0x42bc3e28, ins_node = 0x42bc4b08,
  ins_upd_rec_buff = 0x42d07868 "", hint_need_to_fetch_extra_cols = 0,
  upd_node = 0x0, ins_graph = 0x42d07db0, upd_graph = 0x0, pcur = 0x42bc3f68,
  clust_pcur = 0x42bc4668, sel_graph = 0x42d07f40, search_tuple = 0x42bc4868,
  row_id = "\000\000\000\bµ\n", clust_ref = 0x42bc4ac8, select_lock_type = 5,
  stored_select_lock_type = 5, row_read_type = 0, mysql_prefix_len = 1031,
  mysql_row_len = 1031, n_rows_fetched = 1, fetch_direction = 1,
  fetch_cache = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  keep_other_fields_on_keyread = 0, fetch_cache_first = 0, n_fetch_cached = 0,
  blob_heap = 0x0, old_vers_heap = 0x0, magic_n2 = 78540783}
(gdb)

I will look at this in more detail later, but this is the basic reason why there is garbage in TRX_ID in the updated and inserted rows.

This may be the same bug:

http://bugs.mysql.com/bug.php?id=17134

Regards,

Heikki
[20 Mar 2006 20:55] Heikki Tuuri
Mikael,

I have now stepped through the problematic UPDATE. Looks like MySQL opens 8 table handles to the 4 partitions, two for each partition.

It locks each partition with a write lock. Then it, for some reason, in processing the UPDATE does a search of rows in EACH of the 8 table handles. Since MySQL has done a search in those table handles, InnoDB sets prebuilt->sql_stat_start = 0 in every one of them (prebuilt is the InnoDB table handle that corresponds to the MySQL table handle). When the insert is then performed in a table handle of partd, InnoDB does not realize that it should store the TRX_ID, since it thinks the table handle has already been used in an insert.

I also noticed that in processing the SELECTs, MySQL calls strangely many times ::rnd_init(). Why is that?

How we could fix the bug:

1) InnoDB could always store the TRX_ID to the buffer every time row_insert_for_mysql() is called,

or

2) for table partd, in an SQL statement, MySQL could use one table handle for searches and updates, and use the other table handle for INSERTs.

What do you think? Is it intentional that the same table handle is used for searches and inserts in the SAME SQL statement? That has not happened in MySQL before. I am afraid it could cause also other bugs than this if this rule is no longer obeyed.

Regards,

Heikki

(gdb) print prebuilt->sql_stat_start
$68 = 1
(gdb) bt
#0  row_search_for_mysql (buf=0x8f8f3f8 "à\024", mode=1, prebuilt=0x42bc4c68,
    match_mode=0, direction=0) at row0sel.c:3099
#1  0x08301dc9 in ha_innobase::index_read (this=0x8f84918,
    buf=0x42bc4c68 "ïo®\004hλBh¬»B\001", key_ptr=0x0, key_len=1119636584,
    find_flag=HA_READ_AFTER_KEY) at ha_innodb.cc:4124
#2  0x08302324 in ha_innobase::index_first (this=0x8f84918,
    buf=0x42bc4c68 "ïo®\004hλBh¬»B\001") at ha_innodb.cc:4369
#3  0x0830247f in ha_innobase::rnd_next (this=0x8f84918, buf=0x8f8f3f8 "à\024")
    at ha_innodb.cc:4470
#4  0x0832d2d2 in ha_partition::rnd_next (this=0x8f83d60,
    buf=0x8f8f3f8 "à\024") at ha_partition.cc:3051
#5  0x08285d05 in rr_sequential (info=0x4a922c10) at records.cc:365
#6  0x082452d3 in mysql_update (thd=0x8f63030, table_list=0x8f7d600,
    fields=@0x8f63334, values=@0x8f63510, conds=0x8f7daf8,
    order_num=1119636584, order=0x0, limit=18446744073709551613,
    handle_duplicates=DUP_ERROR, ignore=false) at sql_update.cc:393
#7  0x081e4d20 in mysql_execute_command (thd=0x8f63030) at sql_parse.cc:3230
#8  0x081ed0a9 in mysql_parse (thd=0x8f63030, inBuf=0x8f6306c "°O\\\b\001",
    length=63) at sql_parse.cc:5882
#9  0x081e26e1 in dispatch_command (command=150351920, thd=0x8f63030,
    packet=0x8f75531 "UPDATE t1 SET f_int1 = f_int1 + 20 WHERE f_int1 BETWEEN 5 AND 9", packet_length=150459744) at sql_parse.cc:1741
#10 0x081e223d in do_command (thd=0x8f63030) at sql_parse.cc:1537
#11 0x081e15b2 in handle_one_connection (arg=0x8f63030) at sql_parse.cc:1179
#12 0x4016cb63 in start_thread () from /lib/tls/libpthread.so.0
#13 0x402b318a in clone () from /lib/tls/libc.so.6
(gdb) c
Continuing.
[20 Mar 2006 21:11] Mikael Ronström
Heikki,
Yep, this is the behaviour using partitioning. The reason is that an update_row that
changes partitioning is handled through
1) write_row into new partition
2) delete_row from old partition

And in this case the scan was used to read all partitions, I use rnd_init before scanning
and then rnd_end at end of scan and the call to update_row is immediately after a
rnd_next.

So I gather we have two options, either somehow use a new table handler for the write_row
or ensure that InnoDB can handle write_row on a table handle already used for scanning.
[20 Mar 2006 21:13] Mikael Ronström
Heikki,
The reason for 8 partitions is that there is also subpartitions. So there are 4 partitions with
2 subpartitions in each. Thus in total 8 partitions.

The error as I found it occurred in the last 8 partition.
[20 Mar 2006 21:19] Mikael Ronström
Heikki,
A full table scan on a partitioned table is in short handled by:
1) rnd_init on first table partition
2) call rnd_next as long as there are records remaining in table partition
3) rnd_end after last record
Continue with next partition until all partitions are handled

For index scans the same pattern is used except that we use index_init and index_end
instead
[20 Mar 2006 21:34] Heikki Tuuri
Mikael,

ok, I did not realize each named partition was divided into two subpartitions. That explains why there are 8 table handles.

Hmm... calling the below function for every inserted row does not consume much CPU time. That is an easy fix for this particular bug.

I am worried about other possible bugs that reuse of a table handle might cause. Partition code may use a handle within a single statement like this: search -> insert -> search -> update -> insert -> etc. That has not happened in earlier MySQL code.

I will study this more tomorrow.

Thank you,

Heikki

/*********************************************************************
Writes a trx id to an index page. In case that the id size changes in
some future version, this function should be used instead of
mach_write_... */
UNIV_INLINE
void
trx_write_trx_id(
/*=============*/
        byte*   ptr,    /* in: pointer to memory where written */
        dulint  id)     /* in: id */
{
        ut_ad(DATA_TRX_ID_LEN == 6);

        mach_write_to_6(ptr, id);
}

row0ins.c:

        if (node->state == INS_NODE_SET_IX_LOCK) {

                /* It may be that the current session has not yet started
                its transaction, or it has been committed: */

                if (UT_DULINT_EQ(trx->id, node->trx_id)) {
                        /* No need to do IX-locking or write trx id to buf */

                        goto same_trx;
                }

                trx_write_trx_id(node->trx_id_buf, trx->id);
########### FIX: move the above call outside the 'if'
[20 Mar 2006 21:50] Mikael Ronström
Heikki,
Great that you found a work around.
The 2 subpartitions comes from using SUBPARTITIONS 2 in the CREATE TABLE statement.
Within one statement we might do the following.

Example 1:
-------------
write_row
search
update_row/delete_row
search
delete_row/update_row
search
update_row
write_row

Since during search if we find a row requiring an update and it will need to go into another
partition it won't happen directly after the search.
However using index scan with requirements on ordered output we can have the following
pattern.

Example 2:
-------------
search
write_row
update_row/delete_row
write_row
search
update_row/delete_row
write_row
write_row

The reason here is that here we're scanning all partitions in parallel.
Then we perform a merge-sort of the output from the partitions and
send the result out from index_read/index_next. Thus we can read
one record from an index scan, keep it in the buffer of the partition
handler. Handle some records from other handlers that change
partitions => write_row on our handler, this can happen more than
one time before our record is passed to the MySQL layer.

Ok, let's continue tomorrow
[21 Mar 2006 15:57] Heikki Tuuri
I have now pushed the patch below to the Innobase Oy source tree of InnoDB-5.1. It does fix this particular bug. Marko has to review the patch, and I will continue studying what other bugs might follow the from the new way table handles are used in a single SQL statement.

Regards,

Heikki

--- trunk/row/row0ins.c	2006-03-21 10:00:30 UTC (rev 358)
+++ trunk/row/row0ins.c	2006-03-21 15:57:55 UTC (rev 359)
@@ -2451,21 +2451,27 @@
 
 	/* If this is the first time this node is executed (or when
 	execution resumes after wait for the table IX lock), set an
-	IX lock on the table and reset the possible select node. */
+	IX lock on the table and reset the possible select node. MySQL's
+	partitioned table code may also call an insert within the same
+	SQL statement AFTER it has used this tbale handle to do a search.
+	This happens, for example, when a row update moves it to another
+	partition. In that case, we have already set the IX lock on the
+	table during the search operation, and there is no need to set
+	it again here. But we must write thx->id to to node->trx_id_buf. */
 
+	trx_write_trx_id(node->trx_id_buf, trx->id);
+
 	if (node->state == INS_NODE_SET_IX_LOCK) {
 
 		/* It may be that the current session has not yet started
 		its transaction, or it has been committed: */
 
 		if (UT_DULINT_EQ(trx->id, node->trx_id)) {
-			/* No need to do IX-locking or write trx id to buf */
+			/* No need to do IX-locking */
 
 			goto same_trx;
 		}
 
-		trx_write_trx_id(node->trx_id_buf, trx->id);
-
 		err = lock_table(0, node->table, LOCK_IX, thr);
 
 		if (err != DB_SUCCESS) {
[22 Mar 2006 16:48] Heikki Tuuri
I ran a heavy parallelized stress test on range partitioned InnoDB tables (ibtest7, 7a, 9, 9a), and mysqld did not crash. I did not check the correctness of SELECT results.
[24 Apr 2006 11:38] Alexander Ivanov
Fixed in 5.1.10-beta.
[24 Apr 2006 16:43] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 5.1.10 changelog. Closed.