Bug #17992 | Partitions: InnoDB, somehow rotten table after UPDATE | ||
---|---|---|---|
Submitted: | 6 Mar 2006 20:53 | Modified: | 24 Apr 2006 16:43 |
Reporter: | Matthias Leich | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Partitions | Severity: | S1 (Critical) |
Version: | 5.1 | OS: | |
Assigned to: | Mikael Ronström | CPU Architecture: | Any |
[6 Mar 2006 20:53]
Matthias Leich
[6 Mar 2006 20:54]
Matthias Leich
testscript
Attachment: ml1111.test (application/test, text), 1.93 KiB.
[20 Mar 2006 18:29]
Heikki Tuuri
Mikael, the reason is that when MySQL uses the handle to insert to the InnoDB table, prebuilt->sql_stat_start == 0, and InnoDB does not know to put the current transaction id to the buffer from where it is written to the inserted row. The result is grabage in inserted rows. Is the same table handle used earlier for something else? We see that in ::write_row(): last_query_id == user_thd->query_id Breakpoint 1, row_search_for_mysql (buf=0x8f8eca0 "à\024", mode=2, prebuilt=0x42bc2068, match_mode=1, direction=0) at row0sel.c:3097 3097 dict_index_t* index = prebuilt->index; (gdb) Continuing. Breakpoint 7, ha_innobase::write_row (this=0x4a922a0c, record=0x8f8eca0 "à\034") at ha_innodb.cc:3371 3371 ibool auto_inc_used= FALSE; Current language: auto; currently c++ (gdb) next 484 { (gdb) 3367 row_prebuilt_t* prebuilt = (row_prebuilt_t*)innobase_prebuilt; (gdb) 3366 { (gdb) 484 { (gdb) 3397 if (table->timestamp_field_type & TIMESTAMP_AUTO_SET_ON_INSERT) (gdb) 3400 if ((user_thd->lex->sql_command == SQLCOM_ALTER_TABLE (gdb) 3468 num_write_row++; (gdb) 3470 if (last_query_id != user_thd->query_id) { (gdb) 3468 num_write_row++; (gdb) c Continuing. Breakpoint 4, row_insert_for_mysql (mysql_rec=0x8f8eca0 "à\034", prebuilt=0x42bc4468) at row0mysql.c:1070 1070 trx_t* trx = prebuilt->trx; Current language: auto; currently c (gdb) print *prebuilt $21 = {magic_n = 78540783, table = 0x42bbfe68, trx = 0x42bba468, sql_stat_start = 0, mysql_has_locked = 1, clust_index_was_generated = 1, index = 0x42bc1868, read_just_key = 0, used_in_HANDLER = 0, template_type = 0, n_template = 5, null_bitmap_len = 1, need_to_access_clustered = 1, templ_contains_blob = 0, mysql_template = 0x42d06668, heap = 0x42bc3e28, ins_node = 0x42bc4b08, ins_upd_rec_buff = 0x42d07868 "", hint_need_to_fetch_extra_cols = 0, upd_node = 0x0, ins_graph = 0x42d07db0, upd_graph = 0x0, pcur = 0x42bc3f68, clust_pcur = 0x42bc4668, sel_graph = 0x42d07f40, search_tuple = 0x42bc4868, row_id = "\000\000\000\bµ\n", clust_ref = 0x42bc4ac8, select_lock_type = 5, stored_select_lock_type = 5, row_read_type = 0, mysql_prefix_len = 1031, mysql_row_len = 1031, n_rows_fetched = 1, fetch_direction = 1, fetch_cache = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, keep_other_fields_on_keyread = 0, fetch_cache_first = 0, n_fetch_cached = 0, blob_heap = 0x0, old_vers_heap = 0x0, magic_n2 = 78540783} (gdb) I will look at this in more detail later, but this is the basic reason why there is garbage in TRX_ID in the updated and inserted rows. This may be the same bug: http://bugs.mysql.com/bug.php?id=17134 Regards, Heikki
[20 Mar 2006 20:55]
Heikki Tuuri
Mikael, I have now stepped through the problematic UPDATE. Looks like MySQL opens 8 table handles to the 4 partitions, two for each partition. It locks each partition with a write lock. Then it, for some reason, in processing the UPDATE does a search of rows in EACH of the 8 table handles. Since MySQL has done a search in those table handles, InnoDB sets prebuilt->sql_stat_start = 0 in every one of them (prebuilt is the InnoDB table handle that corresponds to the MySQL table handle). When the insert is then performed in a table handle of partd, InnoDB does not realize that it should store the TRX_ID, since it thinks the table handle has already been used in an insert. I also noticed that in processing the SELECTs, MySQL calls strangely many times ::rnd_init(). Why is that? How we could fix the bug: 1) InnoDB could always store the TRX_ID to the buffer every time row_insert_for_mysql() is called, or 2) for table partd, in an SQL statement, MySQL could use one table handle for searches and updates, and use the other table handle for INSERTs. What do you think? Is it intentional that the same table handle is used for searches and inserts in the SAME SQL statement? That has not happened in MySQL before. I am afraid it could cause also other bugs than this if this rule is no longer obeyed. Regards, Heikki (gdb) print prebuilt->sql_stat_start $68 = 1 (gdb) bt #0 row_search_for_mysql (buf=0x8f8f3f8 "à\024", mode=1, prebuilt=0x42bc4c68, match_mode=0, direction=0) at row0sel.c:3099 #1 0x08301dc9 in ha_innobase::index_read (this=0x8f84918, buf=0x42bc4c68 "ïo®\004hλBh¬»B\001", key_ptr=0x0, key_len=1119636584, find_flag=HA_READ_AFTER_KEY) at ha_innodb.cc:4124 #2 0x08302324 in ha_innobase::index_first (this=0x8f84918, buf=0x42bc4c68 "ïo®\004hλBh¬»B\001") at ha_innodb.cc:4369 #3 0x0830247f in ha_innobase::rnd_next (this=0x8f84918, buf=0x8f8f3f8 "à\024") at ha_innodb.cc:4470 #4 0x0832d2d2 in ha_partition::rnd_next (this=0x8f83d60, buf=0x8f8f3f8 "à\024") at ha_partition.cc:3051 #5 0x08285d05 in rr_sequential (info=0x4a922c10) at records.cc:365 #6 0x082452d3 in mysql_update (thd=0x8f63030, table_list=0x8f7d600, fields=@0x8f63334, values=@0x8f63510, conds=0x8f7daf8, order_num=1119636584, order=0x0, limit=18446744073709551613, handle_duplicates=DUP_ERROR, ignore=false) at sql_update.cc:393 #7 0x081e4d20 in mysql_execute_command (thd=0x8f63030) at sql_parse.cc:3230 #8 0x081ed0a9 in mysql_parse (thd=0x8f63030, inBuf=0x8f6306c "°O\\\b\001", length=63) at sql_parse.cc:5882 #9 0x081e26e1 in dispatch_command (command=150351920, thd=0x8f63030, packet=0x8f75531 "UPDATE t1 SET f_int1 = f_int1 + 20 WHERE f_int1 BETWEEN 5 AND 9", packet_length=150459744) at sql_parse.cc:1741 #10 0x081e223d in do_command (thd=0x8f63030) at sql_parse.cc:1537 #11 0x081e15b2 in handle_one_connection (arg=0x8f63030) at sql_parse.cc:1179 #12 0x4016cb63 in start_thread () from /lib/tls/libpthread.so.0 #13 0x402b318a in clone () from /lib/tls/libc.so.6 (gdb) c Continuing.
[20 Mar 2006 21:11]
Mikael Ronström
Heikki, Yep, this is the behaviour using partitioning. The reason is that an update_row that changes partitioning is handled through 1) write_row into new partition 2) delete_row from old partition And in this case the scan was used to read all partitions, I use rnd_init before scanning and then rnd_end at end of scan and the call to update_row is immediately after a rnd_next. So I gather we have two options, either somehow use a new table handler for the write_row or ensure that InnoDB can handle write_row on a table handle already used for scanning.
[20 Mar 2006 21:13]
Mikael Ronström
Heikki, The reason for 8 partitions is that there is also subpartitions. So there are 4 partitions with 2 subpartitions in each. Thus in total 8 partitions. The error as I found it occurred in the last 8 partition.
[20 Mar 2006 21:19]
Mikael Ronström
Heikki, A full table scan on a partitioned table is in short handled by: 1) rnd_init on first table partition 2) call rnd_next as long as there are records remaining in table partition 3) rnd_end after last record Continue with next partition until all partitions are handled For index scans the same pattern is used except that we use index_init and index_end instead
[20 Mar 2006 21:34]
Heikki Tuuri
Mikael, ok, I did not realize each named partition was divided into two subpartitions. That explains why there are 8 table handles. Hmm... calling the below function for every inserted row does not consume much CPU time. That is an easy fix for this particular bug. I am worried about other possible bugs that reuse of a table handle might cause. Partition code may use a handle within a single statement like this: search -> insert -> search -> update -> insert -> etc. That has not happened in earlier MySQL code. I will study this more tomorrow. Thank you, Heikki /********************************************************************* Writes a trx id to an index page. In case that the id size changes in some future version, this function should be used instead of mach_write_... */ UNIV_INLINE void trx_write_trx_id( /*=============*/ byte* ptr, /* in: pointer to memory where written */ dulint id) /* in: id */ { ut_ad(DATA_TRX_ID_LEN == 6); mach_write_to_6(ptr, id); } row0ins.c: if (node->state == INS_NODE_SET_IX_LOCK) { /* It may be that the current session has not yet started its transaction, or it has been committed: */ if (UT_DULINT_EQ(trx->id, node->trx_id)) { /* No need to do IX-locking or write trx id to buf */ goto same_trx; } trx_write_trx_id(node->trx_id_buf, trx->id); ########### FIX: move the above call outside the 'if'
[20 Mar 2006 21:50]
Mikael Ronström
Heikki, Great that you found a work around. The 2 subpartitions comes from using SUBPARTITIONS 2 in the CREATE TABLE statement. Within one statement we might do the following. Example 1: ------------- write_row search update_row/delete_row search delete_row/update_row search update_row write_row Since during search if we find a row requiring an update and it will need to go into another partition it won't happen directly after the search. However using index scan with requirements on ordered output we can have the following pattern. Example 2: ------------- search write_row update_row/delete_row write_row search update_row/delete_row write_row write_row The reason here is that here we're scanning all partitions in parallel. Then we perform a merge-sort of the output from the partitions and send the result out from index_read/index_next. Thus we can read one record from an index scan, keep it in the buffer of the partition handler. Handle some records from other handlers that change partitions => write_row on our handler, this can happen more than one time before our record is passed to the MySQL layer. Ok, let's continue tomorrow
[21 Mar 2006 15:57]
Heikki Tuuri
I have now pushed the patch below to the Innobase Oy source tree of InnoDB-5.1. It does fix this particular bug. Marko has to review the patch, and I will continue studying what other bugs might follow the from the new way table handles are used in a single SQL statement. Regards, Heikki --- trunk/row/row0ins.c 2006-03-21 10:00:30 UTC (rev 358) +++ trunk/row/row0ins.c 2006-03-21 15:57:55 UTC (rev 359) @@ -2451,21 +2451,27 @@ /* If this is the first time this node is executed (or when execution resumes after wait for the table IX lock), set an - IX lock on the table and reset the possible select node. */ + IX lock on the table and reset the possible select node. MySQL's + partitioned table code may also call an insert within the same + SQL statement AFTER it has used this tbale handle to do a search. + This happens, for example, when a row update moves it to another + partition. In that case, we have already set the IX lock on the + table during the search operation, and there is no need to set + it again here. But we must write thx->id to to node->trx_id_buf. */ + trx_write_trx_id(node->trx_id_buf, trx->id); + if (node->state == INS_NODE_SET_IX_LOCK) { /* It may be that the current session has not yet started its transaction, or it has been committed: */ if (UT_DULINT_EQ(trx->id, node->trx_id)) { - /* No need to do IX-locking or write trx id to buf */ + /* No need to do IX-locking */ goto same_trx; } - trx_write_trx_id(node->trx_id_buf, trx->id); - err = lock_table(0, node->table, LOCK_IX, thr); if (err != DB_SUCCESS) {
[22 Mar 2006 16:48]
Heikki Tuuri
I ran a heavy parallelized stress test on range partitioned InnoDB tables (ibtest7, 7a, 9, 9a), and mysqld did not crash. I did not check the correctness of SELECT results.
[24 Apr 2006 11:38]
Alexander Ivanov
Fixed in 5.1.10-beta.
[24 Apr 2006 16:43]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bugfix, yourself. More information about accessing the source trees is available at http://www.mysql.com/doc/en/Installing_source_tree.html Additional info: Documented bugfix in 5.1.10 changelog. Closed.