Bug #64284 Eliminate LRU scan when dropping a table
Submitted: 9 Feb 2012 19:33 Modified: 26 Mar 2012 23:09
Reporter: Inaam Rana Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.5, 5.6 OS:Any
Assigned to: Inaam Rana CPU Architecture:Any

[9 Feb 2012 19:33] Inaam Rana
Description:
This is an off shoot of bug#51325. There are two scans of LRU when dropping a table. This can create stalls when very large buffer pools are used.The stalls happen because we are holding the buf_pool::mutex during scan.

Fix for bug#51325 replaced one LRU scan with a flush list scan which is much smaller in size. It also periodically releases buf_pool::mutex.

This bug is to remove the other scan of LRU which happens when we attempt to drop AHI entries for the table.

How to repeat:
see above

Suggested fix:
Mark Callaghan pointed out that the LRU scan to drop AHI entries may not be required. This is because we should have already dropped these entries when freeing up btree pages internally.

Thanks to Mark for his invaluable input.
[26 Mar 2012 23:09] John Russell
Added to changelog for 5.5.23, 5.6.6: 

Improved the performance of the DROP TABLE statement for InnoDB
tables, especially on systems with a large buffer pool.
[26 Mar 2012 23:12] John Russell
Added an extra detail to the changelog entry, to distinguish it from the earlier related bug:

Improved the performance of the DROP TABLE statement for InnoDB
tables, especially on systems with a large buffer pool. The fix
speeds up the processing for freeing entries in the adaptive hash
index.
[20 Jun 2012 19:16] Mark Callaghan
After I backported this to 5.1 I get a very intermittent failure in a stress test I added. While it is possible the problem is my backport or the backport & 5.1 I think the problem is in trunk and perhaps 5.5 after reading the code. Even more interesting I think there might be different problems in 5.5 and trunk.

Prior to backporting the official change I had my own change which handled this case, so I have some experience with the failure which is why I added the test.

When my test case runs there is small chance that fil_io returns DB_TABLESPACE_DELETED. Perhaps I need to tune my test to make that more likely. The call stack is:
fil_io
buf_read_page_low
buf_read_ibuf_merge_pages
ibuf_contract_ext
ibuf_contract_for_n_pages
srv_master_thread

In 5.5 the return of DB_TABLESPACE_DELETED is not handled by buf_read_page_low. There is an assert -- ut_a(*err == DB_SUCCESS). I think this is a problem.

In trunk the return of DB_TABLESPACE_DELETE might be partially handled because of the change to add  ignore_nonexistent_pages for r3231 (Implement WL#5712 InnoDB: preload buffer pool). I am not sure if forcing a return of 0 is correct. It will be correct if something like buf_page_io_complete is still called to undo the x-latch. But I am not familiar with this code.

My patch did two things to avoid these problems:
1) fil_io called a new function I added, buf_page_io_complete_on_error, before returning DB_TABLESPACE_DELETED
2) buf_read_ibuf_merges increments pending ibuf merges for a tablespace in a wrapper around the calls to ibuf_merge_or_delete_for_page in buf_read_ibuf_merge_pages

I will attach the stress test soon and describe the debug-only my.cnf vars I added to make it more deterministic. But I think some code reading is required to determine whether my claims are correct. I spent a lot of time debugging and fixing this problem before abandoning my version of the DROP TABLE performance fix and using the one from Oracle.
[20 Jun 2012 19:22] Mark Callaghan
The test uses these which might be new in the FB patch and innodb_allow_ibuf_merges is debug-only. innodb_ibuf_max_pct_of_buffer_pool limits the size of the ibuf as a function of the buffer pool size.
set global innodb_allow_ibuf_merges=0;
set global innodb_ibuf_max_pct_of_buffer_pool=5;
set global innodb_ibuf_max_pct_of_io_capacity=5;

When srv_allow_ibuf_merges is FALSE, then ibuf_contract_for_n_pages returns immediately.

--------------------

#
# Test fast DROP TABLE and InnoDB insert buffer
#

--source include/have_innodb_plugin.inc
--source include/have_debug.inc

call mtr.add_suppression("InnoDB: Warning: Small buffer pool size.*");
call mtr.add_suppression("InnoDB: Error: trying to do.*");

--disable_warnings
drop table if exists t1;
drop table if exists t2;
--enable_warnings

create table t1(i int primary key auto_increment, b char(250), c char(250), d char(250)) engine=innodb;
alter table t1 add index xb(b);
alter table t1 add index xc(c);
alter table t1 add index xd(d);

create table t2(i int primary key auto_increment, b char(250), c char(250), d char(250)) engine=innodb;
alter table t2 add index xb(b);
alter table t2 add index xc(c);
alter table t2 add index xd(d);

set global innodb_allow_ibuf_merges=0;
set global innodb_ibuf_max_pct_of_buffer_pool=5;
set global innodb_ibuf_max_pct_of_io_capacity=5;

show global status like "Innodb_ibuf_size";

load data infile '../../std_data/innodb_drop_table_ibuf.dat' into table t1 fields terminated by ',';
load data infile '../../std_data/innodb_drop_table_ibuf.dat' into table t2 fields terminated by ',';

select (VARIABLE_VALUE > 10) from information_schema.global_status
where VARIABLE_NAME = 'Innodb_ibuf_size';

drop table t1;
set global innodb_allow_ibuf_merges=1;
set global innodb_ibuf_max_pct_of_buffer_pool=1;
set global innodb_ibuf_max_pct_of_io_capacity=50;

sleep 0.1;
drop table t2;

let $wait_timeout= 180;
let $wait_condition= select (VARIABLE_VALUE < 5) from information_schema.global_status where VARIABLE_NAME = 'Innodb_ibuf_size';
--source include/wait_condition.inc

select (VARIABLE_VALUE > 10) from information_schema.global_status
where VARIABLE_NAME = 'Innodb_drop_table_ibuf_skipped_row';

set global innodb_ibuf_max_pct_of_buffer_pool=default;
set global innodb_ibuf_max_pct_of_io_capacity=default;
[20 Jun 2012 19:35] Mark Callaghan
I also changed InnoDB to let me use a 5MB buffer pool for debug-only tests rather than the current limit of ~30+ MB. This lets the test run in much less time.
[20 Jun 2012 21:01] Mark Callaghan
In the facebook patch at least the race is in buf_read_page_low. The call to buf_page_init_for_read can determine that the tablespace was deleted and return NULL. That is handled.  But nothing is done to guarantee it is not dropped after that call and before the call to fil_io. There are a few calls earlier in the stack that also check whether the tablespace has been deleted, but again, nothing is done to guarantee that going forward.

        /* The following call will also check if the tablespace does not exist
        or is being dropped; if we succeed in initing the page in the buffer
        pool for read, then DISCARD cannot proceed until the read has
        completed */
        bpage = buf_page_init_for_read(err, mode, space, zip_size, unzip,
                                       tablespace_version, offset);
        if (bpage == NULL) {

                return(0);
        }

#ifdef UNIV_DEBUG
        if (buf_debug_prints) {
                fprintf(stderr,
                        "Posting read request for page %lu, sync %lu\n",
                        (ulong) offset,
                        (ulong) sync);
        }
#endif

        ut_ad(buf_page_in_file(bpage));

        diskio_used_for_exit = 0;
        if (sync && trx) {
                diskio_used_for_exit =
                  thd_admission_control_diskio_exit(trx->mysql_thd);
        }
        if (zip_size) {
                *err = _fil_io(OS_FILE_READ | wake_later,
                              sync, space, zip_size, offset, 0, zip_size,
                              bpage->zip.data, bpage,
                              trx ? &trx->table_io_perf : NULL);
        } else {
                ut_a(buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);

                *err = _fil_io(OS_FILE_READ | wake_later,
                              sync, space, 0, offset, 0, UNIV_PAGE_SIZE,
                              ((buf_block_t*) bpage)->frame, bpage,
                              trx ? &trx->table_io_perf : NULL);
        }
        ut_a(*err == DB_SUCCESS);
[22 Jun 2012 1:53] Inaam Rana
Mark,

Thanks for the analysis. I didn't get time to look at the code but I can see what you are saying. DB_TABLESPACE_DELETED needs to be appropriately handled. My guess on why we didn't see it before this fix is that buf_page_init_for_read() would fix the block in the buffer pool and since we used to evict all pages in LRU list before proceeding to delete the file therefore the situation never arose where fil_io() has to return DB_TABLESPACE_DELETED. In the new scheme of things that is a possibility and should be fixed.

I'll stare at the code sometime early next week to verify the above hypothesis. Thanks for the effort though.

Ben,
Can we open a separate bug to track this please?