Bug #51325 Dropping an empty innodb table takes a long time with large buffer pool
Submitted: 19 Feb 2010 14:21    Modified: 7 Feb 2012 23:44
Reporter: Vojtech Kurka
Status: Closed
Category: MySQL Server: InnoDB Plugin storage engine    Severity: S3 (Non-critical)
Version: 5.1.42, 5.1.47, 5.5.13    OS: Linux (CentOS 5.4)
Assigned to: Inaam Rana    CPU Architecture: Any
Tags: Drop, innodb, lockup, table

[19 Feb 2010 14:21] Vojtech Kurka
Description:
InnoDB plugin 1.0.3/MySQL 5.1.37, InnoDB plugin 1.0.6/MySQL 5.1.42 (XtraDB 9)

innodb_file_per_table

Using innodb_buffer_pool_size=20G:

mysql> CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;
Query OK, 0 rows affected (0.00 sec)

mysql> DROP TABLE temp_test;
Query OK, 0 rows affected (1.10 sec)

Using innodb_buffer_pool_size=40G:

mysql> CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;
Query OK, 0 rows affected (0.41 sec)

mysql> DROP TABLE temp_test;
Query OK, 0 rows affected (2.66 sec)

Both boxes have identical config; only the buffer pool is set lower on the first one. Both are production machines that have been running for a few days. The second one did not get any load during the test (it's a slave and the slave threads were stopped). There were around 20K dirty pages at the time of the test.

If I restart mysqld, I see the following while the buffer pool is filling up:

mysql>  show status like 'Innodb_buffer_pool_pages_data';CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;DROP TABLE temp_test;
+-------------------------------+--------+
| Variable_name                 | Value  |
+-------------------------------+--------+
| Innodb_buffer_pool_pages_data | 290323 |
+-------------------------------+--------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.18 sec)

mysql>  show status like 'Innodb_buffer_pool_pages_data';CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;DROP TABLE temp_test;
+-------------------------------+--------+
| Variable_name                 | Value  |
+-------------------------------+--------+
| Innodb_buffer_pool_pages_data | 339110 |
+-------------------------------+--------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.31 sec)

mysql> show status like 'Innodb_buffer_pool_pages_data';CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;DROP TABLE temp_test;
+-------------------------------+--------+
| Variable_name                 | Value  |
+-------------------------------+--------+
| Innodb_buffer_pool_pages_data | 842694 |
+-------------------------------+--------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.50 sec)

mysql> show status like 'Innodb_buffer_pool_pages_data';CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;DROP TABLE temp_test;
+-------------------------------+--------+
| Variable_name                 | Value  |
+-------------------------------+--------+
| Innodb_buffer_pool_pages_data | 959377 |
+-------------------------------+--------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.42 sec)

mysql> show status like 'Innodb_buffer_pool_pages_data';CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;DROP TABLE temp_test;
+-------------------------------+---------+
| Variable_name                 | Value   |
+-------------------------------+---------+
| Innodb_buffer_pool_pages_data | 2664087 |
+-------------------------------+---------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (1.75 sec)

The problem is painful because the whole InnoDB engine is locked while the DROP TABLE is running (http://forge.mysql.com/worklog/task.php?id=3983). MySQL connections stay in the "Opening tables" state.

How to repeat:
Start mysqld and run a benchmark to fill the buffer pool.

mysql> CREATE TABLE temp_test (col1 INT) ENGINE=INNODB;
mysql> DROP TABLE temp_test;

Suggested fix:
don't know
[26 Feb 2010 16:23] Mark Callaghan
Part of this may be a duplicate of http://bugs.mysql.com/bug.php?id=41158

However, if InnoDB scans the buffer pool during drop table to remove pages for the table then this can be a new bug. It just needs more details.
[26 Feb 2010 17:52] Vojtech Kurka
poor man's profiler output

Attachment: poor_mans_profiler_output.txt (text/plain), 4.53 KiB.

[26 Feb 2010 18:23] Mark Callaghan
Thanks, the problem stack trace is:
  buf_LRU_invalidate_tablespace
  fil_delete_tablespace
  row_drop_table_for_mysql
  ha_innobase::delete_table
  ha_delete_table
  mysql_rm_table_part2
  mysql_rm_table
  mysql_execute_command
[1 Mar 2010 8:11] Marko Mäkelä
This looks like a duplicate of Bug #35077.
[24 May 2010 21:55] Vojtech Kurka
I'm still able to repeat the test case on 5.1.47-log; this is NOT fixed.
[25 May 2010 15:08] Mikhail Izioumtchenko
I'd say the complaint is that InnoDB scans the buffer pool on DROP.
It is by design, and improving this could be difficult.
The best approach imo would be lazy invalidation; at the very least
we'll need a very fast way to tell that space id N is no longer valid.
[27 May 2010 13:20] Marko Mäkelä
This looks like a feature request to me. If buf_LRU_drop_page_hash_for_tablespace() is not taking too much time, we could make buf_LRU_invalidate_tablespace() do just that, and lazily evict the pages of dropped tablespaces in buf_LRU_get_free_block(). But doing that could have the penalty of doing a space_id lookup for every block encountered. Perhaps we could keep an array of recently dropped space_id?
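To make the per-block cost of that lookup concrete, here is a minimal standalone sketch of the "recently dropped space_id" array idea. This is not InnoDB source; all names are hypothetical, and a real implementation would need locking and a policy for a full array:

#include <stdbool.h>
#include <stddef.h>

#define DROPPED_SPACES_MAX 32   /* deliberately small */

static unsigned long    dropped_spaces[DROPPED_SPACES_MAX];
static size_t           dropped_spaces_n = 0;

/* Called from DROP TABLE instead of scanning the LRU list. */
bool
dropped_spaces_add(unsigned long space_id)
{
        if (dropped_spaces_n == DROPPED_SPACES_MAX) {
                return false;   /* caller falls back to the eager scan */
        }
        dropped_spaces[dropped_spaces_n++] = space_id;
        return true;
}

/* The extra per-block cost buf_LRU_get_free_block() would pay before
   reusing a block: a linear probe of a short array. */
bool
space_recently_dropped(unsigned long space_id)
{
        size_t  i;

        for (i = 0; i < dropped_spaces_n; i++) {
                if (dropped_spaces[i] == space_id) {
                        return true;    /* stale page: free, don't flush */
                }
        }
        return false;
}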
[27 May 2010 13:26] Mikhail Izioumtchenko
We should account for the fact that it could be a long array. I'd say we should
have a flag in whatever structure we have for the table or space saying that the
space is no longer valid; we keep the DROPped structures but flag them, and
think separately about how to get rid of those eventually.
The page hash drop could probably be made lazy, too.
It's definitely a feature request; any solution would
require extensive testing and would risk adversely affecting the performance
balance, e.g. in flushing.
[23 Jun 2010 18:44] Marko Mäkelä
I have a feeling that this problem could be alleviated in MySQL 5.5 by setting innodb_buffer_pool_instances to a large value and by tuning the mapping of buf_pool_get(space_id,page_number) in such a way that buf_LRU_invalidate_tablespace() would not have to scan all LRU lists, but just those that could contain the space_id of the table that is being dropped. Currently, buf_LRU_invalidate_tablespace() does scan all the LRU list entries even in MySQL 5.5.
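A standalone sketch of that mapping trade-off (a hypothetical hash, not the actual buf_pool_get() code): when the page number participates in the instance hash, as in pick_instance_by_page() below, the pages of one tablespace spread over every instance and a drop must scan them all; hashing on the space id alone would confine each table to a single instance, at the cost of concentrating a hot table's load on one LRU list and one mutex.

#include <stdio.h>

#define N_INSTANCES 8   /* innodb_buffer_pool_instances */

/* Current style: both the space id and the page number pick the
   instance, so one tablespace spreads over every instance. */
unsigned
pick_instance_by_page(unsigned long space_id, unsigned long page_no)
{
        return (unsigned) ((space_id * 31 + page_no) % N_INSTANCES);
}

/* Alternative: the space id alone picks the instance, so
   invalidation would only have to scan one LRU list per table. */
unsigned
pick_instance_by_space(unsigned long space_id)
{
        return (unsigned) (space_id % N_INSTANCES);
}

int
main(void)
{
        unsigned long   page_no;

        for (page_no = 0; page_no < 4; page_no++) {
                printf("space 42, page %lu -> instance %u (by page), "
                       "%u (by space)\n", page_no,
                       pick_instance_by_page(42, page_no),
                       pick_instance_by_space(42));
        }
        return 0;
}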
[24 Jun 2010 11:32] Sunny Bains
One other possible solution, discussed with Marko on IRC, is:

  1. Use the purge thread(s) to do the actual scanning and invalidation of the buffer pool block lists.

  2. A thread can work in parallel on a different buffer pool.

  3. The task can be run in the background.

  4. The threads don't have to monopolise the buffer pool mutex; they can do it piecemeal.
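A minimal standalone sketch of points 1-4 (pthreads, hypothetical names; not an actual patch): one worker per buffer pool instance invalidates its own lists, and each worker would release its instance's mutex between small batches so nothing is monopolised. In the real server the work would be handed to background purge threads rather than joined synchronously, as point 3 suggests.

#include <pthread.h>
#include <stdio.h>

#define N_POOL_INSTANCES 4

typedef struct {
        int             instance_no;
        unsigned long   space_id;       /* tablespace being dropped */
} invalidate_job_t;

static void*
invalidate_worker(void* arg)
{
        invalidate_job_t*       job = (invalidate_job_t*) arg;

        /* ... scan only buffer pool instance job->instance_no here,
        releasing that instance's mutex between small batches ... */
        printf("instance %d: invalidating space %lu\n",
               job->instance_no, job->space_id);
        return NULL;
}

void
invalidate_tablespace_parallel(unsigned long space_id)
{
        pthread_t               tid[N_POOL_INSTANCES];
        invalidate_job_t        job[N_POOL_INSTANCES];
        int                     i;

        for (i = 0; i < N_POOL_INSTANCES; i++) {
                job[i].instance_no = i;
                job[i].space_id = space_id;
                pthread_create(&tid[i], NULL, invalidate_worker, &job[i]);
        }
        for (i = 0; i < N_POOL_INSTANCES; i++) {
                pthread_join(tid[i], NULL);
        }
}

int
main(void)
{
        invalidate_tablespace_parallel(42);
        return 0;
}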
[19 May 2011 20:15] MySQL Verification Team
Bug: http://bugs.mysql.com/bug.php?id=61188 duplicate of this one.
[24 May 2011 21:03] Jonathan Stimmel
As described in bug 61188, this problem can scale to horrific proportions (hours).

FWIW, I like the sound of lazy invalidation, even if it's only a short-term or partial solution. If I'm dropping a small or unused table, the cost of scanning the buffer pool isn't worth the few pages it will free. Even if I'm dropping a large table, I likely wouldn't see enough (if any) benefit from freeing a large chunk of the buffer pool to justify the interruption in service, so I'd still be willing to wait for the pages to age and expire through normal activity.
[24 May 2011 22:08] Mark Callaghan
I will guess from #61188 that dropping a partitioned table does the invalidate scan once per partition rather than once per table. That sounds much more painful.
[24 May 2011 23:55] James Day
Jonathan, there are two different issues:

1. The in-memory scan of the buffer pool, which is done once per ibd file being dropped. This is too slow: it is done while holding a lock in the foreground and can take seconds per tablespace. Multiple partitions will presumably do it once per partition. Reports saying how long it took, how big the buffer pool is, and, if applicable, how many partitions are involved may be helpful in prioritising this. This is not fixed. The delay is likely to be linearly proportional to the buffer pool size and the number of partitions.

It's unlikely to be possible for this to take hours with any sensible number of partitions, given that partitions currently tend to slow down normal operation if there are more than a hundred or so.

Seconds or even milliseconds of exclusive lock duration aren't really acceptable in the server: they are long enough to cause production disruption for those with high query rates, or SLA breaches on response time. 100 microseconds is probably the sensible upper-limit target for the server, and even that might be troublesome for some places with tight SLAs or very high connection rates (which could go over max_connections during the delay). Under 10 microseconds is probably a more desirable target, if it's achievable.

A partial workaround with the partition case would be to drop partitions individually to space out the work.

2. The freeing of the disk space and the deletion of the ibd file. This is disk-bound but non-blocking, and could take hours or days if the file has a large number of fragments in the operating system. The heavy disk I/O could greatly slow down foreground tasks. If this is still not good enough, a new bug for it might be interesting. Avoiding the global lock was the fix for this that was added in 5.5. The time for this to complete will depend on the number of fragments and the storage device speed. Longer term it'd be nice to find some way to regulate the rate at which the disk work is done, to prevent heavy disk I/O from causing performance issues.
[25 May 2011 16:41] Jonathan Stimmel
I showed in bug 61188 that disk I/O was negligible during the drops, while MySQL had a thread pegging a CPU at 99.9%.

When I sampled the performance schema, it showed only wait/synch/mutex/innodb/buf_pool_zip_mutex events (also documented in that bug). This feels to me like the buffer pool scan must decompress each compressed page (which is probably the majority in my case), but if that were the case, I would expect dropping even empty tables to take O(minutes) instead of O(seconds).
[24 Jun 2011 4:07] Valeriy Kravchuk
Bug #61621 was marked as a duplicate of this one.
[24 Jun 2011 20:05] Chris Wagner
Would it be a valid workaround to shut down the server and restart it so that the buffer pool is empty, and then do the drop? This wouldn't be valid on a busy production box, but it could work for me in dev.
[29 Jun 2011 16:44] Chris Wagner
Does this also apply to ALTER TABLE, since it effectively does a DROP/CREATE?
[4 Jul 2011 8:28] Simon Mudd
Chris, I'm suffering from the same issue as you (on 5.5.13 with compressed tables; I have a 40000M buffer pool and this issue is really quite frustrating).

To answer your question: bouncing mysqld does appear to remove the long delays, which is good, but having to bounce mysqld and then run with a cold cache is not really a viable solution on actively used db servers.
[12 Jul 2011 22:12] Patrick Casey
Does it have to be all that complicated a fix?

Why even bother invalidating the LRU cache?
Those blocks in the cache which reference the recently dropped table will naturally fall off the LRU list sooner or later.

For my use cases I'd much rather have some "dead" blocks temporarily in the cache than lock up the whole server for 10 seconds or so.

Or am I misunderstanding the purpose of the buffer cache scan?
[13 Jul 2011 23:17] Chris Wagner
I completely agree with the idea of letting the LRU algorithm handle the buffer purge. If people are squeamish about that, then it can be a config item: either brute-force purge the buffer pool or let the LRU deal with it over time.

Also I disagree with this being rated S3.  This is an issue that completely owns my database for a long time.  This should be S1.
[26 Aug 2011 17:56] MySQL Guru
I have to drop some really large tables tomorrow, so I think the best option for me is to bounce the server and then drop tables.

Is the issue a concern for servers where innodb_file_per_table is set to 0? If yes, why is InnoDB not scanning the buffer pool in the case of one big file? If it is letting the LRU algorithm deal with freeing up space, then I guess we can do the same thing for innodb_file_per_table=1.
[26 Aug 2011 19:09] James Day
This does not affect tables in the shared tablespace, created there while innodb_file_per_table = 0.

Anyone affected by this might also be affected by the compressed-table-only bug http://bugs.mysql.com/bug.php?id=61188, which was fixed in 5.5.15 and 5.1.59. If you're using compressed tables and having drop speed issues, you should get that fix.

Please read the earlier comments by Marko Mäkelä, Mikhail Izioumtchenko and Sunny Bains about some ways we could handle this. All three are members of the InnoDB development team at Oracle.

LRU eviction might seem easiest, but the extra checking that would have to be done on most page operations would just shift a speed penalty onto all page operations, resulting in a generally slower server. Better options than that are being pursued, but we don't have anything to announce yet.
[21 Sep 2011 18:17] Mark Callaghan
With a 60G buffer pool and ~3.2M pages allocated from it, creating and dropping an empty innodb table locks the buffer pool mutex for ~0.5 seconds because it scans the LRU twice. Each of the calls below locks the buffer pool mutex for ~0.25 seconds. 
* once in buf_LRU_invalidate_tablespace
* once in buf_LRU_drop_page_hash_for_tablespace called by buf_LRU_invalidate_tablespace

Note that dict_sys->mutex is X-locked for the entire time which can cause even more pileups.

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 65903001600; in additional pool allocated 0
Dictionary memory allocated 69090
Buffer pool size   3932159
Free buffers       303787
Database pages     3291425
Old database pages 1215018

Would it be possible to scan the buffer pool without iterating over the LRU? In that case the buffer pool mutex could be released/relocked more frequently. I definitely think that should be done for buf_LRU_drop_page_hash_for_tablespace, as that only has to be best-effort.

        for (chunk_num = 0; chunk_num < buf_pool->n_chunks; ++chunk_num) {
                ulint           block_num;
                ulint           chunk_size;     /* set by buf_get_nth_chunk_block() */
                buf_block_t*    block   = buf_get_nth_chunk_block(buf_pool, chunk_num, &chunk_size);
                int             x       = 0;

                for (block_num = 0; block_num < chunk_size; ++block_num, ++block) {
                        mutex_t*        block_mutex;
                        buf_page_t*     bpage;

                        if (x >= srv_friendly_drop_table_batch) {
                                buf_pool_mutex_exit();
                                buf_pool_mutex_enter();
                                x = 0;
                        }

check_block_again:
                        bpage = &block->page;
                        if (!bpage) {
                                fprintf(stderr, "InnoDB: friendly drop found block "
                                        "with NULL page\n");
                                continue;
                        }
[21 Sep 2011 19:23] Mark Callaghan
I then ran a test using 8 sessions doing fetch-1-row-by-PK via sysbench. The database was cached. Another session did nothing but create and drop an InnoDB table in a loop. Throughput in QPS during the test:
* 40 QPS --> original mysql
* 54 QPS --> buf_LRU_drop_page_hash_for_tablespace hacked to not iterate on LRU
* 79 QPS --> don't call buf_LRU_drop_page_hash_for_tablespace
[21 Sep 2011 19:43] Mark Callaghan
And if I replace buf_LRU_invalidate_tablespace with something that iterates over the buffer pool chunks and releases/relocks the buffer pool mutex every 10000 pages, then InnoDB sustains 130 QPS under this workload. But I don't know whether such a change is correct or whether it can miss pages.
[21 Sep 2011 20:39] Inaam Rana
Mark,

One fact to note is that chunk->blocks contains buf_block_t descriptors. A compressed page which is not carrying a decompressed frame with it is represented by buf_page_t, and this buf_page_t is allocated using ut_malloc(), i.e. it is not part of chunk->blocks.

Having said that, your idea of optimizing buf_LRU_drop_page_hash_for_tablespace() by using chunk->blocks instead of the LRU list may work, because compressed-only pages are never part of the AHI. And as you mentioned, it is a best-effort thing, therefore we won't mind if we miss a hashed block when we release buf_pool->mutex. Right now Marko is working on a kind of redesign of the AHI where this may no longer be an issue. For example, if we have AHI on a per table/index basis then all we probably need to do is to drop the entire AHI for a table that is being dropped. However, this is still WIP, and if we end up iterating over the LRU list then your suggestion can come in handy. Thanks for the valuable input.

As for buf_LRU_invalidate_tablespace(), using chunks is not an option there because, as I mentioned above, we would miss the compressed-only buf_page_t descriptors entirely. I am currently working on a patch to iterate over the flush_list instead of the LRU list. We'll just remove the blocks from the flush_list and let them age out of the LRU list lazily. The idea is that the flush_list is typically much smaller than the LRU list. I have also coded in a mechanism where we release buf_pool->mutex and flush_list->mutex after scanning every 1K pages, thus helping sustain throughput during the scan phase.
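A standalone model of that staggered flush_list scan (hypothetical types; the eventual fix is named later in this thread as buf_LRU_remove_dirty_pages_for_tablespace()): detach matching dirty pages from the list without issuing any write, and drop the mutex every 1K pages.

#include <stddef.h>
#include <pthread.h>

typedef struct page {
        unsigned long   space_id;
        struct page*    next;   /* next older dirty page */
        struct page*    prev;   /* next newer dirty page */
} page_t;

typedef struct {
        page_t*         newest; /* head of the dirty-page list */
        pthread_mutex_t mutex;  /* stands in for flush_list->mutex */
} flush_list_t;

#define SCAN_BATCH 1024

void
remove_dirty_pages_for_space(flush_list_t* fl, unsigned long id)
{
        size_t  scanned = 0;
        page_t* p;

        pthread_mutex_lock(&fl->mutex);
        p = fl->newest;
        while (p != NULL) {
                page_t* next = p->next;

                if (p->space_id == id) {
                        /* Detach from the flush list only: no write is
                        issued, and the page ages out of the LRU list
                        lazily. */
                        if (p->prev != NULL) {
                                p->prev->next = p->next;
                        } else {
                                fl->newest = p->next;
                        }
                        if (p->next != NULL) {
                                p->next->prev = p->prev;
                        }
                }

                if (++scanned % SCAN_BATCH == 0) {
                        /* Yield the mutex so other threads can make
                        progress.  The real patch must restart its scan
                        here because the list can change while the mutex
                        is released; this single-threaded model can
                        safely continue from the saved pointer. */
                        pthread_mutex_unlock(&fl->mutex);
                        pthread_mutex_lock(&fl->mutex);
                }
                p = next;
        }
        pthread_mutex_unlock(&fl->mutex);
}

int
main(void)
{
        page_t          a = { 7, NULL, NULL };
        page_t          b = { 9, NULL, NULL };
        flush_list_t    fl = { NULL, PTHREAD_MUTEX_INITIALIZER };

        /* newest-first list: a <-> b */
        a.next = &b;
        b.prev = &a;
        fl.newest = &a;

        remove_dirty_pages_for_space(&fl, 9);
        return 0;
}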
[21 Sep 2011 23:32] Mark Callaghan
If I use buf_pool->chunks then that might do extra work when there are many more blocks encountered there than on the LRU. The code does two scans of the LRU today, and not calling buf_LRU_drop_page_hash_for_tablespace to get rid of the first scan might help. However, that increases the chance of a rescan of the LRU in buf_LRU_invalidate_tablespace because either:
1) a block with IO pending was found or
2) a hash page was found

I am skeptical that calling buf_LRU_drop_page_hash_for_tablespace provides much protection for #1 (all it does is give the IO ~0.2 seconds more to finish). I also doubt that the call is needed for #2, as earlier in the drop table processing there were calls from fseg_free_extent -> btr_search_drop_page_hash_when_freed that removed most/all pages from the AHI for the dropped tablespace.

So I think the call to buf_LRU_drop_page_hash_for_tablespace is an optimization that ends up doing more harm than good. I will add a my.cnf option to allow that call to be disabled, and then add additional monitoring to see what goes on in buf_LRU_invalidate_tablespace in production and measure:
* how many times rescans are done
* why rescans are done
* how long the buffer pool mutex is locked

What I hope to do is reduce the stall in half (from 0.5 seconds to 0.25 seconds).
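A minimal sketch of the kind of timing instrumentation this implies (standalone C; the output format echoes the debug line quoted further down in this thread, but the helper is hypothetical; InnoDB itself has ut_time_us()-style utilities):

#include <stdio.h>
#include <sys/time.h>

/* Wall-clock seconds as a double, via gettimeofday(). */
static double
now_seconds(void)
{
        struct timeval  tv;

        gettimeofday(&tv, NULL);
        return ((double) tv.tv_sec + (double) tv.tv_usec / 1e6);
}

int
main(void)
{
        double  t0;
        double  phase1;
        double  phase2;

        t0 = now_seconds();
        /* phase 1: buf_pool mutex held, buf_LRU_drop_page_hash_for_tablespace */
        phase1 = now_seconds() - t0;

        t0 = now_seconds();
        /* phase 2: buf_pool mutex held, buf_LRU_invalidate_tablespace */
        phase2 = now_seconds() - t0;

        fprintf(stderr, "InnoDB: locked seconds: (%f,%f) (phase1,phase2)\n",
                phase1, phase2);
        return 0;
}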
[22 Sep 2011 22:27] Mark Callaghan
I added an option to disable the call to buf_LRU_drop_page_hash_for_tablespace, and now I want a function that flushes a table from the InnoDB LRU list, flush list and adaptive hash index. This will be run prior to the DROP TABLE command; the code below is our second attempt.

Can you comment on this?

I don't understand the comments in buf_page_struct on the use of the block/page mutex versus the buffer pool mutex. Can I read "state" while holding the buffer pool mutex but not the page mutex? That appears to be allowed based on existing usage.

void
buf_uncache_tablespace(
/*==============================*/
        ulint   id)     /*!< in: space id */
{
        ulint   chunk_num;
        ulint   n_flushed = 0;

        buf_pool_mutex_enter();

        for (chunk_num = 0; chunk_num < buf_pool->n_chunks; ++chunk_num) {
                ulint           chunk_size;
                ulint           block_num;
                buf_block_t*    block   = buf_get_nth_chunk_block(buf_pool,
                                                        chunk_num, &chunk_size);
                int             n_checked = 0;

                for (block_num = 0; block_num < chunk_size; ++block_num, ++block) {
                        mutex_t*        block_mutex;
                        buf_page_t*     bpage;

                        ++n_checked;
                        if (n_checked >= srv_uncache_table_batch) {
                                buf_pool_mutex_exit();
                                os_thread_yield();
                                buf_pool_mutex_enter();
                                n_checked = 0;
                        }

                        bpage = &block->page;
                        if (!bpage)
                                continue;

                        if (buf_page_get_state(bpage) != BUF_BLOCK_FILE_PAGE ||
                            buf_page_get_space(bpage) != id) {
                                /* Nothing can be done with this page. Per comments and
                                usage elsewhere, these can be read without locking
                                block_mutex. */
                                continue;
                        }

                        block_mutex = buf_page_get_mutex(bpage);
                        mutex_enter(block_mutex);

                        ut_ad(bpage->in_LRU_list);

                        if (buf_flush_ready_for_flush(bpage, BUF_FLUSH_LRU)) {
                                /* Schedule the dirty page to be written. When the write
                                is done it will be moved to the end of the LRU.
                                buf_flush_page releases the buffer pool and block mutex. */

                                buf_flush_page(bpage, BUF_FLUSH_LRU);
                                ++n_flushed;
                                buf_pool_mutex_enter();
                                continue;
                        }

                        /* This might release & relock the buffer pool and block mutex */
                        buf_LRU_free_block(bpage, TRUE);

                        mutex_exit(block_mutex);
                }
        }

        buf_pool_mutex_exit();

        if (n_flushed)
                os_aio_simulated_wake_handler_threads();
}
[23 Sep 2011 1:33] Inaam Rana
Mark,

Do you intend to run this function before the drop table call, and then also run the regular drop_hash and invalidate_tablespace during the drop?

About reading the state of the block: I believe it is OK to read it while holding buf_pool->mutex. Writing to the state field carries extra conditions.

Comments about the function:
* As I mentioned in my previous comment, iterating through chunk->blocks you'll miss all compressed only blocks i.e.: buf_page_t structs.
* How would you deal if a page is buffer fixed for whatever reason? In that case it is not considered ready for flush or replace, meaning neither buf_flush_page() nor buf_LRU_free_block() will be successful. Would you attempt a retry?
* Come to think of it, a table which is being dropped should not cause any writes, as we are going to delete the .ibd file. Imagine a very big table being dropped with significant dirty pages in the buffer pool. All the IO flushing caused by this is unnecessary. You may want to use buf_flush_remove() to just remove the page from the flush_list without actually writing it.
* I am not sure which code path in buf_flush_page() actually puts the block at the end of the LRU list.
* When you call buf_flush_page() typically the IO request is posted to the doublewrite buffer. The doublewrite buffer is not flushed. You have to call buf_flush_buffered_writes() for that.
[23 Sep 2011 15:44] Mark Callaghan
Inaam:

> As I mentioned in my previous comment, iterating through chunk->blocks you'll miss all
> compressed only blocks i.e.: buf_page_t structs.

It is OK to miss these for now. This function is a best-effort attempt to make DROP TABLE faster and we don't have any compressed tables in production yet. Eventually we need to fix the compressed table case but by then you might have a solution for that.

> How would you deal if a page is buffer fixed for whatever reason? In that case it is
> not considered ready for flush or replace, meaning neither buf_flush_page() nor
> buf_LRU_free_block() will be successful. Would you attempt a retry?

No retry will be done as this is a best-effort attempt.

> Come to think of it, a table which is being dropped should not cause any writes, as we
> are going to delete the .ibd file. Imagine a very big table being dropped with
> significant dirty pages in the buffer pool. All the IO flushing caused by this is
> unnecessary. You may want to use buf_flush_remove() to just remove the page from the
> flush_list without actually writing it.

If I called this function during DROP TABLE processing then it would be safe to call buf_flush_remove. For now this is called by the FLUSH TABLES command prior to the DROP TABLE, so I will write them back. Making them clean prior to DROP TABLE makes the drop processing faster, as a dirty page during DROP TABLE can force a rescan of the LRU.

> I am not sure which code path in buf_flush_page() actually puts the block at the end
> of the LRU list.

buf_page_io_complete -> buf_flush_write_complete -> buf_LRU_make_block_old

I replaced the call to os_aio_simulated_wake_handler_threads with a call to buf_flush_buffered_writes. That was definitely needed to avoid long lock waits. I also made this function behave more like an LRU flush function by using this at function start:

        while ((buf_pool->n_flush[BUF_FLUSH_LRU] > 0) ||
               (buf_pool->init_flush[BUF_FLUSH_LRU] == TRUE)) {

                buf_pool_mutex_exit();
                buf_flush_wait_batch_end(BUF_FLUSH_LRU);
                buf_pool_mutex_enter();
        }
        buf_pool->init_flush[BUF_FLUSH_LRU] = TRUE;

And this at function end:

        ut_ad(buf_pool->init_flush[BUF_FLUSH_LRU] == TRUE);
        buf_pool->init_flush[BUF_FLUSH_LRU] = FALSE;
        if (buf_pool->n_flush[BUF_FLUSH_LRU] == 0) {
                /* The running flush batch has ended */
                os_event_set(buf_pool->no_flush[BUF_FLUSH_LRU]);
        }

> When you call buf_flush_page() typically the IO request is posted to the doublewrite
> buffer. The doublewrite buffer is not flushed. You have to call
> buf_flush_buffered_writes() for that.
[28 Sep 2011 13:12] Inaam Rana
Mark,

Your changes look fine. I still have a question. You said:

"Making them clean prior to DROP TABLE makes the drop
processing faster as a dirty page during DROP TABLE can force a rescan of the LRU."

Looking at the 5.1 plugin code, I don't see where a rescan of the LRU is caused by the page being dirty. I can only see a rescan if the page is hashed, i.e. there is AHI built on the page. But that state doesn't depend on whether or not the page is dirty.
[13 Oct 2011 17:00] Mark Callaghan
Dirty pages don't force a rescan, but these do...

                } else if (buf_page_get_io_fix(bpage) != BUF_IO_NONE) {
                        /* We cannot remove this page during this scan
                        yet; maybe the system is currently reading it
                        in, or flushing the modifications to the file */

                        all_freed = FALSE;
                        goto next_page;
                } else {
                        block_mutex = buf_page_get_mutex(bpage);
                        mutex_enter(block_mutex);

                        if (bpage->buf_fix_count > 0) {

                                mutex_exit(block_mutex);
                                /* We cannot remove this page during
                                this scan yet; maybe the system is
                                currently reading it in, or flushing
                                the modifications to the file */

                                all_freed = FALSE;

                                goto next_page;
                        }
[14 Oct 2011 18:16] Mark Callaghan
Now I have timing data from a production server that does DROP TABLE while serving traffic and uses a 60G buffer pool. The numbers:
* buf_pool mutex locked for 0.78 seconds in buf_LRU_drop_page_hash_for_tablespace
* buf_pool mutex locked for 0.70 seconds in buf_LRU_invalidate_tablespace
* there were 338 pages found during the LRU scan

InnoDB: buf_LRU_invalidate_tablespace found 338 pages with rescan: (0,0,0) (phase1,all_freed,drop_hash) and locked seconds: (0.789741,0.701635) (phase1,phase2).

So, this locked the buffer pool mutex for ~1.5 seconds total. That is a long stall by our standards.
[19 Oct 2011 13:10] Inaam Rana
Mark,

Interesting observations. I wouldn't have expected a table that we are dropping to result in 338 rescans. Typically a table being dropped should not have any buffer fixed pages in the buffer pool. Nor should there be any pending reads. An educated guess would be that most of these rescans are caused by pages being flushed. Do you happen to have data on what specifically caused the rescan?
[19 Oct 2011 13:18] Mark Callaghan
You don't have the source for my cryptic debug comments, but in this case there were no rescans, just 338 pages found during the scan of the LRU. The important data from this is that it took 0.75 seconds to lock the buffer pool mutex and then scan the LRU -- and that is done twice.
[19 Oct 2011 15:00] Inaam Rana
Mark,

OK, then I misinterpreted your debug numbers. How big is the flush list? If it is smaller, we can use lazy eviction by just removing pages from the flush list (the same solution that I have for 5.6).
[19 Oct 2011 15:03] Mark Callaghan
The flush list is usually ~500,000 pages and the LRU ~3M pages
[29 Dec 2011 10:22] Domas Mituzas
Mark already noted above that partitioned tables walk this list multiple times, with edge cases taking minutes of big lock time.

Maybe that should be a separate bug? Fixing this one helps a bit, but rescanning multiple times is painful by itself.
[29 Dec 2011 15:51] Mark Callaghan
0.75 seconds of stall/lockup per partition x 100 partitions == 75 seconds of stall/lockup. Who can tolerate that on a production server? The workaround is to either not drop partitioned tables, not use partitioning, or do failovers per drop.
[29 Dec 2011 17:21] Inaam Rana
Mark, Domas,

The stalls are not a complete 'shut down of work'. In the fix we release the mutex after every 1K pages scanned, so it is not really as bad as it looks. I agree in principle that multiple scans should be avoided. In the case of partitioned tables that means making InnoDB aware that a group of tables are actually partitions of a single table; note that, as it exists today, InnoDB treats each partition as a separate table. Or maybe true tablespace management can provide us this functionality. But in any case, as Domas suggested, that goes beyond the scope of this bug.
[7 Feb 2012 23:44] John Russell
Added to changelog for 5.5.20, 5.6.4: 

This fix improved the efficiency and concurrency of freeing pages in
the InnoDB buffer pool when performing DROP TABLE for an InnoDB
table with the innodb_file_per_table option enabled.

This change is most noticeable for systems with large buffer pools. During the
drop operation, one traversal of the buffer pool memory structure is
changed from the LRU list (the entire buffer pool) to the flush list
(a much smaller structure). The LRU scanning is reduced, but not
entirely eliminated. The buffer pool mutex is also released
periodically, so that if the drop operation takes significant time,
other threads can proceed concurrently.
[7 Feb 2012 23:49] John Russell
I understand from the dev team that there is continuing discussion about improvements to AHI freeing, that other LRU scan, overhead from partitioning, etc. If there is more work to be done there, let's do that under a new bug.
[8 Feb 2012 0:50] Mark Callaghan
Inaam,

The buffer pool mutex is released after every 1024 pages are found that meet these conditions:
1) from the dropped tablespace
2) not compressed (buf_page_get_state() == BUF_BLOCK_FILE_PAGE)
3) not IO fixed

My guess is that DROP TABLE for a compressed table or for a table with few pages in the buffer pool will continue to suffer from stalls.
[8 Feb 2012 18:12] Inaam Rana
Mark,

I believe I explained it before and have stated that this fix does not deal with the LRU scan to drop AHI entries. That is something we are working on right now and will fix as and when we are sure about the correctness.

Just to reiterate: buf_LRU_drop_page_hash_for_tablespace() remains untouched by this fix.

What we have tried to fix is the eviction of pages from the LRU list and the scan for it, i.e. now we call buf_LRU_remove_dirty_pages_for_tablespace().

You said:

>> The buffer pool mutex is released after every 1024 pages are found that
>> meet these conditions:
>> 1) from the dropped tablespace
>> 2) not compressed (buf_page_get_state() == BUF_BLOCK_FILE_PAGE)
>> 3) not IO fixed
>>
>> My guess is that DROP TABLE for a compressed table or for a table with few
>> pages in the buffer pool will continue to suffer from stalls.

Actually, in the case of buf_LRU_remove_dirty_pages_for_tablespace() the above is
not correct. If you look at the code:

* We release buf_pool::mutex every 1024 pages where:
  * The page can belong to any tablespace
  * The page can be buf_fixed or IO fixed

Note that how we deal with a page is different from whether or not we count
it as an iteration. We count all pages traversed as iterations:

                /* Every BUF_LRU_DROP_SEARCH_SIZE iterations in the
                loop we release buf_pool->mutex to let other threads
                do their job. */
                if (i < BUF_LRU_DROP_SEARCH_SIZE) {
                        continue;
                }

I believe either I am missing something obvious or we are talking about two
different things. Please let me know if you have any questions.
[8 Feb 2012 18:20] Mark Callaghan
My comments are about buf_LRU_drop_page_hash_for_tablespace. You have not fixed the problem until that code is changed. You cited code from buf_LRU_remove_dirty_pages_for_tablespace. That isn't the problem in 5.5 and trunk.

We are talking about two different functions.
[8 Feb 2012 18:33] Inaam Rana
Mark,

I have never said that we have fully fixed the problem. buf_LRU_remove_dirty_pages_for_tablespace() went in as a partial fix for this problem. Previously we had buf_LRU_invalidate_tablespace(), which incurred multiple scans of the LRU. I'd say we have partially solved the problem, i.e. replaced the LRU scan with a staggered flush_list scan by introducing buf_LRU_remove_dirty_pages_for_tablespace().

About buf_LRU_drop_page_hash_for_tablespace(): I agree that we have not made any changes to it, and I agree that the rest of the problem exists there. Based on some very valuable input from you we are trying to get rid of this scan as well, but that has to go through the process of internal testing and review. Not pushed yet.
[9 Feb 2012 19:36] Inaam Rana
Mark,

FYI: I opened bug#64284 to resolve the LRU scan for AHI issue.