Bug #40603 Innodb background IO rate limiting kills performance
Submitted: 9 Nov 2008 18:22
Modified: 12 Aug 2009 17:01
Reporter: Mark Callaghan
Status: Closed
Category: MySQL Server: InnoDB storage engine
Severity: S4 (Feature request)
Version: 5.0, 5.1
OS: Any
Assigned to: Inaam Rana
CPU Architecture: Any
Tags: Contribution, innodb, io, limit, rate

[9 Nov 2008 18:22] Mark Callaghan
Description:
InnoDB limits the background IO thread to 100 writes per second. This is the thread that flushes most dirty pages from the buffer cache. The limit is 100 regardless of the server IO capacity. The limit of 100 is good for servers with 1 disk. It is horrible otherwise.
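
To make the cost of the fixed limit concrete, here is a minimal sketch of the arithmetic. It assumes one flush batch per second, as described above. The names `flush_batch_limit`, `seconds_to_flush` and the `io_capacity` parameter are hypothetical, chosen for illustration only; they are not taken from the InnoDB source or from the Google patch.

```c
#include <assert.h>

/* Fixed per-second write limit described in this report. */
#define FIXED_FLUSH_LIMIT 100UL

/* Hypothetical helper: pages the background thread may write per second.
   io_capacity is an assumed, administrator-supplied estimate of the
   server's write IOPS; 0 means "not configured", which falls back to
   the fixed limit. */
unsigned long flush_batch_limit(unsigned long io_capacity)
{
    return io_capacity == 0 ? FIXED_FLUSH_LIMIT : io_capacity;
}

/* Seconds needed to write n_dirty pages at `limit` pages per second,
   rounded up. */
unsigned long seconds_to_flush(unsigned long n_dirty, unsigned long limit)
{
    assert(limit > 0);
    return (n_dirty + limit - 1) / limit;
}
```

Under these assumptions, 100,000 dirty pages take about 1,000 seconds to flush at the fixed limit, but about 100 seconds if the limit is raised to 1,000 writes per second on a server whose disks can sustain that rate.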

How to repeat:
Run InnoDB on a server with many disks.

Suggested fix:
Use the Google patch -- http://bazaar.launchpad.net/%7Emdcallag/mysql-patch/5.0-map/revision/{2685,2686}
[10 Nov 2008 15:48] Heikki Tuuri
InnoDB does not really sleep 1 second between buffer pool flushes. It sets skip_sleep = TRUE.

But the main thread does a lot of things besides doing the buffer pool flushes. Ideally, we should have several main threads, and a way to tune the resources we allocate to the insert buffer merge, etc.

AIO will speed up flushes, but it introduces another problem: the main thread may exhaust the AIO queue by putting too many writes to it.

Assigning this feature request to Inaam, who is our AIO man.

srv0srv.c in 5.1:

        /* ---- We run the following loop approximately once per second
        when there is database activity */

        skip_sleep = FALSE;

        for (i = 0; i < 10; i++) {
                n_ios_old = log_sys->n_log_ios + buf_pool->n_pages_read
                        + buf_pool->n_pages_written;
                srv_main_thread_op_info = "sleeping";

                if (!skip_sleep) {

                        os_thread_sleep(1000000);
                }

                skip_sleep = FALSE;
...

                if (UNIV_UNLIKELY(buf_get_modified_ratio_pct()
                                  > srv_max_buf_pool_modified_pct)) {

                        /* Try to keep the number of modified pages in the
                        buffer pool under the limit wished by the user */

                        n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 100,
                                                          ut_dulint_max);

                        /* If we had to do the flush, it may have taken
                        even more than 1 second, and also, there may be more
                        to flush. Do not sleep 1 second during the next
                        iteration of this loop. */

                        skip_sleep = TRUE;
                }

                if (srv_activity_count == old_activity_count) {

                        /* There is no user activity at the moment, go to
                        the background loop */

                        goto background_loop;
                }
        }

        /* ---- We perform the following code approximately once per
        10 seconds when there is database activity */

#ifdef MEM_PERIODIC_CHECK
        /* Check magic numbers of every allocated mem block once in 10
        seconds */
        mem_validate_all_blocks();
#endif
        /* If there were less than 200 i/os during the 10 second period,
        we assume that there is free disk i/o capacity available, and it
        makes sense to flush 100 pages. */

        n_pend_ios = buf_get_n_pending_ios() + log_sys->n_pending_writes;
        n_ios = log_sys->n_log_ios + buf_pool->n_pages_read
                + buf_pool->n_pages_written;
        if (n_pend_ios < 3 && (n_ios - n_ios_very_old < 200)) {

                srv_main_thread_op_info = "flushing buffer pool pages";
                buf_flush_batch(BUF_FLUSH_LIST, 100, ut_dulint_max);

                srv_main_thread_op_info = "flushing log";
                log_buffer_flush_to_disk();
        }

...

        /* Flush a few oldest pages to make a new checkpoint younger */

        if (buf_get_modified_ratio_pct() > 70) {

                /* If there are lots of modified pages in the buffer pool
                (> 70 %), we assume we can afford reserving the disk(s) for
                the time it requires to flush 100 pages */

                n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 100,
                                                  ut_dulint_max);
        } else {
                /* Otherwise, we only flush a small number of pages so that
                we do not unnecessarily use much disk i/o capacity from
                other work */

                n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 10,
                                                  ut_dulint_max);
        }

        srv_main_thread_op_info = "making checkpoint";

        /* Make a new checkpoint about once in 10 seconds */

        log_checkpoint(TRUE, FALSE);

        srv_main_thread_op_info = "reserving kernel mutex";

        mutex_enter(&kernel_mutex);

        /* ---- When there is database activity, we jump from here back to
        the start of loop */

        if (srv_activity_count != old_activity_count) {
                mutex_exit(&kernel_mutex);
                goto loop;
        }
[6 Jul 2009 23:24] Mark Callaghan
Heikki,

For many workloads I will agree with you -- it skips sleep. And that creates a different problem. It is difficult to understand the rate at which IO occurs in that case. It can call fsync() and do other things much more than expected. To tune the server, I prefer a system that is more predictable, so that when I configure the server to do 1000 IOPS from the background threads, the server does no more than that.
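
One way to get the predictability described above is a per-second I/O budget that background threads must draw from before issuing writes. The following is only a sketch under that assumption: `io_budget_t`, `io_budget_init` and `io_budget_acquire` are hypothetical names, not InnoDB functions, and the caller is assumed to pass in the current time in whole seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-second I/O budget for background work. */
typedef struct {
    unsigned long max_iops;     /* configured ceiling, e.g. 1000 */
    unsigned long window_start; /* second in which the current window began */
    unsigned long used;         /* I/Os already granted in this window */
} io_budget_t;

void io_budget_init(io_budget_t *b, unsigned long max_iops,
                    unsigned long now_sec)
{
    assert(b != NULL);
    b->max_iops = max_iops;
    b->window_start = now_sec;
    b->used = 0;
}

/* Request up to `wanted` I/Os at time now_sec. Returns how many are
   granted; the total granted within any one-second window never
   exceeds max_iops. */
unsigned long io_budget_acquire(io_budget_t *b, unsigned long wanted,
                                unsigned long now_sec)
{
    unsigned long remaining;

    assert(b != NULL);

    if (now_sec != b->window_start) {
        /* A new one-second window has started: reset the budget. */
        b->window_start = now_sec;
        b->used = 0;
    }

    remaining = b->max_iops - b->used;
    if (wanted > remaining) {
        wanted = remaining;
    }
    b->used += wanted;
    return wanted;
}
```

With a ceiling of 1000, a caller that asks for 100 and then 950 I/Os in the same second is granted 100 and then 900; further requests in that second are granted nothing until the next window begins.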
[12 Aug 2009 17:01] Inaam Rana
Fixed in plugin 1.0.4.

Documentation and source are available at www.innodb.com