MySQL Bugs: #98974: InnoDB temp table could hurt InnoDB perf badly

Bug #98974	InnoDB temp table could hurt InnoDB perf badly
Submitted:	17 Mar 2020 14:30	Modified:	23 Apr 2020 17:43
Reporter:	Fungo Wang (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0.19, 5.7.29	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	checkpoint, flush_list_mutex, page cleaner, temp

Description:
Check the logic in buf_pool_get_oldest_modification_approx()

```
370  for (ulint i = 0; i < srv_buf_pool_instances; i++) {
371    buf_pool_t *buf_pool;
372
373    buf_pool = buf_pool_from_array(i);
374
375    buf_flush_list_mutex_enter(buf_pool);
376
377    buf_page_t *bpage;
378
379    /* We don't let log-checkpoint halt because pages from system
380    temporary are not yet flushed to the disk. Anyway, object
381    residing in system temporary doesn't generate REDO logging. */
382    for (bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
383         bpage != NULL && fsp_is_system_temporary(bpage->id.space());
384         bpage = UT_LIST_GET_PREV(list, bpage)) {
385      /* Do nothing. */
386    }
```

All the temporary table pages are skipped, and continue to find the the next one, at the same time the flush_list_mutex is kept held.

It's really bad to hold flush_list_mutex for a long time, as it'll block mtr_commit().
So when there are a lot of temp table buffer pages at the tail of FLUSH list, InnoDB become very inefficient.

How to repeat:
To show how bad it is, you need a big buffer pool, then create a bunch of temp InnoDB tables and fill the temp tables with lots of data.

In my case:

1. innodb_buffer_pool_size = 256G
2. Create and fill about 50G temp table
   $ du -sh  \#innodb_temp/
   50G     #innodb_temp/

   and keep them in BP as long as possible, I set a very low io_capacity to help this 
   | innodb_io_capacity     | 100   |
   | innodb_io_capacity_max | 100   |

   There will be about 3300000 dirty pages in the flush list, and hopefully most of them are belong to temp tablesapces. 

3. Then run a sysbench oltp_rw workload, you will see how bad the number is
Below is a snippet my sysbench output, the QPS fluctuate badly and could even drop to zero!!!

[ 68s ] thds: 126 tps: 368.04 qps: 7380.71 (r/w/o: 5176.50/1467.14/737.07) lat (ms,95%): 2539.17 err/s: 0.00 reconn/s: 0.00
[ 69s ] thds: 126 tps: 1164.83 qps: 23242.51 (r/w/o: 16281.56/4620.31/2340.65) lat (ms,95%): 707.07 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 126 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 71s ] thds: 126 tps: 1401.85 qps: 27872.09 (r/w/o: 19484.97/5573.42/2813.71) lat (ms,95%): 1803.47 err/s: 0.00 reconn/s: 0.00
[ 72s ] thds: 126 tps: 898.98 qps: 18049.58 (r/w/o: 12663.71/3574.92/1810.96) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 73s ] thds: 126 tps: 327.98 qps: 6166.70 (r/w/o: 4141.80/1370.93/653.97) lat (ms,95%): 1973.38 err/s: 0.00 reconn/s: 0.00
[ 74s ] thds: 126 tps: 691.10 qps: 14296.08 (r/w/o: 10174.48/2730.40/1391.20) lat (ms,95%): 272.27 err/s: 0.00 reconn/s: 0.00
[ 75s ] thds: 126 tps: 538.03 qps: 10719.56 (r/w/o: 7507.39/2127.11/1085.06) lat (ms,95%): 995.51 err/s: 0.00 reconn/s: 0.00
[ 76s ] thds: 126 tps: 841.99 qps: 16891.82 (r/w/o: 11828.87/3370.96/1691.98) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00
[ 77s ] thds: 126 tps: 867.96 qps: 17338.15 (r/w/o: 12160.40/3431.83/1745.91) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 78s ] thds: 126 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 79s ] thds: 126 tps: 203.00 qps: 4042.92 (r/w/o: 2806.95/826.98/408.99) lat (ms,95%): 2198.52 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 126 tps: 123.99 qps: 2436.88 (r/w/o: 1673.91/512.97/249.99) lat (ms,95%): 3448.53 err/s: 0.00 reconn/s: 0.00
[ 81s ] thds: 126 tps: 569.01 qps: 11504.23 (r/w/o: 8086.16/2269.05/1149.02) lat (ms,95%): 3706.08 err/s: 0.00 reconn/s: 0.00
[ 82s ] thds: 126 tps: 115.99 qps: 2274.86 (r/w/o: 1592.90/449.97/231.99) lat (ms,95%): 995.51 err/s: 0.00 reconn/s: 0.00

Get pstack and perf during sysbench running

The pstack is as below, I mainly show the blocking and blocked stacks.

20 pthread_cond_wait,wait(os0event.cc:189),os_event::wait_low(os0event.cc:189),os_event_wait_low(os0event.cc:608),sync_array_wait_event(sync0arr.cc:444),TTASEventMutex::wait(ut0mutex.ic:89),TTASEventMutex::spin_and_try_lock(ib0mutex.
h:662),enter(ib0mutex.h:556),enter(ib0mutex.h:556),buf_flush_insert_into_flush_list(ib0mutex.h:556),buf_flush_note_modification(buf0flu.ic:104),add_dirty_page_to_flush_list(buf0flu.ic:104),operator()(buf0flu.ic:104),operator()(buf0flu.ic:
104),for_each_block_in_reverse<Iterate<Add_dirty_blocks_to_flush_list>(buf0flu.ic:104),mtr_t::Command::add_dirty_blocks_to_flush_list(buf0flu.ic:104),mtr_t::Command::execute(mtr0mtr.cc:655),mtr_t::commit(mtr0mtr.cc:485),row_upd_del_mark_c
lust_rec(row0upd.cc:2963),row_upd_clust_step(row0upd.cc:2963),row_upd(row0upd.cc:3172),row_upd_step(row0upd.cc:3172),row_update_for_mysql_using_upd_graph(row0mysql.cc:2348),row_update_for_mysql(row0mysql.cc:2444),ha_innobase::delete_row(h
a_innodb.cc:9359),handler::ha_delete_row(handler.cc:7915),Sql_cmd_delete::delete_from_single_table(sql_delete.cc:528),Sql_cmd_dml::execute(sql_select.cc:715),mysql_execute_command(sql_parse.cc:3469),Prepared_statement::execute(sql_prepare
.cc:3232),Prepared_statement::execute_loop(sql_prepare.cc:2806),mysqld_stmt_execute(sql_prepare.cc:1928),dispatch_command(sql_parse.cc:1684),do_command(sql_parse.cc:1275),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_t
hread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)

1 buf_pool_get_oldest_modification_approx(libc.so.6),page_cleaner_flush_pages_recommendation(buf0flu.cc:2493),buf_flush_page_coordinator_thread(buf0flu.cc:3149),void(buf0flu.cc:3149),__invoke<void(buf0flu.cc:3149),__call<void,(buf0flu.cc:3149),operator()<>(buf0flu.cc:3149),operator()<void(buf0flu.cc:3149),__invoke_impl<void,(buf0flu.cc:3149),__invoke<Runnable,(buf0flu.cc:3149),_M_invoke<0,(buf0flu.cc:3149),operator()(buf0flu.cc:3149),std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable,(buf0flu.cc:3149),execute_native_thread_routine,start_thread(libpthread.so.0),clone(libc.so.6)

There are mainly 2 stacks that invoke buf_pool_get_oldest_modification_approx() regularly, one is checkpoint, the other is page cleaner. Below is the perf result

-   31.87%  mysqld   mysqld                     [.] buf_pool_get_oldest_modification_approx
   - buf_pool_get_oldest_modification_approx
      - 73.05% buf_pool_get_oldest_modification_lwm
           log_update_available_for_checkpoint_lsn
           log_checkpointer
           std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)(log_t*), log_t*> > >::_M_run
           execute_native_thread_routine
      - 26.95% page_cleaner_flush_pages_recommendation
           buf_flush_page_coordinator_thread
           std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)(unsigned long), unsigned long> > >::_M_run
           execute_native_thread_routine

  0.45 │128:   mov    0x40(%r14),%r14
       │              bpage != NULL && fsp_is_system_temporary(bpage->id.space());
  6.08 │       test   %r14,%r14
       │     ↓ je     145
  0.17 │131:   mov    (%r14),%edi
 93.26 │     → callq  fsp_is_system_temporary(unsigned int)
  0.05 │       test   %al,%al
       │     ↑ jne    128        
       

fsp_is_system_temporary() use most of the cpu cycles.

Suggested fix:
Maybe we should manage temp tablespace pages separately in buffer pool, and don't mix them with normal pages ...

Hi Mr. Wang,

Thank you for your bug report.

However, this is clearly not a bug. Because, what if there are none temporary tables ??? Then there is no additional time taken to hold that mutex.

At best, this is a feature request. But, we would need a well defined feature request. Your idea of having pages of temporary tables held separately is not good enough. Simply, buffer pool pages are allocated based on the LRU algorithm. 

The only solution to the problem is the introduction of a new variable, which would define number of instances dedicated to temporary tables, with a minimum being one instance.

If you have any better idea for this feature request, let us know.

Hi Sinisa,

> However, this is clearly not a bug. Because, what if there are none temporary tables ??? Then there is no additional time taken to hold that mutex.

This bug is dedicated to InnoDB temp table. 

You can not stop customers from using temp table, as long as large temp table is used, the bug could be triggered.

So I don't agree with you * this is clearly not a bug*, it is. :(

> At best, this is a feature request. But, we would need a well defined feature request. Your idea of having pages of temporary tables held separately is not good enough. Simply, buffer pool pages are allocated based on the LRU algorithm. 

I have reported several bugs about InnoDB tmp table, and I also find some other bugs related to temptable recently (I'll report later). Filing feature request to improve temptable is in my plan, not yet in this bug report.

> The only solution to the problem is the introduction of a new variable, which would define number of instances dedicated to temporary tables, with a minimum being one instance.

The idea is to manage temptable pages separately, how to implement this deserve a detail design. I don't think this is *the only solution*, to create a new temp table buffer pool...

We could, such as, maintain a dedicated LRU_tmp and FLUSH_list_tmp in the original buf pool instance.

Hi,

This is definitely a feature request.

However, in order to verify it, we need this feature request to be well defined. We do not need design, but a precise definition of the new feature. We do not think that you have yet provided one. However, we would be happy if you would provide us with one.

Thanks in advance.

Hi Sinisa,

I still think this report is perf bug about InnoDB temp table, and sincerely suggest you reconsider it.

About the feature request to handle temp tablespace pages separately, I'll file a new one later.

Thank you.

Hi,

I do not think that this report can be verified as a performance bug, simply because the behaviour conforms to the current design. Hence, we could consider only a feature request, for which you are filing a separate report.

Sinisa, this actually represents a set of problems we were experiencing online involving temp table. Another serious one is https://bugs.mysql.com/bug.php?id=98869. In both cases, the impact on buffer pool lru or flush_list from temp table pages has caused serious and hard to understand jitter problem on the system. We are asserting these are bugs because they have cause real pain to MySQL customers as well as DBAs to figure out the cause of these performance jitter. And in the 98869 case, we have to fall back to MyISAM temp table (we were still on 8.0.13), otherwise the system fluctuate so much, users won’t be able to tolerate .  So these are real usable problems, not feature.
And possible solution is to separate lru or flush list handling for temp tablespace pages, so it won’t affect user tables, or at least check the loop while holding flush list mutex in uf_pool_get_oldest_modification_approx()

Hi Jimmy,

A bug on the performance of the new temporary tables already exists and is verified some time ago. And it was reported by the company in which you are employed.

Regarding this bug, reporter has promised a properly defined feature request. When that one is verified, it will cover all necessary remaining issues.

Hi,

The bug #98869, if it's the bug you mentioned already exists, affects only 8.0, it's a regression perf bug for 8.0.

> "I believe this regression was introuced by WL#11613, which aimed to reclaim disk space used by temporary table."
As I already pointed out in that bug report.

But this one also affects 5.7 if you check the func `buf_pool_get_oldest_modification()` in 5.7

```
410lsn_t
411buf_pool_get_oldest_modification(void)
412/*==================================*/
413{
414        lsn_t           lsn = 0;
415        lsn_t           oldest_lsn = 0;
416
417        /* When we traverse all the flush lists we don't want another
418        thread to add a dirty page to any flush list. */
419        log_flush_order_mutex_enter();
420
421        for (ulint i = 0; i < srv_buf_pool_instances; i++) {
422                buf_pool_t*     buf_pool;
423
424                buf_pool = buf_pool_from_array(i);
425
426                buf_flush_list_mutex_enter(buf_pool);
427
428                buf_page_t*     bpage;
429
430                /* We don't let log-checkpoint halt because pages from system
431                temporary are not yet flushed to the disk. Anyway, object
432                residing in system temporary doesn't generate REDO logging. */
433                for (bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
434                     bpage != NULL
435                        && fsp_is_system_temporary(bpage->id.space());
436                     bpage = UT_LIST_GET_PREV(list, bpage)) {
437                        /* Do nothing. */
438                }
```

Also they are different bugs from the customer perspective, as they triggered by different usage/scenario.

From the developer perspective, they are also different code paths. 
This bug shows another view of the defect of InnoDB temp table. Which make the improvement/enhancement more urgent...

So I don't know why this bug did not get verified as a perf bug.

HI Mr. Wang,

The measurements that you have provided are for 8.0, not 5.7.

If you would come with a  benchmark for 5.7 with a heavy performance impact, which we could repeat, then this could be a performance bug that can be verified.

Hi,

1. I think the code logic of buf_pool_get_oldest_modification() in 5.7, which I have posted, shows the issue apparently, the loop logic about temp table space (which is the culprit) is same with 8.0, you can do analysis.

2. The repro steps I provided also applies to 5.7, you can try to repro this in 5.7, and 8.0.

3. It is a perf bug in 8.0 anyway, no matter whether it applies to 5.7 or not.

And I still don't know why it did not get verified at the first place :(

I have just ran the repro steps, in case you won't bother to do it :)

Using the same repro steps, and this is the result for 5.7.29.

1. innodb_buffer_pool_size = 256G
2. Create and fill about 50G temp table
  with configuration
   | innodb_io_capacity     | 100   |
   | innodb_io_capacity_max | 100   |

3. Then run a sysbench oltp_rw workload   

[ 50s ] thds: 126 tps: 13217.53 qps: 263457.72 (r/w/o: 184624.48/52399.18/26434.06) lat (ms,95%): 16.12 err/s: 0.00 reconn/s: 0.00
[ 51s ] thds: 126 tps: 11551.90 qps: 231128.97 (r/w/o: 161688.58/46338.59/23101.80) lat (ms,95%): 16.41 err/s: 0.00 reconn/s: 0.00
[ 52s ] thds: 126 tps: 8823.84 qps: 176627.81 (r/w/o: 123727.76/35250.36/17649.68) lat (ms,95%): 17.01 err/s: 0.00 reconn/s: 0.00
[ 53s ] thds: 126 tps: 12007.16 qps: 240238.28 (r/w/o: 168118.30/48106.66/24013.33) lat (ms,95%): 15.83 err/s: 0.00 reconn/s: 0.00
[ 54s ] thds: 126 tps: 11842.77 qps: 237035.38 (r/w/o: 165890.77/47457.08/23687.54) lat (ms,95%): 16.12 err/s: 0.00 reconn/s: 0.00
[ 55s ] thds: 126 tps: 10420.60 qps: 208067.03 (r/w/o: 145663.42/41565.41/20838.20) lat (ms,95%): 17.63 err/s: 0.00 reconn/s: 0.00
[ 56s ] thds: 126 tps: 13342.07 qps: 266797.34 (r/w/o: 186852.94/53257.27/26687.13) lat (ms,95%): 15.55 err/s: 0.00 reconn/s: 0.00
[ 57s ] thds: 126 tps: 13366.68 qps: 267270.60 (r/w/o: 187004.52/53533.72/26732.36) lat (ms,95%): 15.55 err/s: 0.00 reconn/s: 0.00
[ 58s ] thds: 126 tps: 10809.26 qps: 215986.19 (r/w/o: 151211.63/43159.04/21615.52) lat (ms,95%): 16.41 err/s: 0.00 reconn/s: 0.00
[ 59s ] thds: 126 tps: 8389.98 qps: 168364.67 (r/w/o: 117820.77/33760.93/16782.97) lat (ms,95%): 17.32 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 126 tps: 10673.95 qps: 213192.02 (r/w/o: 149322.31/42521.80/21347.90) lat (ms,95%): 16.12 err/s: 0.00 reconn/s: 0.00
[ 61s ] thds: 126 tps: 11027.06 qps: 220392.16 (r/w/o: 154221.81/44116.23/22054.12) lat (ms,95%): 17.01 err/s: 0.00 reconn/s: 0.00

You can see the fluctuation of QPS numbers.

The reason it's much better than 8.0 is buf_pool_get_oldest_modification() is called at a much low frequency than 8.0 (which means another regression point in 8.0).

the perf result during run above sysbench:

-    0.66%  mysqld   mysqld               [.] buf_pool_get_oldest_modification
   - buf_pool_get_oldest_modification
      - 88.86% buf_flush_page_cleaner_coordinator
           start_thread
      - 11.14% log_checkpoint
           srv_master_thread
           start_thread

This is definitely known issue. Whatever you call it 'bug' or 'feature request'.
"not a bug" is not acceptable.

Dedicated temporary buffer pool instance -> not good idea. All temp tables will use same instance and there will be contention

Dedicated temporary buffer pools -> If user has large spare memory, good idea. Easy to implement.

Dedicated temporary buffer pools with auto-resize. Start with lower size and resize based on needs. Downsizing algorithm could be little complex.. May be

Separate lists for temp pages in each buffer pool -> Best interms of memory usage but complex solution.

To simulate the frequent calling of buf_pool_get_oldest_modification() behaviour, which is triggered by log_checkpoint background thread in 8.0.

I create a sysbench lua script as below, to call "SHOW ENGINE INNODB STATUS" repeatedly, which will call buf_pool_get_oldest_modification().

```
cat show_engine.lua
#!/usr/bin/env sysbench
require("oltp_common")
function prepare_statements()
end
function event()
  con:query("SHOW ENGINE INNODB STATUS")
end
```

As the same time run oltp_rw workload, also run this show_engine script with 1 thread.

[ 25s ] thds: 126 tps: 167.00 qps: 2908.02 (r/w/o: 2330.02/244.00/334.00) lat (ms,95%): 2680.11 err/s: 0.00 reconn/s: 0.00
[ 26s ] thds: 126 tps: 24.00 qps: 704.00 (r/w/o: 344.00/312.00/48.00) lat (ms,95%): 1618.78 err/s: 0.00 reconn/s: 0.00
[ 27s ] thds: 126 tps: 37.00 qps: 818.00 (r/w/o: 518.00/226.00/74.00) lat (ms,95%): 2449.36 err/s: 0.00 reconn/s: 0.00
[ 28s ] thds: 126 tps: 65.00 qps: 1125.00 (r/w/o: 862.00/133.00/130.00) lat (ms,95%): 3639.94 err/s: 0.00 reconn/s: 0.00
[ 29s ] thds: 126 tps: 86.00 qps: 1780.01 (r/w/o: 1243.01/365.00/172.00) lat (ms,95%): 4128.91 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 126 tps: 10.00 qps: 397.00 (r/w/o: 149.00/228.00/20.00) lat (ms,95%): 2045.74 err/s: 0.00 reconn/s: 0.00
[ 31s ] thds: 126 tps: 149.00 qps: 2554.01 (r/w/o: 2049.01/207.00/298.00) lat (ms,95%): 2632.28 err/s: 0.00 reconn/s: 0.00
[ 32s ] thds: 126 tps: 5.00 qps: 399.00 (r/w/o: 106.00/283.00/10.00) lat (ms,95%): 787.74 err/s: 0.00 reconn/s: 0.00
[ 33s ] thds: 126 tps: 0.00 qps: 150.00 (r/w/o: 1.00/149.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 34s ] thds: 126 tps: 145.00 qps: 2187.00 (r/w/o: 1793.00/104.00/290.00) lat (ms,95%): 3448.53 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 126 tps: 28.00 qps: 1114.99 (r/w/o: 628.99/430.00/56.00) lat (ms,95%): 204.11 err/s: 0.00 reconn/s: 0.00
[ 36s ] thds: 126 tps: 121.00 qps: 2493.01 (r/w/o: 1685.00/566.00/242.00) lat (ms,95%): 1235.62 err/s: 0.00 reconn/s: 0.00

Now the numbers are pretty close to 8.0.

The perf result:

-   71.61%  mysqld   mysqld               [.] buf_pool_get_oldest_modification
   - buf_pool_get_oldest_modification
      - 80.37% log_print
           srv_printf_innodb_monitor
           innobase_show_status
           ha_show_status
           mysql_execute_command
           mysql_parse
           dispatch_command
           do_command
           handle_connection
           pfs_spawn_thread
           start_thread
      - 17.74% buf_flush_page_cleaner_coordinator
           start_thread
      - 1.88% log_checkpoint
           srv_master_thread
           start_thread

Another idea if user don't care so much about rollback, May be expose Temptable engine to users.(It is used for internal optimizer tables already)

CREATE TEMPORARY TABLE t1(a INT) ENGINE=TEMPTABLE?

This is a known issue. It will be fixed in due time.

Hi Mr. Wang,

I have run your LUA script on latest 5.7 and got similar QPS variation.

Verified as a perforance issue.

I filed bug #99020 for the feature request.

Posted by developer:
 
Fixed as of the upcoming 8.0.22 release, and here's the proposed changelog entry from the documentation team:

Numerous system temporary table pages at the tail of the buffer pool
flush list caused a performance degradation. The flush_list_mutex was held
while the flush list scan traversed over system temporary table pages. The
flush list scan now excludes system temporary table pages.

Thank you, Daniel.

So this issue doesn't affect 5.7?

Posted by developer:
 
Fixed as of the upcoming 5.7.35, 8.0.22 releases, and here's the proposed changelog
entry from the documentation team:

Numerous system temporary table pages at the tail of the buffer pool
flush list caused a performance degradation. The flush_list_mutex was held
while the flush list scan traversed over system temporary table pages. The
flush list scan now excludes system temporary table pages.

Thank you, Daniel ......