Bug #98869 Temp ibt tablespace truncation at disconnection stuck InnoDB under large BP
Submitted: 8 Mar 2020 15:46  Modified: 25 Nov 2020 18:25
Reporter: Fungo Wang (OCA)
Status: Closed
Category: MySQL Server: InnoDB storage engine  Severity: S5 (Performance)
Version: 8.0.19  OS: Any
CPU Architecture: Any
Tags: ibt, intrinsic, LRU_list_mutex, truncate

[8 Mar 2020 15:46] Fungo Wang
Description:
This is a deja-vu of bug #51325: when a thread-local intrinsic InnoDB tablespace is truncated at disconnection, the full LRU list scan makes InnoDB choke.

How to repeat:
To repeat this, you need a large BP (the larger, the more serious). In my case, I used innodb_buffer_pool_size = 256G and loaded about 180G of data into the BP, i.e. about 12,000,000 16K pages.

Run sysbench OLTP_RW workload with 512 threads.

Then open another session, just executing "select count(*) from information_schema.processlist" and quit.

```
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select count(*) from information_schema.processlist;
+----------+
| count(*) |
+----------+
|      513 |
+----------+
1 row in set (0.00 sec)

mysql> quit;
```

The sysbench QPS drops sharply to zero and stays there for a few seconds.

```
[ 31s ] thds: 512 tps: 10533.92 qps: 210736.41 (r/w/o: 147526.88/11168.86/52040.67) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 32s ] thds: 512 tps: 10596.63 qps: 211878.56 (r/w/o: 148200.78/11245.67/52432.11) lat (ms,95%): 87.56 err/s: 0.00 reconn/s: 0.00
[ 33s ] thds: 512 tps: 10586.07 qps: 212040.44 (r/w/o: 148592.02/11199.13/52249.28) lat (ms,95%): 94.10 err/s: 0.00 reconn/s: 0.00
[ 34s ] thds: 512 tps: 3144.68 qps: 66512.31 (r/w/o: 46491.00/3329.72/16691.59) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 512 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 36s ] thds: 512 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 37s ] thds: 512 tps: 2408.44 qps: 44397.71 (r/w/o: 31100.79/2567.40/10729.51) lat (ms,95%): 3639.94 err/s: 0.00 reconn/s: 0.00
[ 38s ] thds: 512 tps: 10578.25 qps: 211348.99 (r/w/o: 147956.49/11250.33/52142.16) lat (ms,95%): 87.56 err/s: 0.00 reconn/s: 0.00
[ 39s ] thds: 512 tps: 10562.65 qps: 211394.07 (r/w/o: 147829.14/11269.70/52295.23) lat (ms,95%): 89.16 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 512 tps: 10520.20 qps: 210493.02 (r/w/o: 147494.79/11104.10/51894.13) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 41s ] thds: 512 tps: 10431.86 qps: 208579.15 (r/w/o: 145978.01/11045.79/51555.35) lat (ms,95%): 90.78 err/s: 0.00 reconn/s: 0.00
```

Note: If you use MySQL 8.0 before 8.0.16, make sure on-disk internal temporary tables use InnoDB by configuring "internal_tmp_disk_storage_engine = InnoDB".

Suggested fix:
I believe this regression was introduced by WL#11613, which aimed to reclaim disk space used by temporary tables.
It introduced a pre-allocated ibt tablespace pool; each session can use at most two ibt tablespaces, one for "External Temporary Tables" and the other for "Intrinsic Temporary Tables". Once assigned, the ibt tablespaces are local to the session/connection, and are cleaned up at disconnection through tablespace truncation.

The culprit is the way tablespace truncation is carried out: it iterates the LRU list twice, the 1st time to drop the AHI entries for the current tablespace, and the 2nd time to evict all of its pages from the LRU list. In both iterations the LRU_list_mutex is held for a long time (though it may be released/re-acquired temporarily).

During the LRU list scan, any other thread that needs to adjust the LRU list (make a page young, load new pages from disk, etc.) gets blocked.
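The shape of the problem can be sketched with a toy model (this is not the real InnoDB code; the function names in the comments only point at their real counterparts in the stacks below):

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>
#include <mutex>

// Toy model of the truncation path described above: both passes walk the
// ENTIRE LRU list under the single LRU_list_mutex, even though only a
// handful of pages belong to the small session temporary tablespace.
struct Page {
  uint32_t space_id;
  bool has_ahi_entry;
};

struct BufferPool {
  std::mutex lru_list_mutex;  // stand-in for buf_pool->LRU_list_mutex
  std::list<Page> lru;        // stand-in for the buffer pool LRU list

  // Pass 1 models buf_LRU_drop_page_hash_for_tablespace(); pass 2 models
  // buf_LRU_remove_all_pages(). Returns the number of LRU nodes visited,
  // to make the 2 * O(|LRU|) cost visible.
  size_t truncate_space(uint32_t space_id) {
    size_t visited = 0;
    {
      // While this is held, buf_page_make_young() etc. would block.
      std::lock_guard<std::mutex> g(lru_list_mutex);
      for (Page &p : lru) {
        ++visited;
        if (p.space_id == space_id) p.has_ahi_entry = false;
      }
    }
    {
      std::lock_guard<std::mutex> g(lru_list_mutex);
      for (auto it = lru.begin(); it != lru.end();) {
        ++visited;
        it = (it->space_id == space_id) ? lru.erase(it) : std::next(it);
      }
    }
    return visited;
  }
};
```

With ~12,000,000 pages in the LRU, each disconnection pays that full double scan even if the session's ibt tablespace held almost nothing.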

The pt-pmp result:
```
138 pthread_cond_wait,os_event::wait(os0event.cc:189),os_event::wait_low(os0event.cc:189),os_event_wait_low(os0event.cc:608),sync_array_wait_event(sync0arr.cc:444),TTASEventMutex::wait(ut0mutex.ic:89),TTASEventMutex::spin_and_try_lock(ib0mutex.h:662),TTASEventMutex::enter(ib0mutex.h:556),PolicyMutex<TTASEventMutex<GenericPolicy>(ib0mutex.h:556),buf_page_make_young(buf0buf.cc:2945),ibuf_update_free_bits_if_full(ibuf0ibuf.ic:301),btr_cur_optimistic_insert(ibuf0ibuf.ic:301),row_ins_sec_index_entry_low(row0ins.cc:3003),row_ins_sec_index_entry(row0ins.cc:3200),row_ins_index_entry(row0ins.cc:3300),row_ins_index_entry_step(row0ins.cc:3300),row_ins(row0ins.cc:3300),row_ins_step(row0ins.cc:3300),row_insert_for_mysql_using_ins_graph(row0mysql.cc:1584),row_insert_for_mysql(row0mysql.cc:1714),ha_innobase::write_row(ha_innodb.cc:8517),handler::ha_write_row(handler.cc:7858),write_record(sql_insert.cc:2108),Sql_cmd_insert_values::execute_inner(sql_insert.cc:621),Sql_cmd_dml::execute(sql_select.cc:715),mysql_execute_command(sql_parse.cc:3469),Prepared_statement::execute(sql_prepare.cc:3232),Prepared_statement::execute_loop(sql_prepare.cc:2806),mysqld_stmt_execute(sql_prepare.cc:1928),dispatch_command(sql_parse.cc:1684),do_command(sql_parse.cc:1275),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)

122 pthread_cond_wait,os_event::wait(os0event.cc:189),os_event::wait_low(os0event.cc:189),os_event_wait_low(os0event.cc:608),sync_array_wait_event(sync0arr.cc:444),TTASEventMutex::wait(ut0mutex.ic:89),TTASEventMutex::spin_and_try_lock(ib0mutex.h:662),TTASEventMutex::enter(ib0mutex.h:556),PolicyMutex<TTASEventMutex<GenericPolicy>(ib0mutex.h:556),buf_page_make_young(buf0buf.cc:2945),Buf_fetch::single_page(buf0buf.cc:4111),buf_page_get_gen(buf0buf.cc:4216),btr_cur_search_to_nth_level(btr0cur.cc:981),row_ins_sec_index_entry_low(row0ins.cc:2884),row_ins_sec_index_entry(row0ins.cc:3200),row_ins_index_entry(row0ins.cc:3300),row_ins_index_entry_step(row0ins.cc:3300),row_ins(row0ins.cc:3300),row_ins_step(row0ins.cc:3300),row_insert_for_mysql_using_ins_graph(row0mysql.cc:1584),row_insert_for_mysql(row0mysql.cc:1714),ha_innobase::write_row(ha_innodb.cc:8517),handler::ha_write_row(handler.cc:7858),write_record(sql_insert.cc:2108),Sql_cmd_insert_values::execute_inner(sql_insert.cc:621),Sql_cmd_dml::execute(sql_select.cc:715),mysql_execute_command(sql_parse.cc:3469),Prepared_statement::execute(sql_prepare.cc:3232),Prepared_statement::execute_loop(sql_prepare.cc:2806),mysqld_stmt_ex
ecute(sql_prepare.cc:1928),dispatch_command(sql_parse.cc:1684),do_command(sql_parse.cc:1275),handle_connection(connection_handler_per_thread.cc:302),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)

1 buf_LRU_drop_page_hash_for_tablespace,buf_LRU_flush_or_remove_pages,Fil_shard::space_truncate(fil0fil.cc:4293),ibt::Tablespace::truncate(srv0tmp.cc:142),ibt::Tablespace_pool::free_ts(srv0tmp.cc:229),ibt::free_tmp(srv0tmp.cc:377),innodb_session_t::~innodb_session_t(sess0sess.h:94),ut_allocator<innodb_session_t>::destroy(sess0sess.h:94),ut_delete<innodb_session_t>(sess0sess.h:94),innobase_close_connection(sess0sess.h:94),closecon_handlerton(handler.cc:916),plugin_foreach_with_mask(sql_plugin.cc:2687),plugin_foreach_with_mask(sql_plugin.cc:2700),ha_close_connection(handler.cc:928),THD::release_resources(sql_class.cc:1042),handle_connection(connection_handler_per_thread.cc:309),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
```

=========
The stack above is for the 1st iteration. I also managed to capture the 2nd one, by waiting 1-2 s longer than when capturing the 1st.

```
1 buf_page_in_file(buf0buf.ic:337),buf_LRU_remove_all_pages(buf0buf.ic:337),buf_LRU_remove_pages(buf0buf.ic:337),buf_LRU_flush_or_remove_pages(buf0buf.ic:337),Fil_shard::space_truncate(fil0fil.cc:4293),ibt::Tablespace::truncate(srv0tmp.cc:142),ibt::Tablespace_pool::free_ts(srv0tmp.cc:229),ibt::free_tmp(srv0tmp.cc:377),innodb_session_t::~innodb_session_t(sess0sess.h:94),ut_allocator<innodb_session_t>::destroy(sess0sess.h:94),ut_delete<innodb_session_t>(sess0sess.h:94),innobase_close_connection(sess0sess.h:94),closecon_handlerton(handler.cc:916),plugin_foreach_with_mask(sql_plugin.cc:2687),plugin_foreach_with_mask(sql_plugin.cc:2700),ha_close_connection(handler.cc:928),THD::release_resources(sql_class.cc:1042),handle_connection(connection_handler_per_thread.cc:309),pfs_spawn_thread(pfs.cc:2854),start_thread(libpthread.so.0),clone(libc.so.6)
```

"select count(*) from information_schema.processlist" uses an intrinsic temporary InnoDB table, and AHI is disabled for intrinsic temporary InnoDB tables (index->disable_ahi = true). I guess we may not need to scan and free pages from the LRU list; freeing from the FLUSH list should be enough. That path is already optimized and normally won't hurt performance, and the FLUSH list is usually much shorter than the LRU list.

But explicitly created "External Temporary Tables" face the same issue. A simple "create temporary table t1(id int) engine = innodb;" is enough to repro. For explicit temporary tables, AHI is not disabled and could still be used.

We may also disable AHI for them, or use other strategies, such as not truncating at each disconnection but only once the ibt tablespace size reaches a threshold; this may alleviate the issue.
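The threshold idea can be sketched as follows (hypothetical: the names and the `truncate_threshold_pages` knob are illustrative, not the actual srv0tmp.cc structures or a real server option):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the last suggestion above: on disconnect, return
// the session ibt tablespace to the pool, but pay the LRU-scan cost of
// truncation only when the file has actually grown past a threshold.
struct Ibt {
  uint64_t size_in_pages = 0;
};

struct TablespacePool {
  uint64_t truncate_threshold_pages = 0;  // assumed tuning knob, not a real option
  std::vector<Ibt *> free_list;
  uint64_t truncations = 0;               // how often the expensive path ran

  void free_ts(Ibt *ts) {
    if (ts->size_in_pages > truncate_threshold_pages) {
      ts->size_in_pages = 0;  // stand-in for ibt::Tablespace::truncate()
      ++truncations;
    }
    free_list.push_back(ts);
  }
};
```

A session that only ran a small implicit query would then skip the LRU scan entirely, at the cost of some disk space staying allocated between truncations.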
[8 Mar 2020 15:47] Fungo Wang
This bug can also be illustrated from another perspective: under an idle workload, CPU usage jumps to 100% at disconnection.

11:14:19 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
11:14:20 PM 60343     60409    1.00    0.00    0.00    1.00    18  mysqld
11:14:21 PM 60343     60409    0.00    0.00    0.00    0.00     2  mysqld
11:14:22 PM 60343     60409    1.00    0.00    0.00    1.00    18  mysqld
11:14:23 PM 60343     60409    0.00    0.00    0.00    0.00     2  mysqld
11:14:24 PM 60343     60409    1.00    0.00    0.00    1.00    18  mysqld
11:14:25 PM 60343     60409    1.00    0.00    0.00    1.00     2  mysqld
11:14:26 PM 60343     60409   52.00    1.00    0.00   53.00    20  mysqld // disconnection
11:14:27 PM 60343     60409  100.00    0.00    0.00  100.00     4  mysqld
11:14:28 PM 60343     60409  101.00    0.00    0.00  101.00     4  mysqld
11:14:29 PM 60343     60409   99.00    0.00    0.00   99.00     4  mysqld
11:14:30 PM 60343     60409  101.00    0.00    0.00  101.00     4  mysqld
11:14:31 PM 60343     60409   57.00    0.00    0.00   57.00     4  mysqld
11:14:32 PM 60343     60409    0.00    0.00    0.00    0.00     2  mysqld
11:14:33 PM 60343     60409    1.00    0.00    0.00    1.00    18  mysqld
11:14:34 PM 60343     60409    0.00    1.00    0.00    1.00    18  mysqld

And the on-CPU perf result shows this stack is the hottest.

-   56.55%  mysqld       mysqld               [.] buf_LRU_flush_or_remove_pages
     buf_LRU_flush_or_remove_pages
     Fil_shard::space_truncate
     ibt::Tablespace::truncate
     ibt::Tablespace_pool::free_ts
     innobase_close_connection
     closecon_handlerton
     plugin_foreach_with_mask
     plugin_foreach_with_mask
     THD::release_resources
     handle_connection
     pfs_spawn_thread
     start_thread
[9 Mar 2020 14:02] MySQL Verification Team
Hi Mr. Wang,

Thank you for your bug report.

I do have only one question and it is related to your second method of observing this behaviour.

What exactly do we have to do to observe the behaviour, other than having an idle server and disconnecting a client from it? Also, is that output from the program `top` or some other tool?

Waiting on your feedback.
[9 Mar 2020 14:50] Fungo Wang
Hi Sinisa,

It's my negligence; I should have written it more clearly.

You need to execute
either
1. select count(*) from information_schema.processlist;  // Intrinsic Temporary Table
or
2. create a temporary innodb table explicitly.  // External Temporary Table

After that, disconnection will repro the issue.

The cpu usage is the result of `pidstat -u -p <mysql pid> 1`.
The stack is from `perf record -ag -p <pid>` and `perf report`.
[9 Mar 2020 15:27] MySQL Verification Team
Hi Mr. Wang,

After following your last instructions, I managed to repeat the behaviour.

Verified as reported.
[19 Mar 2020 6:11] Jimmy Yang
Also see https://bugs.mysql.com/bug.php?id=98974
[19 Mar 2020 13:32] MySQL Verification Team
Thank you, Jimmy.
[20 Mar 2020 6:40] Satya Bodapati
Whether AHI helps for user temporary tables (CREATE TEMPORARY TABLE) has to be evaluated.

And the truncation of ibt tablespaces should follow the new undo-style truncation, which is drop + create and uses new space_ids.

AHI in general is an issue with dropping indexes, tables. See https://jira.percona.com/browse/PS-5769
[20 Mar 2020 6:43] Satya Bodapati
fil_replace_tablespace() instead of fil_truncate_tablespace()
[20 Mar 2020 8:02] Fungo Wang
Hi Satya,

> And the truncation of ibt tablespaces should follow the new undo style truncation.  New undo style truncation is drop + create and uses new space_ids.

We also discussed this solution internally with my colleagues, but it may not work very well if there are many short connections (say 64K, as MySQL cannot support more than 64K connections; 100K is a fake number, check bug #98624 and #98911): the space ids for ibt tablespaces would be exhausted quickly.

Check this in WL#11613, 

> A higher range of 400K space_ids are reserved for session temporary tablespaces.
> 400K is chosen because max_connections is 100K and we will reserve 4 space_ids
> per session. With this work, only 2 will be needed. We reserve more for some
> future purpose. So the total number of space_ids needed is 100K * 4 = 400K.

Only 400K space ids are reserved for ibt tablespaces.
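The exhaustion arithmetic is easy to sketch (the 400K pool size and 2 ids per session come from the WL#11613 quote above; the short-connection rate is purely an illustrative assumption):

```cpp
#include <cassert>
#include <cstdint>

// With drop + create and never-reused ids, every disconnecting session
// permanently consumes its space_ids. Pool size and ids-per-session are
// taken from the WL#11613 quote; the churn rate is made up for illustration.
constexpr uint64_t kReservedSpaceIds = 400000;  // reserved for session temp tablespaces
constexpr uint64_t kIdsPerSession = 2;          // one intrinsic + one external ibt

constexpr uint64_t sessions_until_exhaustion() {
  return kReservedSpaceIds / kIdsPerSession;
}

constexpr uint64_t seconds_until_exhaustion(uint64_t new_sessions_per_sec) {
  return sessions_until_exhaustion() / new_sessions_per_sec;
}
```

At even a few hundred short connections per second, the reserved range would last only minutes, which is why never reusing ids looks problematic.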
[20 Mar 2020 9:26] Satya Bodapati
@Fungo Wang

I see that MySQL is limited to 65K connections at the MDL level. When I implemented the feature, I chose 100K because it was the theoretical connection limit.

But how would IBT space_ids run out quickly at 65K connections or fewer? This is a bit surprising.

Coming to the space_ids: the undo space_id reservation logic is a bit different. They are managed with just 7-bit space_id values (2^7 space_ids).

There is space_id bank and groups.

The IBT space_ids are just in-memory and the limit can be increased if required.

BTW, why not use a thread pool, ProxySQL, or other techniques to reduce the total number of connections to MySQL?
[20 Mar 2020 13:09] MySQL Verification Team
Thank you, Satya, for your comments.

And I agree with you completely. Especially on the last comment.
[20 Mar 2020 14:06] Fungo Wang
Hi Satya,

> I see that MySQL is limiting at 65k connections at MDL levels. When I implemented the feature, I choose 100K because it was the theoretical connection limit.

What does "the theoretical connection limit" mean? What is this 100K number based on?

> But how IBT space_ids ran out quickly at 65K connections or less? This is bit surprising. 

If we choose drop + create and use new space ids on creation, each connection will consume at most 2 space ids, and after disconnection the old space ids cannot be reused because of the potential AHI issues; we don't know when it's safe to reuse them.

> Coming to the space_ids. Undo space_id reservation logic is bit different. They are just managed with 7bit space_id values (2^7 space_ids).

> There is space_id bank and groups.

Thanks for the info, I hadn't noticed this space_id logic before :)

While checking the undo truncate logic, it looks like it's the same as ibt truncation: buf_remove_t is BUF_REMOVE_ALL_NO_WRITE, and the LRU will be scanned. This is not needed; smells like a new bug :)

```
bool fil_replace_tablespace(space_id_t old_space_id, space_id_t new_space_id,
                            page_no_t size_in_pages) {
  fil_space_t *space = fil_space_get(old_space_id);
  std::string space_name(space->name);
  std::string file_name(space->files.front().name);

  /* Delete the old file and space object. */
  dberr_t err = fil_delete_tablespace(old_space_id, BUF_REMOVE_ALL_NO_WRITE);
```

> The IBT space_ids are just in-memory and the limit can be increased if required.

This may not be an ideal option; as said above, short connections could quickly exhaust the range.

> BTW, why not thread-pool or proxysql or other techniques to reduce number of total connections to mysql?

Yeah, 3rd party products could help.

I guess the thread pool is aimed at limiting the number of threads, not connections.

ProxySQL may help solve the connections issue, but I'm not familiar with it...
[21 Mar 2020 3:12] Satya Bodapati
"While checking the undo truncate logic, looks like it's same with ibt truncation, buf_remove_t is BUF_REMOVE_ALL_NO_WRITE, and LRU will be scanned. This is not needed, smells like a new bug :)"

Yes indeed, REMOVE_ALL_NO_WRITE is not necessary.

4 row0mysql.cc row_drop_tablespace          4281 err = fil_delete_tablespace(space_id, BUF_REMOVE_FLUSH_NO_WRITE);

1 fil0fil.cc   fil_replace_tablespace       4624 dberr_t err = fil_delete_tablespace(old_space_id, BUF_REMOVE_ALL_NO_WRITE);

Clearly no AHI exists for UNDO, so FLUSH_NO_WRITE is sufficient.
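The cost difference between the two removal modes can be shown with a node-count sketch (a toy model, not the real buf0lru.cc code; the real modes differ in what they do to each page, but the lists they walk are as modeled):

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <list>

// Toy cost model: BUF_REMOVE_ALL_NO_WRITE must walk the whole LRU list,
// while BUF_REMOVE_FLUSH_NO_WRITE only walks the flush list of dirty
// pages, which is usually far shorter. Counting visited nodes shows why
// FLUSH_NO_WRITE is the cheap choice when no AHI entries exist.
struct Page { uint32_t space_id; };

struct Buf {
  std::list<Page> lru;    // every page in the buffer pool
  std::list<Page> flush;  // dirty pages only (a small subset)

  // Models BUF_REMOVE_ALL_NO_WRITE: scan the entire LRU list.
  size_t remove_all_no_write(uint32_t space) {
    size_t visited = 0;
    for (auto it = lru.begin(); it != lru.end();) {
      ++visited;
      it = (it->space_id == space) ? lru.erase(it) : std::next(it);
    }
    return visited;
  }

  // Models BUF_REMOVE_FLUSH_NO_WRITE: scan only the flush list.
  size_t remove_flush_no_write(uint32_t space) {
    size_t visited = 0;
    for (auto it = flush.begin(); it != flush.end();) {
      ++visited;
      it = (it->space_id == space) ? flush.erase(it) : std::next(it);
    }
    return visited;
  }
};
```

On a large buffer pool the LRU list is millions of nodes while the flush list is typically orders of magnitude smaller, so switching modes removes the choke.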
[21 Mar 2020 3:17] Satya Bodapati
RENAME TABLE also has the same problem (ALL_NO_WRITE is used).
[22 Mar 2020 9:28] Fungo Wang
Hi Satya,

I have just filed bug #99021 for the new bug we discussed here about BUF_REMOVE_ALL_NO_WRITE,  if you are interested and may want to follow.

Thanks!
[23 Mar 2020 13:56] MySQL Verification Team
Thank you both for your contribution.

I will attend to both of the new bugs.
[25 Nov 2020 18:25] Daniel Price
Posted by developer:
 
Fixed in 8.0.23 by WL#14100.
Bug numbers added to the 8.0.23 changelog entry for WL#14100.
[30 Nov 2020 14:42] MySQL Verification Team
Thank you, Daniel.
[20 Jan 10:11] WANG GUANGYOU
I checked the implementation: https://dev.mysql.com/worklog/task/?id=14100
Is it worth checking every page on read?
[20 Jan 13:28] MySQL Verification Team
HI Mr. Wang,

It depends on the user.

In your case we would say that it would be worth it.