Bug #99609 Resource_group_mgr heap-use-after-free
Submitted: 18 May 4:17 Modified: 9 Jun 15:42
Reporter: xiaoyu wang (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Options Severity:S1 (Critical)
Version:MySQL Server 8.0.3 Community OS:Any
Assigned to: Thayumanavar Sachithanantham CPU Architecture:Any

[18 May 4:17] xiaoyu wang
Description:
An OCA was signed with the name of Tencent Technology (Shenzhen) Company Limited.

The mysql-test case main.table_open_cache_functionality reported heap-use-after-free.
Originally, it happened once and doesn't repeat. We make a minor change to source code to make it easier to repeat.
Build with asan, and the attachment asan.txt is the stack.

Causes:
At the bootstrap phase, main thread and child threads access and release the memory of Resource_group_mgr concurrently.
The following codes lead to this bug. (`X-->Y` denotes `X invokes Y`)
1. [main thread] init_server_components-->Dictionary_impl::init-->run_bootstrap_thread
   Dictionary_impl::init() creates several child threads, which we call `dd thread` for illustration.
2.1 [main thread] init_server_components-->Resource_group_mgr::disable_resource_group-->deinit-->delete m_resource_group_hash-->Resource_group dtor
    Afterwards, m_sys_default_resource_group and m_usr_default_resource_group are released as their owner, m_resource_group_hash, is deleted in Resource_group_mgr::deinit().
2.2 [dd thread] run_bootstrap_thread-->mysql_thread_create-->pfs_spawn_thread_v1 --> pfs_spawn_thread --> pfs_notify_thread_create --> thread_create_callback --> set_res_grp_in_pfs --> pfs_set_thread_resource_group_by_id_v1 --> set_thread_resource_group-->memcpy
    set_thread_resource_group() tries copying from freed memory. From functions below, we see that m_sys_default_resource_group->name().c_str() is the source of memcpy() and Resource_group::name() returns a reference. m_sys_default_resource_group was released by `main thread` in 2.1, so that heap-use-after-free happens to `dd thread`.

    int set_thread_resource_group(PFS_thread *pfs, const char *group_name, int group_name_len, void *user_data) {
      ...
      memcpy(pfs->m_groupname, group_name, group_name_len); // group_name was freed
      ...
    }
    group_name of set_thread_resource_group() comes from  thread_create_callback().

    void thread_create_callback(const PSI_thread_attrs *thread_attrs) {
      ...
      res_grp_mgr->set_res_grp_in_pfs(res_grp->name().c_str(), //-->pfs_set_thread_resource_group_by_id_v1
            res_grp->name().length(), thread_attrs->m_thread_internal_id); //Resource_group::name returns a reference
      ...
    }

The stack in asan.txt shows a heap-use-after-free in dd child threads. Besides, we've met child threads created by plugin_register_dynamic_and_init_all() reporting heap-use-after-free for the same reason.

How to repeat:
How to repeat:
1. Apply the attachment repeat.patch. It's to be explained later why the patch works.
2. Build with -DWITH_ASAN=1.
3. Run mysql-test: ./mtr main.table_open_cache_functionality --mysqld="--thread_handling=no-threads".

Explanation: 
0. Check repeat.patch first.
1. Why do we need repeat.patch?
    As said, this bug happened only once and doesn't repeat, so we change source code to make repeat easier.
    We added several log printing for positioning the problem and found that an extra sql_print_error() in find_thread() made heap-use-after-free happen again. So we think thread concurrency might be the key and the attachment repeat.patch slows down child threads to make the problem observable.

    Is repeat.patch reasonable?
    Threads' execution speed should not cause a concurrency problem. Slowing down makes the problem easier to be observed.

    How to determine the sleeping time?
    It depends on the hardware and you might need to find an appropriate sleeping time.
    For reference, on our machine, slowing down by 1 second in each loop iteration could repeat heap-use-after-free steadily. (Slowing down for too long would make mysql server fail to start up within the mysql-test limited time and slowing down for too short could not expose the problem.)
    Our machine:
        CPU:Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz   x   80 Cores
        Mem:186G
        Kernel:4.14.105
        OS:CentOS 7.2
    
2. The repeat requires:
    a) performance_schema_max_thread_classes != 0. This variable is 100 in main.table_open_cache_functionality.
    b) thread_handling != one-thread-per-connection. Give it `no-threads` in mysql-test cmd line.

Suggested fix:
See attachment fix.patch.
m_sys_default_resource_group.name() always returns a string "SYS_default" and it's given in Resource_group_mgr::init(), so we introduce a function `const char* get_sys_default_resource_group_name()` returning the constant string "SYS_default". thread_create_callback() uses get_sys_default_resource_group_name() instead of m_sys_default_resource_group.name().
[18 May 4:18] xiaoyu wang
asan stack

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: asan.txt (text/plain), 6.23 KiB.

[18 May 4:18] xiaoyu wang
a minor change for repeating

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: repeat.patch (application/octet-stream, text), 451 bytes.

[18 May 4:19] xiaoyu wang
fix

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: fix.patch (application/octet-stream, text), 4.41 KiB.

[18 May 12:42] MySQL Verification Team
Hi Mr. wang,

Thank you for your bug report.

However, we can not verify this bug. You quote that you were testing it on the MySQL 8.0.3 release of the server, while the current release is 8.0.20.

Please, check whether your report is valid for 8.0.20 as well.
[19 May 13:41] xiaoyu wang
Hi Sinisa,
Thanks for replying.
I just repeated it on 8.0.20 and attachments for 8.0.20 are uploaded.

The key to repeat this problem is to find an appropriate sleeping time. Applying repeat-8.0.20.patch directly most likely won't reproduce it in your environment. `usleep(10000)` in repeat-8.0.20.patch only means it can reproduce the problem in my virtual machine. You might need to make some experiments, sleeping for 500ms, 1sec, 2sec and so on, to determine the value that could reproduce the problem.

Environment(A virtual machine):
    CPU: Intel(R) Xeon(R) Gold 61xx CPU @ 2.50GHz  x  8 Cores
    Mem: 16G
    Kernel: 3.10.107
    OS: CentOS 7.2
    MySQL Version : Ver 8.0.20-debug-asan for Linux on x86_64 (Source distribution)
[19 May 13:41] xiaoyu wang
asan stack on 8.0.20

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: asan-8.0.20.txt (text/plain), 13.07 KiB.

[19 May 13:42] xiaoyu wang
a minor change for repeating on 8.0.20

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: repeat-8.0.20.patch (application/octet-stream, text), 479 bytes.

[19 May 13:42] xiaoyu wang
fix on 8.0.20

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: fix-8.0.20.patch (application/octet-stream, text), 5.24 KiB.

[20 May 12:11] MySQL Verification Team
Hi Mr. wang,

Thank you for your feedback and your patch.

However, we need a test case that we can reproduce, without changing our code in any way.

We can run test case as many times as necessary, but it should be reproduced on the original code.

Thank you in advance.
[30 May 8:34] xiaoyu wang
Hi Sinisa,

Thanks for your replying.

I understand reproducing on the original source code would be the perfection.
But as said, this is about thread concurrency and it may never happen on a well-performed machine.

So, I'm gonna talk about this problem in four aspects:
a) The problematic source code. b) Why mtr on original code works well. c) Why repeat-8.0.20 is reasonable. d) Why fix-8.0.20.patch works

An description with figures would be better, so please see the attachment description.pdf. :)
[30 May 8:35] xiaoyu wang
description with figures

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: description.pdf (application/pdf, text), 1.02 MiB.

[1 Jun 12:19] MySQL Verification Team
Hi Mr. wang,

Your PDF is quite convincing.

This bug is now verified and assigned.
[8 Jun 13:15] Paul Dubois
Posted by developer:
 
Fixed in 8.0.22.

mysql-test-run.pl tests under no-threads connection handling failed
with ASAN builds due to improper resource group initialization. This
has been fixed. Thanks to Xiaoyu Wang, Tencent Technology for the
contribution.
[9 Jun 12:09] MySQL Verification Team
Thank you, Paul.
[9 Jun 15:42] xiaoyu wang
Thanks :)
[20 Oct 6:56] Frederic Descamps
Thank you for your contribution that has been added to 8.0.22: https://lefred.be/content/mysql-8-0-22-thank-you-for-the-contributions/
[20 Oct 12:36] MySQL Verification Team
Thank you, Frederic.