Bug #101789 CPU 100% MySQL 8 on Windows 2019
Submitted: 28 Nov 2020 21:07 Modified: 18 Jan 2021 17:27
Reporter: Alex K. Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:8.0.21, 8.0.22 OS:Windows (2019)
Assigned to: CPU Architecture:x86 (32 cores)

[28 Nov 2020 21:07] Alex K.
Description:
Hi,

We run MySQL community 8.0.22 on Windows 2019. It's only InnoDB tables. Normally, our CPU usage is about 10-15%. Maybe 20% on peak time, as this server is used by a busy website.

About 2 or 3 times per day, we get the following issue:

1. CPU is completely bottlenecked at 100%, will not resolve unless we restart mysql
2. We see a number of very simple queries taking forever to load. For example:

SELECT count(id)
FROM g3hy2_usergroups

g3hy2_usergroups is a very small table with 37 rows. Normally this query takes a few milliseconds to return results.

3. Trying to kill the queries, does not work, CPU remains at 100%, and/or we get this message:

Error Killing Connection

Error executing KILL QUERY on thread 2211214: Error executing 'KILL QUERY 2211214'
Unknown thread id: 2211214.
SQL Error: 1094

While we initially thought that this is due to bad queries in our application, some of which we actually fixed, we're beginning to think that this is actually a MySQL bug.

Images:

https://ibb.co/dBNJ5tq
https://ibb.co/FWtn3sd

Thank you
Alex

How to repeat:
We can give access to our database server so that you can see what's going on when this happens. We can only replicate it in our live environment.
[29 Nov 2020 16:28] MySQL Verification Team
Can you check if my.ini parameter will solve this?

[mysqld]
innodb_parallel_read_threads=1
[29 Nov 2020 16:33] Alex K.
Hi, the default value is 4.

Should i try to lower it to 1 ?
[30 Nov 2020 15:02] MySQL Verification Team
Hi Mr. K.

Thank you for your bug report.

Please, do set that system variable to 1 and try again. Let us know if you can repeat this behaviour.

If you manage to do it, then, please send us the full output from the innodb status command. This command is described in our Reference Manual.

We are waiting on your feedback.
[30 Nov 2020 18:32] Alex K.
Hi there,

I've added innodb_parallel_read_threads=1 to my.ini
Since this change, there hasn't been a bottleneck. It's almost 30 hours now (usually we had a bottleneck every 8 hours).

So is this related to a bug that remains to be fixed? or something related to the parallel threads?

Does innodb_parallel_read_threads=1 means our SELECTS will be slower?

Thanks
Alex
[1 Dec 2020 17:36] Sunny Bains
Hi,

Can you please post an equivalent table definition (ideally with anonymized data) on which you are see this problem.

Regards.
[2 Dec 2020 13:15] MySQL Verification Team
Hi Mr. K,

Can you also send us contents of  your my.ini file ????
[2 Dec 2020 14:13] Alex K.
Hi, I have uploaded my.ini file.

Please help me with "equivalent table definition" as I don't know exactly what you mean.

Thanks.
[2 Dec 2020 14:14] Alex K.
By the way, still no bottleneck! it's been several days now.
[2 Dec 2020 14:28] Alex K.
Unbelievable, I had to say it.

We're having another bottleneck right now. CPU is completely bottlenecked at 100%.
Simple queries like this one:

SELECT count(id)
FROM g3hy2_usergroups

Taking forever to load, a lot of them in the SHOW PROCESSLIST.

At the same time when the bottleneck happens, when I manually run this query, it returns results just fine.
[2 Dec 2020 14:31] MySQL Verification Team
Hi,

Please, do note that we also asked for  equivalent table definition with as many rows as you can provide.

Waiting on your feedback.
[2 Dec 2020 19:29] Alex K.
Hi,

I have uploaded the table that we mostly see on the process list during bottleneck, including the small data that it contains. We have replaced some sensitive data with ******.

Let me know if that helps.

Thanks.
[2 Dec 2020 20:59] MySQL Verification Team
On windows 8.0.22, I did a test with the sample table having 37 rows.
Ran the query: select count(id) from table;

in 1 thread:  ~2000 qps
in 2 threads: ~3000 qps
in 3 threads: ~3500 qps
in 4 threads: 78 qps

at 4+ threads there's a major performance hit.
This was running with --innodb-parallel-read-threads=0 --innodb-thread-concurrency=0

I think something needs investigating here.
[2 Dec 2020 21:41] MySQL Verification Team
thread stacks taken at a point in time with slow QPS

Attachment: bug101789_8.0.22_thread_stacks.txt (text/plain), 124.01 KiB.

[2 Dec 2020 21:43] MySQL Verification Team
I've seen enough to verify this on 8.0.22 on Windows.

mysql> show full processlist;
+-----+-----------------+----------------+------+---------+------+------------------------+-------------------------------------------+
| Id  | User            | Host           | db   | Command | Time | State                  | Info                                      |
+-----+-----------------+----------------+------+---------+------+------------------------+-------------------------------------------+
|   5 | event_scheduler | localhost      | NULL | Daemon  |  846 | Waiting on empty queue | NULL                                      |
|  17 | root            | localhost:5735 | test | Query   |    0 | init                   | show full processlist                     |
| 178 | root            | localhost:9392 | test | Query   |    1 | executing              | select  count(id)  from  g3hy2_usergroups |
| 179 | root            | localhost:9394 | test | Query   |    3 | executing              | select  count(id)  from  g3hy2_usergroups |
| 180 | root            | localhost:9395 | test | Query   |    1 | executing              | select  count(id)  from  g3hy2_usergroups |
| 181 | root            | localhost:9396 | test | Query   |    2 | executing              | select  count(id)  from  g3hy2_usergroups |
| 182 | root            | localhost:9397 | test | Query   |    0 | executing              | select  count(id)  from  g3hy2_usergroups |
| 183 | root            | localhost:9398 | test | Query   |    2 | executing              | select  count(id)  from  g3hy2_usergroups |
| 184 | root            | localhost:9399 | test | Query   |    3 | executing              | select  count(id)  from  g3hy2_usergroups |
| 185 | root            | localhost:9400 | test | Query   |    3 | executing              | select  count(id)  from  g3hy2_usergroups |
| 186 | root            | localhost:9402 | test | Query   |    2 | executing              | select  count(id)  from  g3hy2_usergroups |
| 187 | root            | localhost:9403 | test | Query   |    1 | executing              | select  count(id)  from  g3hy2_usergroups |
| 188 | root            | localhost:9404 | test | Query   |    1 | executing              | select  count(id)  from  g3hy2_usergroups |
| 189 | root            | localhost:9405 | test | Query   |    2 | executing              | select  count(id)  from  g3hy2_usergroups |
+-----+-----------------+----------------+------+---------+------+------------------------+-------------------------------------------+
14 rows in set (0.00 sec)

It becomes VERY slow at 10 threads;

# Runtime      QPS       Qrun    RowRcvd   RowAfctd
#-------------------------------------------------------
#00:11:13        8       4802       1207          0
#00:11:16       11       4836       1241          0
#00:11:19        8       4862       1267          0
#00:11:23        8       4887       1292          0
#00:11:26        8       4911       1316          0
#00:11:29        8       4935       1340          0
#00:11:32        7       4958       1363          0
#00:11:35        8       4983       1388          0
#00:11:38        8       5009       1414          0
#00:11:41        8       5033       1438          0
#00:11:45        8       5059       1464          0
#00:11:48        8       5084       1489          0
#00:11:51        8       5110       1515          0
#00:11:54        9       5137       1542          0
#00:11:57        9       5165       1570          0
#00:12:01        9       5192       1597          0
#00:12:04        9       5219       1624          0
#00:12:07        8       5245       1650          0
#00:12:10        8       5270       1675          0
#00:12:13        8       5296       1701          0
#-------------------------------------------------------
[2 Dec 2020 21:47] Alex K.
Hi, I'm glad this is verified.

On our server, those selects take forever to return results (they never stop).

Alex
[2 Dec 2020 21:47] MySQL Verification Team
same behavior on 8.0.21.
[2 Dec 2020 21:59] MySQL Verification Team
5.7.21 with 4 threads also ate CPU, but that's because it had a huge throughput, so that's fine.

#------------------------------------------------
# Runtime      QPS       Qrun    RowRcvd   RowAfc
#------------------------------------------------
#00:00:00      831       2493       2493

#00:00:05    32159      98971      98971
#00:00:10    35906     206689     206689
#00:00:16    40470     328101     328101
#00:00:23    50768     480407     480407
#00:00:33    71166     693907     693907
#00:00:36    20790     756277     756277
#00:00:39    20781     818621     818621
#00:00:42    20579     880360     880360
#00:00:45    20628     942246     942246
[2 Dec 2020 22:30] MySQL Verification Team
Similar problem on Linux 8.0.22.  The moment the number of threads goes above the number of CPU cores,  the overall performance gets ~10x slower than a single thread.  It falls off a cliff fast.

This seems unexpected to me. 5.7.31 didn't hit such a hard ceiling.

At least devs can profile this with "perf" on linux easier now.  
I tested with --innodb-parallel-read-threads=1.
[2 Dec 2020 22:39] Sunny Bains
Looks like the issue is in this code:

void IB_thread::start() {
  ut_a(state() == State::NOT_STARTED);
  m_state->store(State::ALLOWED_TO_START);
  while (state() == State::ALLOWED_TO_START) {
    UT_RELAX_CPU(); 
  }
  const auto state_after_start = state();
  
  ut_a(state_after_start == State::STARTED ||
       state_after_start == State::STOPPED);
}

I suspect that a memory barrier will be required in the check for state. Why this problem only manifests on Windows (assuming it's a memory barrier issue), I can't say.
[3 Dec 2020 12:51] MySQL Verification Team
Thank you  very much, Shane and Sunny.

I took a look myself and it seems that problem started when we stopped using configuration #define's HAVE_PAUSE_INSTRUCTION and HAVE_FAKE_PAUSE_INSTRUCTION.
[3 Dec 2020 12:53] MySQL Verification Team
Increasing severity.
[11 Dec 2020 21:51] Alex K.
Hi there,

Is there any update on this?

Thanks
Alex
[11 Dec 2020 23:23] Sunny Bains
We have figured out the underlying issue.The fix will be pushed soon. Thank you for helping us debug this issue.

Regards,
-sunny
[14 Dec 2020 13:49] MySQL Verification Team
Thank you, Sunny and Alex.
[14 Dec 2020 13:55] Alex K.
Hi, is this going to be included in the next MySQL community upgrade? for example 8.0.23?

Is this the way that I can get the fix?

Thanks
Alex
[6 Jan 2021 12:15] Alex K.
Hi and happy new year.

is there any update on this?

Thanks
Alex
[9 Jan 2021 5:56] Darshan M N
Hi, the fix will be pushed in a few days as internal testing is still going on. It would be available in the 8.0.24 release.
[11 Jan 2021 15:02] MySQL Verification Team
Thank you, Darshan ...........
[18 Jan 2021 17:18] Daniel Price
Posted by developer:
 
commit 3b36c7dad05c288c03208c8b10863f1c53e0e153
Date:   Mon Jan 11 19:09:01 2021 +0530

    Bug#32224707 CPU 100% MYSQL 8 ON WINDOWS 2019
    
    Issue
    =====
    1. On windows, running SELECT COUNT(*) queries parallely from various client
    threads, with the number of client threads exceeding the number of cores of the
    machine, can cause stalls for a very long time. The stall is seen in
    IB_thread::start(), while spawning threads for parallel read, where we spin-wait
    in a loop waiting for the state to be set.
    
    It is seen that we cannot just rely on UT_RELAX_CPU() when the number of threads
    to be spawned exceeds the number of cores of a machine. This is because
    UT_RELAX_CPU uses the YieldProcesser() system call,  which only makes the
    processor available to other logical processors. But if all the logical
    processors are busy in spin wait then no real work is done causing the
    hang-like, situation.
    
    2. The parallel reader infrastructure, allows the user to set the parameter
    innodb_parallel_reader_threads (default being 4), to configure the number of
    threads to be spawned to carry out the parallel read operation. This parameter
    is supposed to be an upper limit, with the infrastructure using only as many
    threads required as possible. But we currently spawn as many threads as the
    parameter value even though it's not necessary.
    
    Fix
    ===
    1. If we are still stuck in spin-wait, after calling UT_RELAX_CPU multiple
    times, allow the thread to release its time slice.
[18 Jan 2021 17:27] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 8.0.24 release, and here's the proposed changelog entry from the documentation team:

On Windows, stalls were caused by concurrent SELECT COUNT(*) queries from
multiple client threads where the number of client threads exceeded the
number of machine cores.
[19 Jan 2021 14:05] MySQL Verification Team
Thank you, Daniel .....
[19 Mar 2021 7:01] MySQL Verification Team
On linux this manifested too, here's the perf output of 8.0.23.

-  100.00%   100.00%  mysqld
   - 93.64% start_thread
      - 93.55% pfs_spawn_thread
         - 93.55% handle_connection
            - 93.54% do_command
               - 93.40% dispatch_command
                  - 93.20% dispatch_sql_command
                     - 92.81% mysql_execute_command
                        - 92.56% Sql_cmd_dml::execute
                           - 91.82% SELECT_LEX_UNIT::execute
                              - 91.80% SELECT_LEX_UNIT::ExecuteIteratorQuery
                                 - 91.64% UnqualifiedCountIterator::Read
                                    - 91.63% get_exact_record_count
                                       - 91.62% ha_innobase::records
                                          - 91.58% row_scan_index_for_mysql
                                             - 91.50% row_mysql_parallel_select_count_star
                                                - 91.05% Parallel_reader::run
                                                   - 91.04% Parallel_reader::parallel_read
                                                        77.14% IB_thread::start
                                                        13.61% IB_thread::wait

Confirmed it's fixed in current trunk for linux too.
[13 Apr 2021 14:11] Daniel Price
Posted by developer:
 
Revised entry:

Fixed as of the upcoming 8.0.24 release:

On Windows, stalls were caused by concurrent SELECT COUNT(*) queries
where the number of parallel read threads exceeded the number of machine
cores. 

Patch applied to other platforms in 8.0.25:
Bug #32678019 	BUSY WAIT IN IB_THREAD::START CAN CAUSE PERFORMANCE ISSUES
[13 Apr 2021 14:11] Daniel Price
Posted by developer:
 
Revised entry:

Fixed as of the upcoming 8.0.24 release:

On Windows, stalls were caused by concurrent SELECT COUNT(*) queries
where the number of parallel read threads exceeded the number of machine
cores. 

Patch applied to other platforms in 8.0.25:
Bug #32678019 	BUSY WAIT IN IB_THREAD::START CAN CAUSE PERFORMANCE ISSUES
[14 Apr 2021 12:58] MySQL Verification Team
Thank you, Daniel.