Bug #115623 | Is it necessary to access leaves when generating Ranges in parallel scan(count*) | ||
---|---|---|---|
Submitted: | 17 Jul 2024 8:58 | Modified: | 17 Jul 2024 12:21 |
Reporter: | Ruyi Zhang (OCA) | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: DML | Severity: | S5 (Performance) |
Version: | 8.0.38 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | count (*) |
[17 Jul 2024 8:58]
Ruyi Zhang
[17 Jul 2024 9:51]
MySQL Verification Team
Hi Mr. Zhang, Thank you for your bug report. We do not see which storage engine you are discussing here, but we suppose that it is about the InnoDB SE. Since InnoDB SE is MVCC SE, then SELECT COUNT(*) has to literally count all rows that is visible in that session. That is why you can get the approximate count number in our Information Schema. Not a bug.
[17 Jul 2024 10:23]
Ruyi Zhang
Thank you very much. What I described is indeed InnoDB. More specifically, count(*) is implemented through the function row_mysql_parallel_select_count_star. As mentioned above, when BTREE is split into Ranges, the split thread may spend more time loading leaf pages to buffer-pool, if access to leaf pages is given to the worker thread, this part of the IO time may be saved.
[17 Jul 2024 10:55]
MySQL Verification Team
Hi Mr. Zhang, According to MVCC standards, each concurrent session sees it's own values of the number of rows, regardless if there is a condition or not. Also, InnoDB has a clustered primary key. If a primary key is not specified, then an invisible primary key is generated. Hence, primary key is stored with the entire record. Count of the rows is done by the primary key entries. Hence, there you do not have separate pages for the leaf pages or node pages. Due to that organisation of the InnoDB, it is much faster to count the rows sequentially then to jump through the BTREE. Counting on some condition, that can be met by some index is a totally different affair. We recommend that when you need a total count of the rows in some InnoDB table, you use Information Schema. Not a bug.
[17 Jul 2024 11:29]
Ruyi Zhang
Thanks, yes it is necessary for InnoDB to access all leaves when executing count(*). But what I want to confirm is: the Parallel_reader::Scan_ctx::partition function splits BTREE (clustered) into multiple Ranges, and each Range will scan its corresponding leaf pages in parallel through different threads (innodb_parallel_read_threads). Since Parallel_reader::Scan_ctx::partition can obtain Range information by accessing non-leaf nodes, is it still necessary to access leaf pages at the same time?
[17 Jul 2024 12:15]
MySQL Verification Team
Hi, With primary index, leaf pages do not exist. Only node pages exist. Leaf pages are the records themselves, kept in the data pages of the InnoDB. Also, it is a single operation, not performed in parallel, due to MVCC restrictions.
[17 Jul 2024 12:21]
Ruyi Zhang
Thanks, In fact, 8.0.38 already supports the feature of parallel execution of COUNT(*) https://dev.mysql.com/worklog/task/?id=11720