Bug #117164 Range estimation (index dives) overestimates the row count for large ranges; it almost doubles the actual number of rows
Submitted: 9 Jan 6:17    Modified: 9 Jan 7:04
Reporter: Alexander Baschuk
Status: Verified    Impact on me: None
Category: MySQL Server: Optimizer    Severity: S3 (Non-critical)
Version: 8.0.36, 8.0.40    OS: Any
Assigned to:    CPU Architecture: Any
Tags: Optimizer, range optimizer, rows estimate, Rows_examined

[9 Jan 6:17] Alexander Baschuk
Description:
When estimating the number of rows in a range (via index dives), the value is accurate for small ranges.
However, starting from a certain range size, the estimate becomes significantly inaccurate and is roughly double the actual number of records within the range.

How to repeat:
I used MySQL v8.0.36.

- Download the test employees database from: https://dev.mysql.com/doc/employee/en/employees-installation.html

- Initialize the database using the script employees.sql.

- Run the following query on different data ranges. For each range, execute EXPLAIN and compare the estimated row count (column "Rows") with the actual count (COUNT(*)). For small ranges, the estimate will match the actual count exactly.
However, starting from a certain range size (approximately 500–1000 records), the estimated value becomes inaccurate, typically double the actual number of records.

EXPLAIN
SELECT COUNT(*)
FROM salaries
WHERE emp_no BETWEEN 11000 AND 11400 -- estimated: 3879, actual: 3879 - exactly counted (less than 10 index pages)
-- WHERE emp_no BETWEEN 11000 AND 12000 -- estimated: 17856, actual: 9613
-- WHERE emp_no BETWEEN 12000 AND 14000 -- estimated: 38640, actual: 18995
-- WHERE emp_no BETWEEN 13000 AND 16000 -- estimated: 52932, actual: 28280 
-- WHERE emp_no BETWEEN 14000 AND 18000 -- estimated: 75920, actual: 37974
-- WHERE emp_no BETWEEN 15000 AND 20000 -- estimated: 94720, actual: 47325
-- WHERE emp_no BETWEEN 16000 AND 22000 -- estimated: 121392, actual: 57007
-- WHERE emp_no BETWEEN 17000 AND 24000 -- estimated: 127378, actual: 66294
-- WHERE emp_no BETWEEN 18000 AND 26000 -- estimated: 150760, actual: 75819
-- WHERE emp_no BETWEEN 19000 AND 28000 -- estimated: 167876, actual: 85577
-- WHERE emp_no BETWEEN 20000 AND 30000 -- estimated: 197434, actual: 95212
;

Suggested fix:
The cause of the 2x overestimation is the following piece of code in the btr_estimate_n_rows_in_range_low function:
https://github.com/mysql/mysql-server/blob/trunk/storage/innobase/btr/btr0cur.cc, line 5235

if (i > divergence_level + 1 && !is_n_rows_exact) {
    /* In trees whose height is > 1 our algorithm
    tends to underestimate: multiply the estimate
    by 2: */

    n_rows = n_rows * 2;
}

Indeed, in all my measurements, the estimated "Rows" values for larger ranges were even numbers.
This multiplication by 2 does not improve accuracy but instead worsens it.
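
To quantify this with the numbers from the query above: dividing each reported estimate by 2 brings it within about 7% of the actual count, while the reported (doubled) values are roughly twice the actual counts. A minimal standalone sketch (not server code; the estimate/actual pairs are copied from the comments in the query above):

// Standalone check, not server code: compare the reported estimates with
// the actual counts, before and after removing the hardcoded factor of 2.
#include <cstdio>

int main() {
  // {estimated, actual} pairs copied from the EXPLAIN runs above.
  const long long pairs[][2] = {
      {17856, 9613},   {38640, 18995},  {52932, 28280},  {75920, 37974},
      {94720, 47325},  {121392, 57007}, {127378, 66294}, {150760, 75819},
      {167876, 85577}, {197434, 95212}};

  for (const auto &p : pairs) {
    const double err_reported = 100.0 * (p[0] - p[1]) / p[1];
    const double err_halved = 100.0 * (p[0] / 2.0 - p[1]) / p[1];
    std::printf("estimated %6lld, actual %5lld: error %+7.1f%%, halved %+6.1f%%\n",
                p[0], p[1], err_reported, err_halved);
  }
  return 0;
}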

I suspect the reason for this multiplication was an earlier bug:
https://bugs.mysql.com/bug.php?id=71199 (2013, versions 5.5, 5.6).

It seems the developer avoided investigating the root cause of the issue and simply hardcoded this multiplication.

It’s plausible that in earlier MySQL versions, there was indeed underestimation as mentioned in bug 71199. However, in later versions, MySQL seems to calculate estimates more accurately, making the multiplication by 2 unnecessary and resulting in doubled estimates.

I analyzed the btr_estimate_n_rows_in_range_on_level(...) function and found other possible causes of inaccurate estimates.

Cause 1

Line 5329:
When btr_estimate_n_rows_in_range_on_level() is called, the n_rows value passed in does not include the border records. However, inside the function, at line 4997, where the number of pages is multiplied by the average number of rows per page, n_rows_on_prev_level is used as if it did include the borders.

n_rows = n_rows_on_prev_level * n_rows / n_pages_read;

Cause 2

When calculating n_rows within btr_estimate_n_rows_in_range_on_level(), rows from the first (partial) page (line 4897) and the last page (line 4904) are added, even in the inexact case.

Line 4997:
In the final formula, when calculating the average number of rows per page, we use the value n_rows, which at this point contains:
  - Part of the rows from the first page,
  - All rows from the next 9 examined pages,
  - Part of the rows from the last page (which is not part of the examined range of the first 10 pages).

As a result, the expression n_rows / n_pages_read does, on average, give the number of rows per page across the 10 examined pages. However, this average is often highly inaccurate, because the first and last pages may contain significantly fewer or more records that fall within the specified range.
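
To illustrate how the partially counted boundary pages skew the extrapolation, here is a toy model that follows the description above. All numbers are assumptions chosen for illustration (uniform 300-row pages, a range spanning 100 leaf pages, both boundary pages equally filled), and it adopts the reading that n_rows_on_prev_level excludes the two boundary pages; it is not server code.

// Toy model, not server code: the per-page average is computed over a
// window whose first and last pages are only partially inside the range,
// and the boundary rows are never added back.
#include <cstdio>

int main() {
  const long long rows_per_full_page = 300; // assumed uniform page density
  const long long pages_in_range = 100;     // assumed range width, in leaf pages
  const long long n_pages_read = 10;        // mirrors N_PAGES_READ_LIMIT

  // Reading described in this report: n_rows_on_prev_level excludes the
  // two boundary pages.
  const long long n_rows_on_prev_level = pages_in_range - 2;

  // In-range rows on each boundary page: nearly empty vs. nearly full.
  const long long boundary_options[] = {30, 290};

  for (const long long boundary_rows : boundary_options) {
    const long long true_rows =
        (pages_in_range - 2) * rows_per_full_page + 2 * boundary_rows;

    // n_rows as described above: part of the first page, 9 full pages,
    // part of the last page.
    const long long n_rows = boundary_rows +
        (n_pages_read - 1) * rows_per_full_page + boundary_rows;

    // The extrapolation at line 4997.
    const long long estimate = n_rows_on_prev_level * n_rows / n_pages_read;

    std::printf("boundary rows %3lld: actual %5lld, extrapolated %5lld\n",
                boundary_rows, true_rows, estimate);
  }
  return 0;
}

With sparse boundary pages the extrapolation comes out low (27048 vs. 29460 in this toy model), with dense ones it comes out high (32144 vs. 29980), so the error depends on the boundary pages rather than on the bulk of the range.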

Proposed solution

1. Remove the hardcoded multiplication by 2.

2. In the btr_estimate_n_rows_in_range_on_level() function, for inexact scenarios, the total number of rows can be calculated more accurately if the average number of rows per page is based only on "full" pages, excluding the first and last pages.

Additionally, the formula should account for the fact that n_rows_on_prev_level does not include the borders — it contains the number of "full" pages in the original range and excludes the first and last pages. These pages can simply be added separately to the result.

Here’s an improved calculation:

n_rows_on_examined_full_pages = n_rows - n_rows_on_first_page - n_rows_on_last_page;

examined_full_pages_count = N_PAGES_READ_LIMIT - 1;
// because the first examined page is counted "partially"
// but the last examined page (page 10) is counted entirely

avg_rows_on_full_page = n_rows_on_examined_full_pages / examined_full_pages_count;

n_rows_on_full_pages = n_rows_on_prev_level * avg_rows_on_full_page;
// because n_rows_on_prev_level doesn't include the edges (partially counted pages)

n_rows_total = n_rows_on_full_pages + n_rows_on_first_page + n_rows_on_last_page;

return (n_rows_total);

This algorithm should provide a more accurate estimate of the total number of rows within the specified range.
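
Below is a minimal standalone sketch of this proposed calculation (not a patch against btr0cur.cc; the function signature and the example numbers are illustrative and simply follow the pseudocode above). One detail worth considering: performing the multiplication before the division limits the precision lost to integer truncation.

#include <cstdint>
#include <cstdio>

// Sketch of the proposed estimate: average only the fully examined pages,
// then add the partially counted first and last pages back separately.
// The multiplication is done before the division to limit truncation.
static int64_t estimate_rows(int64_t n_rows,               // rows counted over the examined pages
                             int64_t n_rows_on_first_page, // in-range rows on the first (partial) page
                             int64_t n_rows_on_last_page,  // in-range rows on the last (partial) page
                             int64_t n_rows_on_prev_level, // "full" pages in the range, borders excluded
                             int64_t n_pages_read) {       // corresponds to N_PAGES_READ_LIMIT
  const int64_t n_rows_on_examined_full_pages =
      n_rows - n_rows_on_first_page - n_rows_on_last_page;
  const int64_t examined_full_pages_count = n_pages_read - 1;

  const int64_t n_rows_on_full_pages =
      n_rows_on_prev_level * n_rows_on_examined_full_pages /
      examined_full_pages_count;

  return n_rows_on_full_pages + n_rows_on_first_page + n_rows_on_last_page;
}

int main() {
  // Same toy numbers as in the sketch under "Cause 2": 98 full pages of
  // 300 rows plus two boundary pages with 30 in-range rows each.
  const int64_t estimate = estimate_rows(/*n_rows=*/30 + 9 * 300 + 30,
                                         /*n_rows_on_first_page=*/30,
                                         /*n_rows_on_last_page=*/30,
                                         /*n_rows_on_prev_level=*/98,
                                         /*n_pages_read=*/10);
  std::printf("estimated %lld rows (actual count in the toy model: 29460)\n",
              static_cast<long long>(estimate));
  return 0;
}

On these toy numbers the function returns 29460, matching the actual count, whereas averaging over the partially counted pages gave 27048.
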
[9 Jan 6:18] Alexander Baschuk
visualization of the issue

Attachment: mysql-rows-estimation.png (image/png, text), 1.39 MiB.

[9 Jan 7:04] MySQL Verification Team
Hello Alexander Baschuk,

Thank you for the report and feedback.

regards,
Umesh