| Bug #120021 | Add heuristic IN-list sampling (eq_range_index_dive_sample_count/method) to replace statistics fallback | | |
|---|---|---|---|
| Submitted: | 11 Mar 1:41 | Modified: | 26 Mar 1:37 |
| Reporter: | JIHYUN BAHN (OCA) | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Optimizer | Severity: | S4 (Feature request) |
| Version: | 8.4.x | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
| Tags: | eq_range_index_dive_limit, IN-list | | |
[11 Mar 1:42]
JIHYUN BAHN
patch_code_diff
Attachment: patch_code_diff.txt (text/plain), 19.23 KiB.
[11 Mar 2:09]
JIHYUN BAHN
See attached file: patch_code_diff.txt
Files changed (8):
sql/range_optimizer/index_range_scan_plan.cc (+209 lines)
sql/range_optimizer/index_range_scan_plan.h  (+10 lines)
sql/range_optimizer/index_skip_scan_plan.h   (+7 lines)
sql/range_optimizer/range_opt_param.h        (+11 lines)
sql/range_optimizer/range_optimizer.cc       (+43 lines)
sql/range_optimizer/range_optimizer.h        (+9 lines)
sql/sys_vars.cc                              (+49 lines)
sql/system_variables.h                       (+13 lines)
Core logic summary:
1. system_variables.h: Add eq_range_index_dive_sample_count (uint) and
eq_range_index_dive_sample_method (ulong/enum) to System_variables.
Add enum enum_index_dive_sample_method {RANDOM=0, UNIFORM=1, FIRST=2}.
2. sys_vars.cc: Register Sys_var_uint for sample_count (SESSION, 0-200,
default 10) and Sys_var_enum for sample_method (SESSION, default RANDOM).
3. range_opt_param.h: Add used_heuristic_sampling (bool) and
heuristic_sample_rows (ha_rows) to RANGE_OPT_PARAM for diagnostics.
4. range_optimizer.cc: Add qualifies_for_heuristic_sampling() with guards
for single-table, FULLTEXT, subquery, and partition thresholds.
5. index_range_scan_plan.cc: Add heuristic_sample_records_in_range():
- collect_sel_arg_leaves(): builds vector of SEL_ARG leaf nodes.
- random_sample_indices(): Fisher-Yates partial shuffle.
- uniform_sample_indices(): evenly-spaced index selection.
- setup_key_range_from_sel_arg(): converts SEL_ARG to key_range.
- Main function: selects K nodes, calls records_in_range() for each,
extrapolates, returns ha_rows estimate.
Intercept in check_quick_select(): after eq_ranges_exceeds_limit() sets
use_index_statistics = true, try heuristic sampling first. On success,
reset use_index_statistics = false and return the sampled estimate.
6. index_range_scan_plan.h: Declare heuristic_sample_records_in_range().
7. range_optimizer.h: Declare qualifies_for_heuristic_sampling().
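The two sampling strategies named in step 5 can be sketched in isolation. This is a simplified standalone illustration, not the submitted patch: the real helpers operate on a vector of SEL_ARG leaf nodes, while the sketch below only selects K positions out of N, and all signatures here are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// RANDOM: Fisher-Yates partial shuffle. Only the first k swap steps are
// performed, so selecting k of n positions costs O(n) setup + O(k) swaps
// and every n-choose-k subset is equally likely.
static std::vector<std::size_t> random_sample_indices(std::size_t n,
                                                      std::size_t k,
                                                      std::mt19937 &rng) {
  std::vector<std::size_t> idx(n);
  for (std::size_t i = 0; i < n; ++i) idx[i] = i;
  for (std::size_t i = 0; i < k && i < n; ++i) {
    std::uniform_int_distribution<std::size_t> pick(i, n - 1);
    std::swap(idx[i], idx[pick(rng)]);
  }
  idx.resize(std::min(k, n));  // keep only the sampled prefix
  return idx;
}

// UNIFORM: deterministic, evenly spaced positions across the list.
static std::vector<std::size_t> uniform_sample_indices(std::size_t n,
                                                       std::size_t k) {
  std::vector<std::size_t> idx;
  k = std::min(k, n);
  for (std::size_t i = 0; i < k; ++i) idx.push_back(i * n / k);
  return idx;
}
```

FIRST would simply take positions 0..k-1, so it needs no helper of its own.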
[11 Mar 14:54]
Øystein Grøvlen
Hi, thank you for your contribution. This looks like a very useful idea. However, before we can look at your patch, you need to upload it via the "Contribution" tab.
From your description it is not clear to me whether the reported timings refer to the statistics look-up time or the execution time of the entire query. Falling back to statistics should not in itself be slower than index dives, but in cases where it leads to a less efficient query plan it may of course reduce performance. Whether this happens or not will depend on the query and the statistical distribution of the column values. From my understanding, your description fails to distinguish between the efficiency and the accuracy of the statistical method.
That said, for a large class of queries your approach will definitely be beneficial, and once you have completed the formalities mentioned above, we will definitely consider it for inclusion in a future release.
[14 Mar 6:53]
JIHYUN BAHN
patch_code_diff file (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: patch_code_diff.txt (text/plain), 19.44 KiB.
[14 Mar 6:55]
JIHYUN BAHN
Hello.
I have uploaded the code to the contributions tab. Regarding the code, I would appreciate it
if you could refer to the contributions tab, which was newly updated today, rather than the comments.
To clarify: the two proposed parameters are not about statistics look-up
overhead. They introduce a heuristic path that calls records_in_range()
on K sampled IN-list values and extrapolates — the same API as a full
index dive, just O(K) instead of O(N). The gain is in query execution
time because a more accurate cardinality estimate produces a better plan.
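The extrapolation step described above amounts to simple linear scaling. A minimal sketch, assuming the caller has already collected the per-value dive results (the function name and signature are hypothetical; the server's records_in_range() returns each per-range estimate):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Scale the K sampled dive results up to the full IN-list of N values:
// estimate = (sum of sampled rows) * N / K, i.e. average rows per
// sampled value times the total number of values.
static std::uint64_t extrapolate_estimate(
    const std::vector<std::uint64_t> &sampled_rows, std::size_t n_total) {
  if (sampled_rows.empty()) return 0;  // no samples: caller falls back
  std::uint64_t sum = std::accumulate(sampled_rows.begin(),
                                      sampled_rows.end(), std::uint64_t{0});
  return sum * n_total / sampled_rows.size();
}
```

For example, dives returning 100, 200, and 300 rows for 3 of 6 values extrapolate to (100+200+300) * 6 / 3 = 1200 rows.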
--- Measured query execution time (same hardware, same dataset) ---
Table: t1 (~10M rows), InnoDB, INDEX(val).
Skewed distribution: val 1–10 covers 80% of rows, val 11–500 covers 20%.
Scenario D — vanilla MySQL, full-dive reference:
Query:
SET SESSION eq_range_index_dive_limit = N+1; -- force full dive
SELECT COUNT(*) FROM t1 WHERE val IN (<N values>);
N (IN-list) | Full dive (ms)
-------------|---------------
10 | 4,960.6
20 | 4,761.4
50 | 4,515.7
100 | 4,487.4
This is the best achievable execution time for each IN-list size —
the target that heuristic sampling aims to approach.
Scenario A — patched MySQL, heuristic sampling vs. no sampling (K=0):
Query:
SET SESSION eq_range_index_dive_limit = 5; -- trigger fallback path
SET SESSION eq_range_index_dive_sample_count = K;
SELECT COUNT(*) FROM t1 WHERE val IN (1,2,3,4,5,6,7,8,9,10);
K (sample_count) | avg_ms | improvement over K=0
-----------------|--------|----------------------
0 | 1,678 | baseline
1 | 1,489 | -11.3%
3 | 1,499 | -10.7%
5 | 1,494 | -10.9% ← plateau
10 | 1,500 | -10.6%
All three methods (RANDOM / UNIFORM / FIRST) at K=5 yield equivalent
improvement (~10–11%). K=5 is the practical sweet spot; the proposed
default of 10 provides a safe margin at negligible extra cost.
--- Parameters proposed ---
eq_range_index_dive_sample_count (uint, default 10, range 0–200)
Number of IN-list values to sample. 0 = feature disabled; behaviour
is identical to current code (fully backward-compatible).
eq_range_index_dive_sample_method (ENUM: RANDOM | UNIFORM | FIRST,
default RANDOM, session-scope)
Sampling strategy. RANDOM covers skewed distributions best; UNIFORM
is deterministic; FIRST is cheapest when the list is already sorted.
Both are session-scoped and HINT_UPDATEABLE, so they can be tuned or
disabled per query without a server restart.
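The documented semantics of the two variables can be condensed into one small decision function. This is an illustrative sketch of the rules stated above (names and types are hypothetical, not the patch's actual interface): the heuristic only fires on the fallback path, count = 0 disables it, and K >= N degenerates into a full dive.

```cpp
#include <cstddef>

struct SamplingDecision {
  bool enabled;             // whether heuristic sampling fires
  std::size_t effective_k;  // number of IN-list values actually dived
};

// Resolve eq_range_index_dive_sample_count against the IN-list length
// and eq_range_index_dive_limit, per the semantics described above.
static SamplingDecision resolve_sampling(std::size_t in_list_len,
                                         std::size_t dive_limit,
                                         std::size_t sample_count) {
  // N <= limit: the normal full-dive path is taken, no heuristic needed.
  if (in_list_len <= dive_limit) return {false, 0};
  // count = 0: feature disabled, identical to current statistics fallback.
  if (sample_count == 0) return {false, 0};
  // K >= N means every value is dived, equivalent to a full dive.
  std::size_t k = sample_count < in_list_len ? sample_count : in_list_len;
  return {true, k};
}
```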
[14 Mar 7:01]
JIHYUN BAHN
eq_range_idx_dive_sample.test : Functional test for the heuristic IN-list sampling feature. Verifies that EXPLAIN row estimates
Attachment: mysql-test.zip (application/x-zip-compressed, text), 5.78 KiB.
[25 Mar 13:43]
Øystein Grøvlen
Posted by developer:
Hi Jihyun Bahn, thank you for signing the OCA and uploading your contribution. As I said, your idea looks promising, and we will definitely consider it for inclusion in a future release. There are a few things you could do to speed up this process. Your patch currently does not include any tests for the feature; if they are not provided, we will need to develop them ourselves. Also, we use doxygen syntax to document functions and variables.
I am not convinced that we need to support different sample methods. Maybe random is sufficient?
We have also received another contribution in the same area, see Bug#117826 for details. It will use histogram statistics instead of index dives. We will need to see how we can integrate both contributions. Ideally, there would be some automatic decision as to whether to use full index dives, sampling, histogram statistics, or index statistics. Relying on system variables is not very user-friendly. However, that could be added later.
I still do not think your performance results prove much. AFAICT from looking at the query, there are basically two possible access methods: index range scan and full table scan. The goal of the optimizer is to pick the access method that is most optimal for the query. If the statistics are inaccurate, it may pick the wrong access method. However, the optimizer's cost model may also be inaccurate, so that the wrong access method is chosen even when the statistics are accurate. In fact, with an inaccurate cost model, an inaccurate row estimate may give a more optimal plan than an accurate row estimate. Without knowing more about the details of the data distribution, it is hard to know the benefit for a particular query. If key values are uniformly distributed, index statistics and sampling will both be as accurate as full index dives. If key values are skewed, sampling will probably be better than index statistics, which are based on the average over all values. So, in practice, it depends ...
Thanks, Øystein Grøvlen MySQL Optimizer Team Lead
[26 Mar 1:37]
JIHYUN BAHN
Hi Øystein Grøvlen, thank you for the detailed review.
Regarding the tests: I had uploaded a test file (eq_range_idx_dive_sample.test) separately on March 14, but I understand it should be bundled with the patch. I will include tests and doxygen-style documentation in the next patch revision.
On the sample method: I agree that simplifying to RANDOM only is a reasonable starting point. In my benchmarks, UNIFORM showed slightly lower error (~8.9% vs ~10.0% for RANDOM in skewed data), but the difference is modest. I am happy to reduce the scope to RANDOM only if that makes the contribution easier to integrate.
Regarding the integration with Bug#117826: I agree that an automatic decision framework (full dive → sampling → histogram → index statistics) is the right long-term direction. I see this contribution as providing one piece of that framework, the sampling path, which can be integrated alongside the histogram approach from Bug#117826 when the time comes.
On the performance results: you raise a fair point about cost model inaccuracy. I agree the benefit is data-distribution-dependent. In my test case (heavily skewed data, 10 IN-list values covering ~80% of the index), the row estimate error dropped from 97.5% (index statistics) to ~10% (sampling with K=5), which was sufficient to change the optimizer's plan choice to a more efficient one. I acknowledge this does not generalize to all cases, and uniform distributions would see little benefit.
Thanks.

Description:
1. FEATURE SUMMARY
------------------
When an IN-list exceeds eq_range_index_dive_limit (default 200), MySQL
falls back to static index statistics for row estimation (EXPLAIN shows
type:range). Benchmarks on identical hardware show this statistics path
is consistently SLOWER than the full-dive path (type:index), and the
overhead grows as the IN-list length increases.

I propose two new SESSION-scoped system variables:
  eq_range_index_dive_sample_count  (uint, default 10, range 0-200)
  eq_range_index_dive_sample_method (ENUM: RANDOM|UNIFORM|FIRST, default RANDOM)

When the IN-list exceeds eq_range_index_dive_limit and sample_count > 0,
MySQL performs actual index dives on only K sampled values, extrapolates
the result to the full list, and returns that estimate, replacing the
statistics fallback entirely. Benchmarks on a patched build confirm up to
a 5x improvement in query execution time vs. the statistics path on the
same hardware.

2. PROBLEM DESCRIPTION
----------------------
MySQL's range optimizer uses eq_range_index_dive_limit (default 200) to
decide whether to perform exact index dives for each IN-list value. When
the IN-list length N exceeds this limit, the optimizer falls back to
index statistics (handler::index_statistics), which sets
use_index_statistics = true and produces a type:range plan.

The problem is that the statistics fallback is not a performance
optimization: it is actually slower than the full-dive path. The
reasons are:
a) Statistics-based estimation triggers a different execution plan
   (type:range vs type:index) that incurs more random I/O per row.
b) The statistics path reads metadata from the index but still executes
   a full index scan over the matched ranges.
c) The penalty grows with N: the larger the IN-list, the worse the
   statistics path performs relative to full dives.
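Why average-based statistics go so wrong on skewed data can be shown with the benchmark's own numbers: ~10M rows, 500 distinct values of `val`, where val 1-10 holds ~800,000 rows each. A worked arithmetic sketch (the function is purely illustrative):

```cpp
// Index statistics know only the average rows per distinct key value,
// so for WHERE val IN (1, ..., 10) -- the ten heavy values -- they
// grossly underestimate. Returns the underestimation error in percent.
static double stats_error_pct() {
  const double total_rows = 10000000.0;               // ~10M-row table
  const double distinct_vals = 500.0;                 // val in 1..500
  const double avg_rows_per_key = total_rows / distinct_vals;  // 20,000
  const double stats_estimate = 10 * avg_rows_per_key;         // 200,000
  const double actual_rows = 10 * 800000.0;                    // 8,000,000
  return 100.0 * (1.0 - stats_estimate / actual_rows);         // 97.5
}
```

This reproduces the 97.5% row-estimate error figure discussed later in the thread; sampling even a few of the heavy values instead of averaging over all 500 keys closes most of this gap.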
This behavior means that users who increase eq_range_index_dive_limit
beyond its default, or who encounter IN-lists that barely exceed the
limit, receive worse performance, not better.

Example query:
  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, 3, ..., N);
With eq_range_index_dive_limit = N-1 (forces statistics): type:range, SLOW.
With eq_range_index_dive_limit = N+1 (forces full dive):  type:index, FAST.

3. BENCHMARK RESULTS
--------------------
--- Scenario D: Statistics vs. Full Dive (Vanilla EC2 only, same hardware) ---
This is the strongest evidence. On the SAME machine, forcing statistics
(limit = N-1) is always slower than full dives (limit = N+1), and the
gap widens as N grows.

SQL (full dive):
  SET SESSION eq_range_index_dive_limit = <N+1>;
  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, ..., N);
SQL (statistics):
  SET SESSION eq_range_index_dive_limit = <N-1>;
  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, ..., N);

+------+----------------+------------------+------------------+
| N    | Full dive (ms) | Statistics (ms)  | Stats overhead   |
+------+----------------+------------------+------------------+
| 10   | 4,960.6        | 5,736.7          | +15.7%           |
| 20   | 4,761.4        | 5,567.5          | +16.9%           |
| 50   | 4,515.7        | 5,430.6          | +20.3%           |
| 100  | 4,487.4        | 5,653.6          | +26.0%           |
+------+----------------+------------------+------------------+

Interpretation: The statistics overhead is not constant; it increases
with N. At N=100 the statistics path is 26% slower than full dives on
the same machine.

--- Scenario A: sample_count Effect (Patch EC2 only, same hardware) ---
With eq_range_index_dive_limit = 5 and N = 10 (so N > limit → heuristic
fires), we vary K (sample_count). count=0 is the baseline statistics
path.
SQL:
  SET SESSION eq_range_index_dive_limit = 5;
  SET SESSION eq_range_index_dive_sample_count = <K>;
  SET SESSION eq_range_index_dive_sample_method = 'RANDOM';
  SELECT COUNT(*) FROM t1 WHERE val IN (1,2,3,4,5,6,7,8,9,10);

+-----+----------+--------------------+
| K   | avg (ms) | vs count=0 (stats) |
+-----+----------+--------------------+
| 0   | 1,678.0  | baseline (stats)   |
| 1   | 1,489.1  | -11.3%             |
| 3   | 1,499.4  | -10.7%             |
| 5   | 1,493.7  | -10.9%             |
| 7   | 1,520.2  | -9.4%              |
| 10  | 1,487.1  | -11.4%             |
| 20  | 1,503.3  | -10.4%             |
| 50  | 1,494.6  | -10.9%             |
| 100 | 1,505.1  | -10.3%             |
| 200 | 1,490.6  | -11.2%             |
+-----+----------+--------------------+

Key finding: Even K=1 (sampling just 1 out of 10 values) eliminates the
statistics penalty. K=5 gives ~10.9% improvement. The gain plateaus past
K=5, suggesting K=5 is a practical sweet spot.
OPTIMIZER_TRACE confirms: count=0 → index_dives_for_range_access: 0
(stats); count>=1 → index_dives_for_range_access: K (heuristic).

--- Scenario B: sample_method Effect (Patch EC2 only, same hardware) ---
With N=10, limit=5, count=5, we compare the three sampling strategies.
SQL:
  SET SESSION eq_range_index_dive_sample_method = '<METHOD>';
  (all other settings same as Scenario A)

+----------+----------+
| method   | avg (ms) |
+----------+----------+
| RANDOM   | 1,506.3  |
| UNIFORM  | 1,495.2  |
| FIRST    | 1,508.6  |
+----------+----------+

All three methods outperform the statistics fallback (1,678 ms) by
~10-11%. UNIFORM is marginally best (deterministic, evenly-spaced sample
positions).

--- Scenario C: Vanilla vs. Patch Summary (different hardware, for reference) ---
NOTE: These EC2 instances have different hardware. This table is for
context only; do not use it to compare absolute values between Vanilla
and Patch.
+-------------------------------+--------+----------+---------------------------+
| Case                          | EC2    | avg (ms) | Notes                     |
+-------------------------------+--------+----------+---------------------------+
| Vanilla full dive (limit=200) | Vanilla| 5,038.5  | N=10 < limit → full dive  |
| Vanilla stats (limit=5)       | Vanilla| 5,981.1  | N=10 > limit → statistics |
| Patch K=1 (count=1, limit=5)  | Patch  | 1,494.9  | heuristic, 1 sample       |
| Patch K=3 (count=3, limit=5)  | Patch  | 1,502.3  | heuristic, 3 samples      |
| Patch K=5 (count=5, limit=5)  | Patch  | 1,492.6  | heuristic, 5 samples      |
| Patch K=7 (count=7, limit=5)  | Patch  | 1,525.7  | heuristic, 7 samples      |
| Patch K=10 (count=10,limit=5) | Patch  | 1,496.5  | full dive (K >= N)        |
+-------------------------------+--------+----------+---------------------------+

On Vanilla, statistics (5,981 ms) is 18.7% slower than full dive
(5,039 ms), consistent with the Scenario D findings.

How to repeat:
TEST SETUP
----------
All benchmarks use the following table and data distribution:

CREATE TABLE t1 (
  id INT NOT NULL AUTO_INCREMENT,
  val INT NOT NULL,
  col2 VARCHAR(100),
  PRIMARY KEY (id),
  KEY idx_val (val)
) ENGINE=InnoDB;

Rows       : ~10,000,000
val 1-10   : ~800,000 rows each (total ~80% of table; heavily skewed)
val 11-500 : ~4,000 rows each (total ~20% of table)

Two EC2 instances were used:
- Vanilla EC2 : stock MySQL 8.4, unmodified
- Patch EC2   : MySQL 8.4 rebuilt RelWithDebInfo with patch
NOTE: The Vanilla and Patch EC2s are different machines, both running on
AWS EC2 t2.xlarge instances (4 vCPUs, 16 GiB memory). Cross-machine
comparisons (Scenario C) are provided for reference only. Intra-machine
comparisons (Scenarios A, B, D) are the primary evidence.

Timing method: 3-run average.
  SET @s = NOW(6);
  SELECT COUNT(*) FROM t1 WHERE val IN (...);
  SELECT ROUND(TIMESTAMPDIFF(MICROSECOND, @s, NOW(6)) / 1000, 1); -- ms