Bug #120021 Add heuristic IN-list sampling (eq_range_index_dive_sample_count/method) to replace statistics fallback
Submitted: 11 Mar 1:41 Modified: 11 Mar 2:09
Reporter: JIHYUN BAHN Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.4.x OS:Any
Assigned to: CPU Architecture:Any
Tags: eq_range_index_dive_limit, IN-list

[11 Mar 1:41] JIHYUN BAHN
Description:
1. FEATURE SUMMARY
------------------
When an IN-list exceeds eq_range_index_dive_limit (default 200), MySQL falls
back to static index statistics for row estimation (EXPLAIN shows type:range).
Benchmarks on identical hardware show this statistics path is consistently
SLOWER than the full-dive path (type:index), and the overhead grows as the
IN-list length increases.

I propose two new SESSION-scoped system variables:

  eq_range_index_dive_sample_count  (uint, default 10, range 0-200)
  eq_range_index_dive_sample_method (ENUM: RANDOM|UNIFORM|FIRST, default RANDOM)

When the IN-list exceeds eq_range_index_dive_limit and sample_count > 0,
MySQL performs actual index dives on only K sampled values, extrapolates the
result to the full list, and returns that estimate — replacing the statistics
fallback entirely. Benchmarks on a patched build confirm up to a 5× improvement in 
query execution time vs. the statistics path on the same hardware.

2. PROBLEM DESCRIPTION
----------------------
MySQL's range optimizer uses eq_range_index_dive_limit (default 200) to decide
whether to perform exact index dives for each IN-list value. When the IN-list
length N exceeds this limit, the optimizer falls back to index statistics
(handler::index_statistics), which sets use_index_statistics = true and
produces a type:range plan.

The problem is that the statistics fallback is not a performance optimization —
it is actually slower than the full-dive path. The reasons are:

  a) Statistics-based estimation triggers a different execution plan
     (type:range vs type:index) that incurs more random I/O per row.
  b) The statistics path reads metadata from the index but still executes
     a full index scan over the matched ranges.
  c) The penalty grows with N: the larger the IN-list, the worse the
     statistics path performs relative to full dives.

This behavior means that users who increase eq_range_index_dive_limit beyond
its default, or who encounter IN-lists that barely exceed the limit, receive
worse performance — not better.

Example query:
  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, 3, ..., N);

With eq_range_index_dive_limit = N-1  (forces statistics): type:range, SLOW.
With eq_range_index_dive_limit = N+1  (forces full dive):   type:index, FAST.

3. BENCHMARK RESULTS
--------------------

--- Scenario D: Statistics vs. Full Dive (Vanilla EC2 only — same hardware) ---

This is the strongest evidence. On the SAME machine, forcing statistics
(limit = N-1) is always slower than full dives (limit = N+1), and the gap
widens as N grows.

SQL (full dive):  SET SESSION eq_range_index_dive_limit = <N+1>;
                  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, ..., N);

SQL (statistics): SET SESSION eq_range_index_dive_limit = <N-1>;
                  SELECT COUNT(*) FROM t1 WHERE val IN (1, 2, ..., N);

+------+----------------+------------------+------------------+
|  N   | Full dive (ms) | Statistics (ms)  | Stats overhead   |
+------+----------------+------------------+------------------+
|  10  |      4,960.6   |      5,736.7     |    +15.7%        |
|  20  |      4,761.4   |      5,567.5     |    +16.9%        |
|  50  |      4,515.7   |      5,430.6     |    +20.3%        |
| 100  |      4,487.4   |      5,653.6     |    +26.0%        |
+------+----------------+------------------+------------------+

Interpretation: The statistics overhead is not constant — it increases with N.
At N=100 the statistics path is 26% slower than full dives on the same machine.

--- Scenario A: sample_count Effect (Patch EC2 only — same hardware) ---

With eq_range_index_dive_limit = 5 and N = 10 (so N > limit → heuristic
fires), we vary K (sample_count). count=0 is the baseline statistics path.

SQL:
  SET SESSION eq_range_index_dive_limit = 5;
  SET SESSION eq_range_index_dive_sample_count = <K>;
  SET SESSION eq_range_index_dive_sample_method = 'RANDOM';
  SELECT COUNT(*) FROM t1 WHERE val IN (1,2,3,4,5,6,7,8,9,10);

+-----+----------+--------------------+
|  K  | avg (ms) | vs count=0 (stats) |
+-----+----------+--------------------+
|   0 |  1,678.0 | baseline (stats)   |
|   1 |  1,489.1 |       -11.3%       |
|   3 |  1,499.4 |       -10.7%       |
|   5 |  1,493.7 |       -10.9%       |
|   7 |  1,520.2 |        -9.4%       |
|  10 |  1,487.1 |       -11.4%       |
|  20 |  1,503.3 |       -10.4%       |
|  50 |  1,494.6 |       -10.9%       |
| 100 |  1,505.1 |       -10.3%       |
| 200 |  1,490.6 |       -11.2%       |
+-----+----------+--------------------+

Key finding: Even K=1 (sampling just 1 out of 10 values) eliminates the
statistics penalty. K=5 gives ~10.9% improvement. The gain plateaus past K=5,
suggesting K=5 is a practical sweet spot.

OPTIMIZER_TRACE confirms: count=0 → index_dives_for_range_access: 0 (stats);
                          count≥1 → index_dives_for_range_access: K (heuristic).

--- Scenario B: sample_method Effect (Patch EC2 only — same hardware) ---

With N=10, limit=5, count=5, we compare the three sampling strategies.

SQL:
  SET SESSION eq_range_index_dive_sample_method = '<METHOD>';
  (all other settings same as Scenario A)

+----------+----------+
| method   | avg (ms) |
+----------+----------+
| RANDOM   |  1,506.3 |
| UNIFORM  |  1,495.2 |
| FIRST    |  1,508.6 |
+----------+----------+

All three methods outperform the statistics fallback (1,678 ms) by ~10-11%.
UNIFORM is marginally best (deterministic, evenly-spaced sample positions).

--- Scenario C: Vanilla vs. Patch Summary (different hardware — for reference) ---

NOTE: These EC2 instances have different hardware. This table is for context
only; do not use it to compare absolute values between Vanilla and Patch.

+-------------------------------+--------+----------+---------------------------+
| Case                          | EC2    | avg (ms) | Notes                     |
+-------------------------------+--------+----------+---------------------------+
| Vanilla full dive (limit=200) | Vanilla| 5,038.5  | N=10 < limit → full dive  |
| Vanilla stats    (limit=5)    | Vanilla| 5,981.1  | N=10 > limit → statistics |
| Patch K=1  (count=1, limit=5) | Patch  | 1,494.9  | heuristic, 1 sample       |
| Patch K=3  (count=3, limit=5) | Patch  | 1,502.3  | heuristic, 3 samples      |
| Patch K=5  (count=5, limit=5) | Patch  | 1,492.6  | heuristic, 5 samples      |
| Patch K=7  (count=7, limit=5) | Patch  | 1,525.7  | heuristic, 7 samples      |
| Patch K=10 (count=10,limit=5) | Patch  | 1,496.5  | full dive (K >= N)        |
+-------------------------------+--------+----------+---------------------------+

On Vanilla, statistics (5,981 ms) is 18.7% slower than full dive (5,039 ms),
consistent with Scenario D findings.

How to repeat:
TEST SETUP
-------------
All benchmarks use the following table and data distribution:

  CREATE TABLE t1 (
    id  INT NOT NULL AUTO_INCREMENT,
    val INT NOT NULL,
    col2 VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_val (val)
  ) ENGINE=InnoDB;

  Rows       : ~10,000,000
  val 1-10   : ~800,000 rows each  (total ~80% of table, high cardinality)
  val 11-500 : ~4,000 rows each    (total ~20% of table)

Two EC2 instances were used:
  - Vanilla EC2 : stock MySQL 8.4, unmodified
  - Patch EC2   : MySQL 8.4 rebuilt RelWithDebInfo with patch

NOTE: The Vanilla and Patch EC2s are different machines, 
both running on AWS EC2 t2.xlarge instances (4 vCPUs, 16 GiB memory). 
Cross-machine comparisons (Scenario C) are provided for reference only. 
Intra-machine comparisons (Scenario A, B, D) are the primary evidence.

Timing method: 3-run average.
  SET @s = NOW(6);
  SELECT COUNT(*) FROM t1 WHERE val IN (...);
  SELECT ROUND(TIMESTAMPDIFF(MICROSECOND, @s, NOW(6)) / 1000, 1);  -- ms
[11 Mar 1:42] JIHYUN BAHN
patch_code_diff

Attachment: patch_code_diff.txt (text/plain), 19.23 KiB.

[11 Mar 2:09] JIHYUN BAHN
See attached file: patch_code_diff.txt

Files changed (8):
  sql/range_optimizer/index_range_scan_plan.cc  (+209 lines)
  sql/range_optimizer/index_range_scan_plan.h   (+ 10 lines)
  sql/range_optimizer/index_skip_scan_plan.h    (+  7 lines)
  sql/range_optimizer/range_opt_param.h         (+ 11 lines)
  sql/range_optimizer/range_optimizer.cc        (+ 43 lines)
  sql/range_optimizer/range_optimizer.h         (+  9 lines)
  sql/sys_vars.cc                               (+ 49 lines)
  sql/system_variables.h                        (+ 13 lines)

Core logic summary:

  1. system_variables.h: Add eq_range_index_dive_sample_count (uint) and
     eq_range_index_dive_sample_method (ulong/enum) to System_variables.
     Add enum enum_index_dive_sample_method {RANDOM=0, UNIFORM=1, FIRST=2}.

  2. sys_vars.cc: Register Sys_var_uint for sample_count (SESSION, 0-200,
     default 10) and Sys_var_enum for sample_method (SESSION, default RANDOM).

  3. range_opt_param.h: Add used_heuristic_sampling (bool) and
     heuristic_sample_rows (ha_rows) to RANGE_OPT_PARAM for diagnostics.

  4. range_optimizer.cc: Add qualifies_for_heuristic_sampling() with guards
     for single-table, FULLTEXT, subquery, and partition thresholds.

  5. index_range_scan_plan.cc: Add heuristic_sample_records_in_range():
       - collect_sel_arg_leaves(): builds vector of SEL_ARG leaf nodes.
       - random_sample_indices(): Fisher-Yates partial shuffle.
       - uniform_sample_indices(): evenly-spaced index selection.
       - setup_key_range_from_sel_arg(): converts SEL_ARG to key_range.
       - Main function: selects K nodes, calls records_in_range() for each,
         extrapolates, returns ha_rows estimate.
     Intercept in check_quick_select(): after eq_ranges_exceeds_limit() sets
     use_index_statistics = true, try heuristic sampling first. On success,
     reset use_index_statistics = false and return the sampled estimate.

  6. index_range_scan_plan.h: Declare heuristic_sample_records_in_range().
  7. range_optimizer.h: Declare qualifies_for_heuristic_sampling().