MySQL Bugs: #112280: Incorrect costing of group by loose index scans

Bug #112280	Incorrect costing of group by loose index scans
Submitted:	6 Sep 2023 20:38	Modified:	7 Sep 2023 4:23
Reporter:	Manuel Ung	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	8.0.34	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In cost_group_min_max, there seems to be a lot of complicated calculations done:

  if (used_key_parts > group_key_parts) {
    // Average number of keys in sub-groups formed by a key infixes
    rec_per_key_t keys_in_subgroup;
    if (index_info->has_records_per_key(used_key_parts - 1))
      // Use index statistics
      keys_in_subgroup = index_info->records_per_key(used_key_parts - 1);
    else {
      // If no index statistics then we use a guessed records per key value.
      keys_in_subgroup = guess_rec_per_key(table, index_info, used_key_parts);
      keys_in_subgroup = std::min(keys_in_subgroup, keys_per_group);
    }

    /*
      Compute the probability that two ends of subgroups are inside
      different blocks. Keys in subgroup need to be increases by the number
      of infix ranges possible.
    */
    keys_in_subgroup = keys_in_subgroup * infix_factor;
    if (keys_in_subgroup >= keys_per_block) /* If a subgroup is bigger than */
      p_overlap = 1.0; /* a block, it will overlap at least two blocks. */
    else {
      double blocks_per_group = (double)num_blocks / (double)num_groups;
      p_overlap = (blocks_per_group * (keys_in_subgroup - 1)) / keys_per_group;
      p_overlap = min(p_overlap, 1.0);
    }
    io_blocks = min<double>(num_groups * (1 + p_overlap), num_blocks);
  } else {
    io_blocks = (keys_per_group > keys_per_block)
                    ? (have_min && have_max) ? (double)(num_groups + 1)
                                             : (double)num_groups
                    : (double)num_blocks;
  }

I don't think the calculations are correct, and it's easier to illustrate with an example:

Let's say we have key(a, b, c) and query SELECT a FROM t WHERE a <= 10 AND b in (1, 2, 3) GROUP BY a
Then used_key_parts=2, group_key_parts=1, infix_factor=3

Note the following problems:

1. It's not clear why we care about ends of subgroups in cost calculation (unless have_min && have_max is true). Using the example query, the execution simply does exactly 3 index reads on (1, 2, 3) for every group, whereas the code implies we may read between 3-6 blocks from this query.

2. It's not clear why we calculate keys_in_subgroup * infix_factor since infix_factor doesn't increase the number keys in your subgroup. For example, if there are 100 keys per (a, b), you're doing 3 lookups within those 100 keys, not looking at 300 keys.

3. The blocks_per_group calculation is wrong. num_groups has already been multiplied by quick_prefix_selectivity (due to the a <= 10 condition), whereas num_blocks is for the whole table. If there are 10000 blocks, 100 total groups and only 1 group matching (a <= 10) then I would expect blocks_per_group to be 10000 / 100 = 100 blocks_per_group. The current calculation gives 10000 / 1 = 10000 blocks_per_group. Similarly, io_blocks = min(num_groups * (...), num_blocks) or io_blocks = .. ? .. : num_blocks is not doing the correct comparison.

4. In the "else" case, it's not clear why io_blocks (num_groups+1) if have_min && have_max is true. If we have both MIN/MAX functions, then the number of lookups double (you have to seek to both beginning and end of range). It doesn't just increase by 1.

5. It's not clear to me why we care about why we only do this for subgroups (can't we also have overlap in the actual groups?).

How to repeat:
Read the code.

Suggested fix:
It's really unclear if the complexity is providing much value. I would simplify this code such that I'm only counting the number of index seeks I have to perform.

The formula would simply be:
num_seeks = num_groups * infix_factor * minmax_factor

eg. If we have 10 groups on a where a<=10 and query SELECT a FROM t WHERE a <= 10 AND b in (1, 2, 3) GROUP BY a

num_seeks = 10 * 3 * 1 = 30

For SELECT a, MIN(c), MAX(c) FROM t WHERE a <= 10 AND b in (1, 2, 3) GROUP BY a
num_seeks = 10 * 3 * 2 = 60

io_blocks is simply min(num_blocks, num_seeks) (where num_blocks is number of blocks matching a<=10 not number of blocks total)

Hello Manuel,

Thank you for the report and detailed analysis.

regards,
Umesh