MySQL Bugs: #117776: Index statistics is unexpectedly used to estimate eq ranges with null part

Bug #117776	Index statistics is unexpectedly used to estimate eq ranges with null part
Submitted:	24 Mar 5:01	Modified:	24 Mar 7:46
Reporter:	Kaiwang CHen (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	8.0, 8.4, 9.2	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution

Description:
WL#5957 stated that ranges of the form "x IS NULL" will not use 
index statistics because the number of rows with this value are 
likely to be very different than the values in the index
statistics.

Let's name "x IS NULL" as null part in the following discussion.
So WL#5957 refers to any range without null part, and using index
statistics further restricts it to equality ranges. It is exactly
reflected in the precheck function eq_ranges_exceeds_limit().

However, equality ranges with at least one null part can be over
any key, be it unique or non-unique, full or prefix. They were not 
always flagged with NULL_RANGE which assumed only full unique key.
For example, an equality range containing null part over a
non-unique key does not count in the precheck function, but passed
the not-NULL_RANGE test, thus are estimated by index statistics.
Obviously it is not intended.

How to repeat:
Take this patch to reveal the problem:

diff --git a/sql/handler.cc b/sql/handler.cc
index d7d947e481f..6138729a014 100644
--- a/sql/handler.cc
+++ b/sql/handler.cc
@@ -6424,6 +6424,17 @@ ha_rows handler::multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
       if ((range.range_flag & EQ_RANGE) &&
           (keyparts_used = std::popcount(range.start_key.keypart_map)) &&
           table->key_info[keyno].has_records_per_key(keyparts_used - 1)) {
+        DBUG_EXECUTE_IF("crash_index_statistics_null_part", {
+          // It is an equality range, both ends are present, inclusive and same.
+          const uchar *p = range.start_key.key;
+          KEY *k = &table->key_info[keyno];
+          for (uint kp = 0; kp < actual_key_parts(k); kp++) {
+            if (k->key_part[kp].field->is_nullable() && *p) {
+              DBUG_SUICIDE();
+            }
+            p += k->key_part[kp].store_length;
+          }
+        });
         rows = static_cast<ha_rows>(
             table->key_info[keyno].records_per_key(keyparts_used - 1));
       } else {

CREATE TABLE t1 (c1 INT, c2 INT, KEY k12(c1, c2));
INSERT INTO t1 VALUES (1,1), (2,1), (3,1), (4,1), (1,NULL), (NULL,NULL), (NULL,NULL), (NULL,NULL), (NULL,NULL), (NULL,NULL);
ANALYZE TABLE t1;

SET SESSION DEBUG="+d,crash_index_statistics_null_part";

SET eq_range_index_dive_limit = 2;
EXPLAIN SELECT * FROM t1 WHERE c1 = 1 and c2 IS NULL OR c1 = 1 and c2 = 1 OR c1 = 1 and c2 = 2;

Suggested fix:
Restricting NULL_RANGE to equality ranges over a full unique key is
non-sense, it is better redefined for range estimation as equality
range, or even any range, containing at least one null part.

Then an equality range without null part could be expressed as
"((flag & (EQ_RANGE | NULL_RANGE)) == EQ_RANGE)", be it unique or
not, while a real unique range, which by definition produces at
most one row, could be expressed as
"((flag & (UNIQUE_RANGE | NULL_RANGE)) == UNIQUE_RANGE)".

See enclosed in the next comment.

Reinterprete NULL_RANGE for range estimation as the range has at least one null part "kp_i IS NULL" (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug_117776.patch (application/octet-stream, text), 14.17 KiB.

Hello Kaiwang,

Thank you for the report and contribution.

regards,
Umesh

Besides, I found the comment key_range_flags are not consistent in their usage sites.

I guess a range can be classified by three factors

(1) all parts are equality, or not
(2) one or more null parts ("x IS NULL"), or no null part
(3) all the parts are equality and contains a full unique key, or not

I would suggest (1) EQ_RANGE, (2) NULL_RANGE, (3) UNIQUE_RANGE. I guess these flags could be redefined as stated above. EQ_RANGE and NULL_RANGE is primary properties, while UNIQUE_RANGE is a combined property (EQ_RANGE | all parts are over a full unique key). The full unique key property can also be a primary property.

Then,

1. an equality range over a full unique index is ((flag & UNIQUE_RANGE) == UNIQUE_RANGE), or in long form ((flag & (EQ_RANGE | UNIQUE_RANGE) == (EQ_RANGE | UNIQUE_RANGE)). Here UNIQUE_RANGE is a sub flag to EQ_RANGE.
2. a real unique range is ((flag & (UNIQUE_RANGE | NULL_RANGE)) == UNIQUE_RANGE)

3. an equality range is ((flag & EQ_RANGE) == EQ_RANGE)
4. an equality range without null part is ((flag & (EQ_RANGE | NULL_RANGE)) == EQ_RANGE )
5. an equality range with some null part is ((flag & (EQ_RANGE | NULL_RANGE)) == (EQ_RANGE | NULL_RANGE))

6. a non-equality range is ((flag & EQ_RANGE) == 0)
7. a non-equality range without null part is ((flag & (EQ_RANGE | NULL_RANGE)) == 0)
8. a non-equality range with some null part is ((flag & (EQ_RANGE | NULL_RANGE)) == NULL_RANGE)

Then there could be a refactor to introduce some inline functions with clear names to unify the interpretations is needed.

Currently they are

1. UNIQUE_RANGE: a real unique range. With new def, _RANGE removed, it is ((flag & (UNIQUE | NULL)) == UNIQUE)
2. EQ_RANGE: an equality range over a full key, with zero or more null parts. Is it really useful?
3. NULL_RANGE: an equality range over a full unique key, with one or more null part. With new def, it is ((flag & (UNIQUE | NULL)) == (UNIQUE | NULL)).

  /*
    This flag means that index is a unique index, and the interval is
    equivalent to "AND(keypart_i = const_i)", where all of const_i are
    not NULLs.
  */
  UNIQUE_RANGE = 1 << 4,
  /*
    This flag means that the interval is equivalent to
    "AND(keypart_i = const_i)", where not all key parts may be used
    but all of const_i are not NULLs.
  */
  EQ_RANGE = 1 << 5,
  /*
    This flag has the same meaning as UNIQUE_RANGE, except that for at
    least one keypart the condition is "keypart IS NULL".
  */
  NULL_RANGE = 1 << 6,