Bug #111007 Incorrect rows estimate
Submitted: 11 May 2023 20:30 Modified: 12 May 2023 12:44
Reporter: Ilya Kantor Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[11 May 2023 20:30] Ilya Kantor
Description:
For a table full of NULLs and the index, optimizer estimates only half of its rows to be NULL.

Here table `a` has a single indexed column `d`, which has 100000 NULLs.

Optimizer estimates only 50000:

mysql>  explain format=tree select * from a where d is null;
+------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (a.d is null)  (cost=5053 rows=50362)
    -> Covering index lookup on a using d (d=NULL)  (cost=5053 rows=50362)
 |
+------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

By the way, NOT NULL works fine:

mysql>  explain format=tree select * from a where d is not null;
+-------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (a.d is not null)  (cost=0.46 rows=1)
    -> Covering index range scan on a using d over (NULL < d)  (cost=0.46 rows=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------+

Is it because of some internal optimizations with indexes and NULLs?

How to repeat:
drop table if exists a;
CREATE TABLE a(d int, index(d));

INSERT INTO a (d)
WITH RECURSIVE seq AS (SELECT 1 as n UNION ALL SELECT 1 + n FROM seq LIMIT 100000)
SELECT null FROM seq;

analyze table a;

-- 50k rows (should be 100k, big mistake)
explain format=tree select * from a where d is null;

-- 1 row  (actually 0, but mistake is small)
explain format=tree select * from a where d is not null;
[12 May 2023 11:55] MySQL Verification Team
Hi Mr. Kantor,

Thank you for your bug report.

However, we have some additional questions. 

In your opinion , what should be the plan for the optimiser ??

Next, in the optimiser stage, a number of dives are made to calculate approximately, and only approximately, distribution of the values within the index.

Currently, it is not possible to get more exact estimate at all.

Hence, this looks much more like a feature request then a bug. As such, if accepted, it would be scheduled for 8.1 or some other versions ..... Would you agree to make this as a feature request ???

We are waiting on your feedback.
[12 May 2023 12:07] MySQL Verification Team
Hi,

After further deliberation, we concluded that this can be accepted as a low priority bug.

Verified as reported.
[12 May 2023 12:23] Ilya Kantor
The error it very big, it's x2 times.

I hardly believe that an index dive can be mistaken so much, because ALL VALUES ARE NULL.

So it's a bug, not a feature.
[12 May 2023 12:30] MySQL Verification Team
Hi Mr. Kantor,

But this report is verified as a bug, not as a feature request. If it were feature request, then field "Severity" would have had a value of S4 !!!!!
[12 May 2023 12:44] Ilya Kantor
Thank you, because such mistake indeed seems to be a bug. 

Because of incorrect row estimates, some queries go awry, with totally wrong plans.

I hope this one will be a simple one to solve.