MySQL Bugs: #111007: Incorrect rows estimate

Bug #111007	Incorrect rows estimate
Submitted:	11 May 2023 20:30	Modified:	12 May 2023 12:44
Reporter:	Ilya Kantor	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
For a table full of NULLs and the index, optimizer estimates only half of its rows to be NULL.

Here table `a` has a single indexed column `d`, which has 100000 NULLs.

Optimizer estimates only 50000:

mysql>  explain format=tree select * from a where d is null;
+------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (a.d is null)  (cost=5053 rows=50362)
    -> Covering index lookup on a using d (d=NULL)  (cost=5053 rows=50362)
 |
+------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

By the way, NOT NULL works fine:

mysql>  explain format=tree select * from a where d is not null;
+-------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (a.d is not null)  (cost=0.46 rows=1)
    -> Covering index range scan on a using d over (NULL < d)  (cost=0.46 rows=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------+

Is it because of some internal optimizations with indexes and NULLs?

How to repeat:
drop table if exists a;
CREATE TABLE a(d int, index(d));

INSERT INTO a (d)
WITH RECURSIVE seq AS (SELECT 1 as n UNION ALL SELECT 1 + n FROM seq LIMIT 100000)
SELECT null FROM seq;

analyze table a;

-- 50k rows (should be 100k, big mistake)
explain format=tree select * from a where d is null;

-- 1 row  (actually 0, but mistake is small)
explain format=tree select * from a where d is not null;

Hi Mr. Kantor,

Thank you for your bug report.

However, we have some additional questions. 

In your opinion , what should be the plan for the optimiser ??

Next, in the optimiser stage, a number of dives are made to calculate approximately, and only approximately, distribution of the values within the index.

Currently, it is not possible to get more exact estimate at all.

Hence, this looks much more like a feature request then a bug. As such, if accepted, it would be scheduled for 8.1 or some other versions ..... Would you agree to make this as a feature request ???

We are waiting on your feedback.

Hi,

After further deliberation, we concluded that this can be accepted as a low priority bug.

Verified as reported.

The error it very big, it's x2 times.

I hardly believe that an index dive can be mistaken so much, because ALL VALUES ARE NULL.

So it's a bug, not a feature.

Hi Mr. Kantor,

But this report is verified as a bug, not as a feature request. If it were feature request, then field "Severity" would have had a value of S4 !!!!!

Thank you, because such mistake indeed seems to be a bug. 

Because of incorrect row estimates, some queries go awry, with totally wrong plans.

I hope this one will be a simple one to solve.