MySQL Bugs: #115149: Optimizer produces a poor query plan for impossible WHERE condition

Bug #115149	Optimizer produces a poor query plan for impossible WHERE condition
Submitted:	28 May 2024 13:05	Modified:	29 May 2024 12:01
Reporter:	Michael Adelson	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The optimizer issues a full table scan for an impossible WHERE IN clause that does not reference a table.

Gives expected outcome (Impossible WHERE):
```
SELECT COUNT(*) FROM my_table WHERE id IN (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE FALSE);
```

Gives full table scan (bad)
```
SELECT COUNT(*) FROM my_table WHERE id IN (SELECT NULL WHERE FALSE);
```

Gives full table scan (bad)
```
SELECT COUNT(*) FROM my_table WHERE id IN (SELECT NULL FROM DUAL WHERE FALSE);
```

For context, the second (bad) query gets generated by SQL frameworks like Dapper, where you can do:
```
connection.QuerySingle<int>(
    "SELECT COUNT(*) FROM my_table WHERE id IN @ids",
    new { ids = /* some collection of ints, might be empty */ });
```

While in the toy cases shown here it would obviously be better to change the code not to issue the query at all, in real-world examples the IN clause might be just part of a more complex condition (e.g. WHERE name = ... OR id IN (...)).

It would be great if IN (SELECT NULL WHERE FALSE) were properly optimized here so that frameworks like Dapper have something reasonable to generate as a stand-in for an empty in clause.

Query plan visuals shown here: https://dba.stackexchange.com/questions/339787/why-does-mysql-choose-such-a-poor-query-pla...

How to repeat:
Generate a query plan for:

```
SELECT COUNT(*) FROM any_large_table WHERE id IN (SELECT NULL WHERE FALSE);
```

Suggested fix:
This appears to be an optimizer bug. The bad plans DO have an impossible WHERE, but somehow this is not being reduced to an overall `WHERE FALSE` like it is when the WHERE IN clause references a specific non-DUAL table.

I imagine it can be fixed by applying the same logic that works for the latter case to the problematic case.

Hi Mr. Adelson,

Thank you for your bug report.

However, we do not know which product have you tested. All latest releases of all our maintained versions return the same result:

mysql> EXPLAIN SELECT COUNT(*) FROM t1 WHERE id IN (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE FALSE);
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra            |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------+
|  1 | SIMPLE      | NULL  | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | Impossible WHERE |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+------------------+
1 row in set, 1 warning (0.03 sec)

mysql> SHOW WARNINGS;
+-------+------+---------------------------------------------------------------------------+
| Level | Code | Message                                                                   |
+-------+------+---------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select count(0) AS `COUNT(*)` from `test`.`t1` where false |
+-------+------+---------------------------------------------------------------------------+
1 row in set (0.00 sec)

Simply, nothing is scanned as the WHERE is impossible ......

Definitely not a bug.

Did you try one of the queries I marked as bad?

EXPLAIN SELECT COUNT(*) FROM t1 WHERE id IN (SELECT NULL WHERE FALSE);

| id | select_type | table | rows | filtered | Extra
| 1  | PRIMARY     | t1    | 90000 | 100.00  | Using where
| 2  | DEPENDENT SUBQUERY | null  | null     | Impossible WHERE

This query runs quite slowly if t1 is large; it is clearly doing the fullscan. In contrast, the query with `FROM INFORMATION_SCHEMA.TABLES` in the IN clause returns instantly.

Hi Mr. Adelson,

That example that you cited used a dependent nested query that , by rule, has to be run for every row of the tables used in the outer query.

That is standard behaviour.

> That example that you cited used a dependent nested query that , by rule, has to be run for every row of the tables used in the outer query.

Sorry I don't follow. I know that the plan produces a dependent nested query, but that plan is doing extra unnecessary work by my understanding.

These 3 queries are all logically equivalent, but one of them gets a fast plan and the other 2 get slow plans:

```
-- EXAMPLE 1: produces a plan that returns instantly with no table scan
SELECT COUNT(*) FROM t1 WHERE id IN (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE FALSE);

-- EXAMPLE 2: produces a plan that returns slowly after scanning the full table
SELECT COUNT(*) FROM t1 WHERE id IN (SELECT NULL WHERE FALSE);

-- EXAMPLE 3: produces a plan that returns slowly after scanning the full table
SELECT COUNT(*) FROM t1 WHERE id IN (SELECT NULL FROM DUAL WHERE FALSE);
```

I believe that the optimizer should give all 3 of these queries the same fast plan. What am I missing?

HI Mr. Adelson,

A difference is in the table that actually exists.

Problem is that many nested queries could finish in producing impossible condition, or just a condition that returns 0 (zero) rows.

Out of 1000 (one thousand) queries, 1 will be like that. If we would optimise all of the examples, then all 1000 queries would have to pass through those checks. This would finish in the over-optimisation, id est, queries would spend too much time in the optimisation stage and not in the execution.

We must avoid that.

However, in your case, we think that this can make a feature request.

This is now verified as a feature request for 8.0 and higher.

> we think that this can make a feature request.

Thanks! I think this will be valuable because Dapper (the most popular micro-ORM for .NET) uses the pattern `IN (SELECT NULL WHERE 1 = 0)` to handle empty IN clauses and other frameworks that generate SQL might use it as well.

Therefore, optimizing this pattern makes it easier for such frameworks to generate performant queries.