MySQL Bugs: #20111: named columns used in subqueries hinder optimisation of LIKE, BETWEEN, range

Bug #20111	named columns used in subqueries hinder optimisation of LIKE, BETWEEN, range
Submitted:	28 May 2006 10:57	Modified:	24 Jan 2014 13:26
Reporter:	Daniel Treplin	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	4.1.19-log	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any

Description:
If I use a named column from an outer query to feed a LIKE within the inner query, no index is used even when the value does not start with a wildcard. The reason seems to me that the optimiser does not treat the named column as a constant. (bug in constant propagation phase?)

It even does not help to rewrite the query with a range condition. mysql then recognises t1_c1 as possible key, but the query executes even a little slower than the one using like. (the index is not used). When using "between", things get even worse.

In my example, the query with 31 matches in the outer query takes 4,7 seconds in total. With distinct programming using 32 single queries, total runtime is < 0.5 seconds, including all the overhead. 

How to repeat:
mysql> explain select t1.c1, concat(t1.c1,'.%') as thread,
(select count(*) from t1 where c1 like thread)
from t1 where t1.c1 like 'a.b.%' and t1.c1 regexp 'a\.b\.[^\.]*$' order by t1.c1;
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
| id | select_type        | table | type  | possible_keys | key   | key_len | ref  | rows  | Extra                       |
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
|  1 | PRIMARY            | t1    | range | t1_c1         | t1_c1 |     255 | NULL |  1293 | Using where; Using filesort |
|  2 | DEPENDENT SUBQUERY | t1    | index | NULL          | t1_c1 |     255 | NULL | 72346 | Using where; Using index    |
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
2 rows in set (0.00 sec)

rewritten with range condition (does also not work)

mysql> explain select t1.c1, concat(t1.c1,'.') as thread_start, concat(t1.c1,'/') as thread_end,
(select count(*) from t1 where c1 > thread_start and c1 < thread_end)
from t1 where t1.c1 like 'a.b.%' and t1.c1 regexp 'a\.b\.[^\.]*$' order by t1.c1;
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
| id | select_type        | table | type  | possible_keys | key   | key_len | ref  | rows  | Extra                       |
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
|  1 | PRIMARY            | t1    | range | t1_c1         | t1_c1 |     255 | NULL |  1293 | Using where; Using filesort |
|  2 | DEPENDENT SUBQUERY | t1    | index | t1_c1         | t1_c1 |     255 | NULL | 72346 | Using where; Using index     |
+----+--------------------+-------+-------+---------------+-------+---------+------+-------+-----------------------------+
2 rows in set (0.00 sec)

Suggested fix:
Check the way subqueries are passed and therefore are treated different from equivalent queries that are given directly to the optimiser.

Further testing showed that the problem (not using the key, but scanning the whole index file) occurs with any range condition within a subquery, no matter how the subquery is written (named colums or not, like, between, gt/lt, whatever ), if the range boundaries are passed from the outer query.

Thank you for a problem report. Why do you think that the index is not used? It should be used, according to EXPLAIN results... Have you checked with SHOW STATUS?

Hi, thanks for your answer. Please take a close look to the explain values of the dependent subquery of example 1. It says "possible keys: NULL", no key usage. Please also take a look to example 2, dependent subquery, column "type". It says "index", but correcly optimised, it should read "range". 

In both cases, type reads "index" what means full scan of the index file, but no key lookup, what results in a speed difference in order of magnitude. When I do the subquery part by hand as simple query, the query behaves as it should.  (watch the type: it reads "range"):

mysql> explain select count(*) from t1 where ind like "a.b.%";
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
|  1 | SIMPLE      | t1    | range | t1_c1         | t1_c1 |     255 | NULL |  175 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
1 row in set (0.00 sec) 

All this leads me to the conclusion, that there is a problem in the constant propagation phase of the query optimiser when handling subqueries. To me, it looks like it does not recognise the values referencing to the outer query as constants, as, in my opinion, it should, because they are constant within a dependent subquery like this. But if they are not propagated as constants, mysql, as documented in many places, will do to a scan instead of a key lookup for all sorts of range conditions, including like.

Sorry, I forgot a correction when replacing my real field names with abstract values. The example query should read:

explain select count(*) from t1 where c1 like "a.b.%";

These (as well as many other) problems with DEPENDENT subqueries optimisation will be fixed in MySQL 5.2.

What's the status of this bug? Have the performance issues been resolved?

Roy: 
This problem will not be fixed by subquery enhancements added to 5.6.
First, it is a scalar subquery which is not covered by the optimizations.
However, I think that the scalar subquery can be rewritten to an IN
subquery which makes it possible for the optimizer to evaluate the query
faster.
But still, the desired index is not selected by the optimizer.

Rewrite applied:
 and (select url from ... limit 1) is not null
-->
 and 1 IN (select 1 from ...)