Bug #50394 Regression in EXPLAIN with index scan, LIMIT, GROUP BY and ORDER BY computed col
Submitted: 17 Jan 2010 4:55 Modified: 22 Nov 2010 0:39
Reporter: Morgan Tocker Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:5.1.41, 5.1, 5.5.99 bzr OS:Any
Assigned to: Martin Hansson CPU Architecture:Any
Tags: explain, GROUP BY, limit, order by, regression
Triage: Triaged: D3 (Medium) / R2 (Low) / E2 (Low)

[17 Jan 2010 4:55] Morgan Tocker
Description:
This relates to this bug:
http://bugs.mysql.com/bug.php?id=50168

In 5.1 the EXPLAIN plan shows a much smaller rows examined count when there:
* Is an index scan
* A group by
* An order by on a computed column
* A LIMIT clause

My test case is a bit more complicated than it could be, but should still be easy enough to comprehend the danger of this change:

How to repeat:
I managed to produce this very easily with the IMDB database - http://www.imdb.com/interfaces#plain (with imdbpy2sql - http://imdbpy.sourceforge.net/) on a query to find the actor who's been cast in the greatest number of movies.

In 5.1.41, here's what we get:

mysql-5141> explain select STRAIGHT_JOIN count(*) as c, person_id from cast_info FORCE INDEX(person_id) inner join title on (cast_info.movie_id=title.id) where title.kind_id = 1 GROUP BY cast_info.person_id ORDER by c DESC LIMIT 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cast_info
         type: index
possible_keys: NULL
          key: person_id
      key_len: 8
          ref: NULL
         rows: 8
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: title
         type: eq_ref
possible_keys: PRIMARY,title_kind_id_exists
          key: PRIMARY
      key_len: 4
          ref: imdb.cast_info.movie_id
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

In reality I can't see how any optimizations are possible to prevent that index scan on cast_info from look at every row in the person_id index.  There are 22187768 rows in the table (a *lot* more than 8).

This is what 5.0.89 shows:

mysql-5089> EXPLAIN select STRAIGHT_JOIN count(*) as c, person_id from cast_info FORCE INDEX(person_id) inner join title on (cast_info.movie_id=title.id) where title.kind_id = 1 GROUP BY cast_info.person_id ORDER by c DESC LIMIT 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cast_info
         type: index
possible_keys: NULL
          key: person_id
      key_len: 8
          ref: NULL
         rows: 22187768
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: title
         type: eq_ref
possible_keys: PRIMARY,title_kind_id_exists
          key: PRIMARY
      key_len: 4
          ref: imdb.cast_info.movie_id
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

Obligatory 5.1 output to show the LIMIT is required to trigger this, and the SHOW CREATE TABLE information:

mysql-5141> EXPLAIN select STRAIGHT_JOIN count(*) as c, person_id from cast_info FORCE INDEX(person_id) inner join title on (cast_info.movie_id=title.id) where title.kind_id = 1 GROUP BY cast_info.person_id ORDER by c DESC\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cast_info
         type: index
possible_keys: NULL
          key: person_id
      key_len: 8
          ref: NULL
         rows: 22187768
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: title
         type: eq_ref
possible_keys: PRIMARY,title_kind_id_exists
          key: PRIMARY
      key_len: 4
          ref: imdb.cast_info.movie_id
         rows: 1
        Extra: Using where
2 rows in set (0.05 sec)

mysql> SHOW CREATE TABLE cast_info\G
*************************** 1. row ***************************
       Table: cast_info
Create Table: CREATE TABLE `cast_info` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `person_id` int(11) NOT NULL,
  `movie_id` int(11) NOT NULL,
  `person_role_id` int(11) DEFAULT NULL,
  `note` text,
  `nr_order` int(11) DEFAULT NULL,
  `role_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_pid` (`person_id`),
  KEY `idx_mid` (`movie_id`),
  KEY `idx_cid` (`person_role_id`),
  KEY `cast_info_role_id_exists` (`role_id`),
  KEY `person_id` (`person_id`,`movie_id`)
) ENGINE=MyISAM AUTO_INCREMENT=22187769 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> SHOW CREATE TABLE title\G
*************************** 1. row ***************************
       Table: title
Create Table: CREATE TABLE `title` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` text NOT NULL,
  `imdb_index` varchar(12) DEFAULT NULL,
  `kind_id` int(11) NOT NULL,
  `production_year` int(11) DEFAULT NULL,
  `imdb_id` int(11) DEFAULT NULL,
  `phonetic_code` varchar(5) DEFAULT NULL,
  `episode_of_id` int(11) DEFAULT NULL,
  `season_nr` int(11) DEFAULT NULL,
  `episode_nr` int(11) DEFAULT NULL,
  `series_years` varchar(49) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_title` (`title`(10)),
  KEY `idx_pcode` (`phonetic_code`),
  KEY `idx_epof` (`episode_of_id`),
  KEY `title_kind_id_exists` (`kind_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1543721 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

Suggested fix:
There is no regression in performance (from what I can tell), just a regression in interpreting diagnostics.  I suspect these sorts of queries are common (find the largest X), so it would be good to have the old behavior back!
[20 Jan 2010 6:21] Sveta Smirnova
Thank you for the report.

Verified as described.

mysql> explain select STRAIGHT_JOIN count(*) as c, person_id from cast_info FORCE INDEX(person_id) inner join title on (cast_info.movie_id=title.id) where title.kind_id = 1 GROUP BY cast_info.person_id ORDER by c DESC LIMIT 1;
+----+-------------+-----------+--------+---------------+-----------+---------+-------------------------+------+----------------------------------------------+
| id | select_type | table     | type   | possible_keys | key       | key_len | ref                     | rows | Extra                                        |
+----+-------------+-----------+--------+---------------+-----------+---------+-------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | cast_info | index  | NULL          | person_id | 8       | NULL                    |    7 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | title     | eq_ref | PRIMARY       | PRIMARY   | 4       | imdb.cast_info.movie_id |    1 | Using where                                  |
+----+-------------+-----------+--------+---------------+-----------+---------+-------------------------+------+----------------------------------------------+
2 rows in set (0.00 sec)

But

mysql> show status like 'ha%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 27    |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 25    |
+----------------------------+-------+
15 rows in set (0.00 sec)

mysql> select STRAIGHT_JOIN count(*) as c, person_id from cast_info FORCE INDEX(person_id) inner join title on (cast_info.movie_id=title.id) where title.kind_id = 1 GROUP BY cast_info.person_id ORDER by c DESC LIMIT 1;
+------+-----------+
| c    | person_id |
+------+-----------+
| 2461 |     99208 |
+------+-----------+
1 row in set (5 min 55.73 sec)

mysql> show status like 'ha%';+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Handler_commit             | 0        |
| Handler_delete             | 0        |
| Handler_discover           | 0        |
| Handler_prepare            | 0        |
| Handler_read_first         | 1        |
| Handler_read_key           | 13890229 |
| Handler_read_next          | 14286456 |
| Handler_read_prev          | 0        |
| Handler_read_rnd           | 0        |
| Handler_read_rnd_next      | 2407004  |
| Handler_rollback           | 0        |
| Handler_savepoint          | 0        |
| Handler_savepoint_rollback | 0        |
| Handler_update             | 0        |
| Handler_write              | 2407001  |
+----------------------------+----------+
15 rows in set (0.00 sec)
[21 Jan 2010 23:09] Omer Barnir
This is not a bug but an improvement.
5.0 shows number of tuples that will be traversed, while 5.1 shows number of tuples that will satisfy a condition
[22 Jan 2010 1:00] Morgan Tocker
Omer, if that's the case, could you point to me where in the incompatible changes this is mentioned? ;)

http://dev.mysql.com/doc/refman/5.1/en/upgrading-from-previous-series.html

People write automated tools to check the output of explain and make important decisions off of that.  I think it's bug.
[22 Jan 2010 1:00] Morgan Tocker
Changing from not a bug to Open.
[22 Jan 2010 1:16] Morgan Tocker
The manual also states rows as:
http://dev.mysql.com/doc/refman/5.1/en/using-explain.html
" The rows column indicates the number of rows MySQL believes it must examine to execute the query. "
[30 May 2010 9:44] Moshe Lampert
Also there are problem with devired tables.

I'm trying to join devired subquery, and explain (and maybe optimizer itself) return wrong results:

explain select  n_title, n_id, n_visible, n_date, n_Short,  c_name, c_id from news, news_cats where  c_id=n_cat and c_website=0  order by n_date desc limit 30;
1	SIMPLE	news	index	DateCat,videocat	Index_6	8		30	
1	SIMPLE	news_cats	eq_ref	PRIMARY	PRIMARY	4	a7.news.n_cat	1	Using where

But this query:

explain select *  from (select  n_title, n_id, n_visible, n_date, n_Short,  c_name, c_id from news, news_cats where C_website=0 and c_id=n_cat   order by n_date desc limit 0,30) as a left join click_log on l_type=0 and l_item=n_id  group by n_id order by n_date desc

1	PRIMARY	<derived2>	ALL					30	Using temporary; Using filesort
1	PRIMARY	click_log	ref	type-item-date-hour,type-date-item,typeday	typeday	2	const	9938	
2	DERIVED	news	index	DateCat,videocat	Index_6	8		88792	
2	DERIVED	news_cats	eq_ref	PRIMARY	PRIMARY	4	a7.news.n_cat	1	Using where
[9 Sep 2010 12:29] Andrii Nikitin
may be related or duplicate of bug #42094
[10 Sep 2010 8:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117975

3505 Martin Hansson	2010-09-10
      Bug #50394: Regression in EXPLAIN with index scan, LIMIT, GROUP BY and
      ORDER BY computed col
      
      GROUP BY implies ORDER BY in the MySQL dialect of SQL. Therefore, when an
      index on the first table in the query is used, and that index satisfies
      ordering according to the GROUP BY clause, the query optimizer estimates the
      number of tuples that need to be read from this index. If there is a LIMIT
      clause, table statistics on tables following this 'sort table' are employed.
      
      There may be a separate ORDER BY clause however, which mandates reading the
      whole 'sort table' anyway. But the previous estimate was left untouched.
      
      Fixed by removing the estimate if GROUP BY is followed by an unresolved ORDER
      BY clause.
[13 Sep 2010 11:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118070

3507 Martin Hansson	2010-09-13
      Bug #50394: Regression in EXPLAIN with index scan, LIMIT, GROUP BY and
      ORDER BY computed col
            
      GROUP BY implies ORDER BY in the MySQL dialect of SQL. Therefore, when an
      index on the first table in the query is used, and that index satisfies
      ordering according to the GROUP BY clause, the query optimizer estimates the
      number of tuples that need to be read from this index. If there is a LIMIT
      clause, table statistics on tables following this 'sort table' are employed.
      
      There may be a separate ORDER BY clause however, which mandates reading the
      whole 'sort table' anyway. But the previous estimate was left untouched.
      
      Fixed by removing the estimate from EXPLAIN output if GROUP BY is used in
      conjunction with an ORDER BY clause that mandates using a temporary table.
[13 Sep 2010 12:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118075

3202 Martin Hansson	2010-09-13 [merge]
      Merge of fix for Bug#50394.
[28 Sep 2010 15:40] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alik@sun.com-20100928153607-tdsxkdm5cmuym5sq) (version source revid:alik@sun.com-20100928153508-0saa6v93dinqx1u7) (merge vers: 5.6.1-m4) (pib:21)
[28 Sep 2010 15:42] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100928153646-pqp8o1a92mxtuj3h) (version source revid:alik@sun.com-20100928153532-lr3gtvnyp2en4y75) (pib:21)
[28 Sep 2010 15:44] Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:alik@sun.com-20100928153459-4nudf4zgzlou4s7q) (version source revid:alik@sun.com-20100928153459-4nudf4zgzlou4s7q) (merge vers: 5.5.7-rc) (pib:21)
[29 Sep 2010 12:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119412

3306 Tor Didriksen	2010-09-29
      Bug#50394 Regression in EXPLAIN with index scan, LIMIT, GROUP BY and ORDER BY computed col
     @ mysql-test/r/join_cache_jcl1.result
        Correct the rows estimate.
[29 Sep 2010 12:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119413

3305 Tor Didriksen	2010-09-29
      Bug#50394 Regression in EXPLAIN with index scan, LIMIT, GROUP BY and ORDER BY computed col
     @ mysql-test/r/join_cache_jcl1.result
        Correct the rows estimate.
[29 Sep 2010 19:47] Paul Dubois
Noted in 5.5.7, 5.6.1 changelogs.

EXPLAIN produced an incorrect rows value for queries evaluated using
an index scan and that included LIMIT, GROUP BY, and ORDER BY on a
computed column.
[30 Sep 2010 14:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119562

3257 Tor Didriksen	2010-09-30
      Bug#50394 Regression in EXPLAIN with index scan, LIMIT, GROUP BY and ORDER BY computed col
[2 Oct 2010 18:15] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[3 Oct 2010 3:06] Morgan Tocker
5.1 is still in 'Active Support', is there any intention to back port this fix?
[7 Oct 2010 23:11] Paul Dubois
Noted in 5.1.52 changelog.
[1 Nov 2010 19:02] Bugs System
Pushed into mysql-5.1 5.1.53 (revid:build@mysql.com-20101101184443-o2olipi8vkaxzsqk) (version source revid:build@mysql.com-20101101184443-o2olipi8vkaxzsqk) (merge vers: 5.1.53) (pib:21)
[13 Nov 2010 16:26] Bugs System
Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)
[13 Nov 2010 16:35] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101113160336-atmtmfb3mzm4pz4i) (version source revid:jimmy.yang@oracle.com-20100804103744-vbpeghipkz6pyc9z) (pib:21)