MySQL Bugs: #49111: Change in execution plan for count_distinct_group_on

Bug #49111	Change in execution plan for count_distinct_group_on_key causes peformance drop
Submitted:	25 Nov 2009 20:11	Modified:	30 Aug 2012 14:32
Reporter:	Alexey Stroganov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.5.0-beta, 5.1	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	cost model, regression

Description:
While testing  of the MySQL 5.5.0/InnoDB with mysql-bench test suite performance degradation was observed for the count_distinct_group_on_key test:

5.1.41/InnoDB_plugin105:
Time for count_distinct_group_on_key (1000:6000): 7.173 wallclock secs 

5.4.3/InnoDB_plugin104:
Time for count_distinct_group_on_key (1000:6000): 7.145 wallclock secs

5.5.0/InnoDB_plugin105:
Time for count_distinct_group_on_key (1000:6000): 29.740 wallclock secs

Explain for the query in 5.5.0/InnoDB was different from 5.1.41/5.4.3 InnoDB:

5.1.41/5.4.3 InnoDB:

mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | bench1 | index | NULL          | PRIMARY | 5       | NULL | 9660 | Using index |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

5.5.0 InnoDB:
mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | bench1 | range | NULL          | PRIMARY | 5       | NULL | 13007 | Using index for group-by |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+--------------------------+

However execution time of the same query for MyISAM engine was notable improved:

5.1.41/MyISAM
Time for count_distinct_group_on_key (1000:6000): 7.862 wallclock secs

5.4.3/MyISAM
Time for count_distinct_group_on_key (1000:6000): 7.576 wallclock secs

5.5.0/MyISAM
Time for count_distinct_group_on_key (1000:6000): 4.588 wallclock secs

Explain for 5.1.41/5.4.3 MyISAM:
mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows  | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------+
|  1 | SIMPLE      | bench1 | index | NULL          | PRIMARY | 5       | NULL | 10000 | Using index |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------+

Explain for 5.5.0 MyISAM:
mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------------------------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows  | Extra                               |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------------------------------+
|  1 | SIMPLE      | bench1 | range | NULL          | PRIMARY | 5       | NULL | 10001 | Using index for group-by (scanning) |
+----+-------------+--------+-------+---------------+---------+---------+------+-------+-------------------------------------+

How to repeat:
1. Start 5.1.41/5.4.3/5.5.0 server 

2. Load attached dump file

3. check explain or execute query for InnoDB and MyISAM :
   select region,count(distinct idn) from bench1 group by region;

Dataset for issue

Attachment: test.bench1.dmp.gz (application/gzip, text), 74.33 KiB.

Issue is not observed in 5.1.42pre:

5.1.42pre/MyISAM:

mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | bench1 | index | NULL          | PRIMARY | 5       | NULL | 9640 | Using index |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

Time for count_distinct_group_on_key (1000:6000): 8.459 wallclock secs 

5.1.42pre/InnoDB:

mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | bench1 | index | NULL          | PRIMARY | 5       | NULL | 9864 | Using index |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+

Time for count_distinct_group_on_key (1000:6000): 8.483 wallclock secs

The difference in plan is because of the imprecise innodb statistics : e.g. Innodb returns 10628 rows in the table compared to 10K precise for MyISAM.

Besides : the problem is nothing new. On the same data the optimizer chooses loose index scan for InnoDB tables and index scan for MyISAM for the equivalent query (not quite, but close enough : sans the temporary table)  :
explain select region,count(idn) from (select region,idn from bench1 group by region,idn) x;

WL#3220 now just expands the cases where loose index scan is probed to cover certain queries with  COUNT/AVG/SUM(DISTINCT) and a GROUP BY.

Bigger dataset(10000 rows) for the issue

Attachment: bench1.count.10000.dmp.bz2 (application/x-bzip, text), 29.18 KiB.

Please retriage based on Joro's findings. And also please ask someone from optimizer team to provide E/R values.

Below are execution plans and results of two queries for dataset with 10.000.000 rows:

QUERY 1: select region,count(distinct idn) from bench1 group by region
---------------------------------------------------------------------

MyISAM:
-------

mysql>select region,count(distinct idn) from bench1 group by region;
+--------+---------------------+
| region | count(distinct idn) |
+--------+---------------------+
| B      |             1666667 |
| C      |             1666667 |
| D      |             1666667 |
| E      |             1666667 |
| F      |             1666666 |
| F      |             1666666 |
+--------+---------------------+
6 rows in set (4.81 sec)

mysql> explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+----------+---------------------------------
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows     | Extra
+----+-------------+--------+-------+---------------+---------+---------+------+----------+---------------------------------
|  1 | SIMPLE      | bench1 | range | NULL          | PRIMARY | 5       | NULL | 10000001 | Using index for group-by (scanning) |
+----+-------------+--------+-------+---------------+---------+---------+------+----------+---------------------------------

 row in set (0.00 sec)

InnoDB:
-------

mysql> select region,count(distinct idn) from bench1 group by region;
+--------+---------------------+
| region | count(distinct idn) |
+--------+---------------------+
| B      |             1666667 |
| C      |             1666667 |
| D      |             1666667 |
| E      |             1666667 |
| F      |             1666666 |
| F      |             1666666 |
+--------+---------------------+
6 rows in set (1 min 6.33 sec)

explain select region,count(distinct idn) from bench1 group by region;
+----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | bench1 | range | NULL          | PRIMARY | 5       | NULL | 10002853 | Using index for group-by |
+----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)

QUERY 2: select region,count(idn) from (select region,idn from bench1 group by region,idn) x
-------------------------------------------------------------------------------------------

MyISAM
------

mysql> explain  select region,count(idn) from (select region,idn from bench1 group by region,idn) x;
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows     | Extra       |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL    | NULL    | NULL | 10000000 |             |
|  2 | DERIVED     | bench1     | index | NULL          | PRIMARY | 5       | NULL | 10000000 | Using index |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+
2 rows in set (4.90 sec)
mysql> select region,count(idn) from (select region,idn from bench1 group by region,idn) x;
+--------+------------+
| region | count(idn) |
+--------+------------+
| A      |   10000000 |
+--------+------------+
1 row in set (5.43 sec)

InnoDB
------

mysql> explain  select region,count(idn) from (select region,idn from bench1 group by region,idn) x;
+----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL    | NULL    | NULL | 10000000 |                          |
|  2 | DERIVED     | bench1     | range | NULL          | PRIMARY | 5       | NULL | 10002853 | Using index for group-by |
+----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+
2 rows in set (1 min 17.11 sec)

mysql> select region,count(idn) from (select region,idn from bench1 group by region,idn) x;
+--------+------------+
| region | count(idn) |
+--------+------------+
| A      |   10000000 |
+--------+------------+
1 row in set (1 min 17.67 sec)

What surprises me, is the huge difference in performance between the two plans.
On the other hand, the last test compares a bit apples to oranges (myisam to
innodb). Please try the following, and report the result:
Test myisam and innodb with both plans (loose scan, and index scan), e.g.
by changing slightly the row count to make the optimizer choose the other
plan. Even better with a system variable, but I don't remember if there is
any.

I would like to compare the same strategy for each engine, and then to check
if the change in performance for both engines is equally big. All this should
allow us to reason how to modify the cost model of this access method to make
it more stable. Statistics will always be imprecise, so we shouldn't try to
fix that.

Timour

According to Evgeny, the fix isn't complex, 1 line added 4 removed. Due to changed cost model it easily could produce regressions like not choosing a better index. He'll check all changed plans in our test suite, but it's not a guarantee that regressions wouldn't occur.
Thus risk is high (due to possible regressions), efforts are less than medium.

As far as I can see the issue does not have anything to do with MyISAM vs. InnoDB but it due to a "worse" optimizer decision in 5.5.x version. 

When comparing the execution plan for a query on the same InnoDB table (same structure, same data) between 5.1.33 (built-in InnoDB), 5.5.0-m2 (InnoDB 1.0.5), and 5.5.3-m3 (InnoDB 1.0.6), the plan is the same for 5.1.33 and 5.5.0-m2 but it worsens significantly in 5.5.3-m3. Please see below.

For 5.1.33, the execution plan looks ok:
mysql> explain SELECT DISTINCTROW id,lang_id FROM page_hits_xxxxxxxx WHERE dt_when>=DATE_SUB(NOW(),INTERVAL 600 SECOND) AND lang_id IN (1);
+----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+
| id | select_type | table              | type  | possible_keys   | key     | key_len | ref  | rows | Extra                        |
+----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+
|  1 | SIMPLE      | page_hits_xxxxxxxx | range | dt_when,lang_id | dt_when | 9       | NULL |    1 | Using where; Using temporary |
+----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+
1 row in set (0.01 sec)

The plan for 5.5.0-m2 is the same as in 5.1.33, range access is chosen and index "dt_when" is used.

However, for 5.5.3-m3, the plan is much worse than for the previous versions, now evaluating 2.5 M rows instead of just 1. Type has changed to "ref" and a different index is used:

mysql> explain SELECT DISTINCTROW id,lang_id FROM page_hits_xxxxxxxx WHERE dt_when>=DATE_SUB(NOW(),INTERVAL 600 SECOND) AND lang_id IN (1);
+----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+
| id | select_type | table              | type | possible_keys | key     | key_len | ref   | rows    | Extra                                     |
+----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+
|  1 | SIMPLE      | page_hits_xxxxxxxx | ref  | lang_id       | lang_id | 4       | const | 2582973 | Using where; Using index; Using temporary |
+----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+
1 row in set (0.00 sec)

As mentioned before, the underlying table is completely identical (===). To test the differences between the MySQL versions I used the 3 different mysqld versions on the same data directory one after the after.

As it still works ok for 5.5.0-m2, the issue seems to have been introduced either between 5.5.0-m2 and 5.5.3-m3, or in InnoDB between 1.0.5 and 1.0.6.

Noted in 5.6.7, 5.7.0 changelogs.

There was a performance regression for queries that used GROUP BY and
COUNT(DISTINCT).