Bug #49111 | Change in execution plan for count_distinct_group_on_key causes peformance drop | ||
---|---|---|---|
Submitted: | 25 Nov 2009 20:11 | Modified: | 30 Aug 2012 14:32 |
Reporter: | Alexey Stroganov | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S3 (Non-critical) |
Version: | 5.5.0-beta, 5.1 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | cost model, regression |
[25 Nov 2009 20:11]
Alexey Stroganov
[25 Nov 2009 20:12]
Alexey Stroganov
Dataset for issue
Attachment: test.bench1.dmp.gz (application/gzip, text), 74.33 KiB.
[2 Dec 2009 13:57]
Alexey Stroganov
Issue is not observed in 5.1.42pre: 5.1.42pre/MyISAM: mysql> explain select region,count(distinct idn) from bench1 group by region; +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | bench1 | index | NULL | PRIMARY | 5 | NULL | 9640 | Using index | +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ Time for count_distinct_group_on_key (1000:6000): 8.459 wallclock secs 5.1.42pre/InnoDB: mysql> explain select region,count(distinct idn) from bench1 group by region; +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | bench1 | index | NULL | PRIMARY | 5 | NULL | 9864 | Using index | +----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+ Time for count_distinct_group_on_key (1000:6000): 8.483 wallclock secs
[9 Dec 2009 15:06]
Georgi Kodinov
The difference in plan is because of the imprecise innodb statistics : e.g. Innodb returns 10628 rows in the table compared to 10K precise for MyISAM. Besides : the problem is nothing new. On the same data the optimizer chooses loose index scan for InnoDB tables and index scan for MyISAM for the equivalent query (not quite, but close enough : sans the temporary table) : explain select region,count(idn) from (select region,idn from bench1 group by region,idn) x; WL#3220 now just expands the cases where loose index scan is probed to cover certain queries with COUNT/AVG/SUM(DISTINCT) and a GROUP BY.
[9 Dec 2009 15:15]
Alexey Stroganov
Bigger dataset(10000 rows) for the issue
Attachment: bench1.count.10000.dmp.bz2 (application/x-bzip, text), 29.18 KiB.
[9 Dec 2009 15:23]
Ståle Deraas
Please retriage based on Joro's findings. And also please ask someone from optimizer team to provide E/R values.
[9 Dec 2009 16:00]
Alexey Stroganov
Below are execution plans and results of two queries for dataset with 10.000.000 rows: QUERY 1: select region,count(distinct idn) from bench1 group by region --------------------------------------------------------------------- MyISAM: ------- mysql>select region,count(distinct idn) from bench1 group by region; +--------+---------------------+ | region | count(distinct idn) | +--------+---------------------+ | B | 1666667 | | C | 1666667 | | D | 1666667 | | E | 1666667 | | F | 1666666 | | F | 1666666 | +--------+---------------------+ 6 rows in set (4.81 sec) mysql> explain select region,count(distinct idn) from bench1 group by region; +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------------- | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------------- | 1 | SIMPLE | bench1 | range | NULL | PRIMARY | 5 | NULL | 10000001 | Using index for group-by (scanning) | +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------------- row in set (0.00 sec) InnoDB: ------- mysql> select region,count(distinct idn) from bench1 group by region; +--------+---------------------+ | region | count(distinct idn) | +--------+---------------------+ | B | 1666667 | | C | 1666667 | | D | 1666667 | | E | 1666667 | | F | 1666666 | | F | 1666666 | +--------+---------------------+ 6 rows in set (1 min 6.33 sec) explain select region,count(distinct idn) from bench1 group by region; +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+ | 1 | SIMPLE | bench1 | range | NULL | PRIMARY | 5 | NULL | 10002853 | Using index for group-by | +----+-------------+--------+-------+---------------+---------+---------+------+----------+--------------------------+ 1 row in set (0.00 sec) QUERY 2: select region,count(idn) from (select region,idn from bench1 group by region,idn) x ------------------------------------------------------------------------------------------- MyISAM ------ mysql> explain select region,count(idn) from (select region,idn from bench1 group by region,idn) x; +----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+ | 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10000000 | | | 2 | DERIVED | bench1 | index | NULL | PRIMARY | 5 | NULL | 10000000 | Using index | +----+-------------+------------+-------+---------------+---------+---------+------+----------+-------------+ 2 rows in set (4.90 sec) mysql> select region,count(idn) from (select region,idn from bench1 group by region,idn) x; +--------+------------+ | region | count(idn) | +--------+------------+ | A | 10000000 | +--------+------------+ 1 row in set (5.43 sec) InnoDB ------ mysql> explain select region,count(idn) from (select region,idn from bench1 group by region,idn) x; +----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+ | 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10000000 | | | 2 | DERIVED | bench1 | range | NULL | PRIMARY | 5 | NULL | 10002853 | Using index for group-by | +----+-------------+------------+-------+---------------+---------+---------+------+----------+--------------------------+ 2 rows in set (1 min 17.11 sec) mysql> select region,count(idn) from (select region,idn from bench1 group by region,idn) x; +--------+------------+ | region | count(idn) | +--------+------------+ | A | 10000000 | +--------+------------+ 1 row in set (1 min 17.67 sec)
[10 Dec 2009 13:54]
Timour Katchaounov
What surprises me, is the huge difference in performance between the two plans. On the other hand, the last test compares a bit apples to oranges (myisam to innodb). Please try the following, and report the result: Test myisam and innodb with both plans (loose scan, and index scan), e.g. by changing slightly the row count to make the optimizer choose the other plan. Even better with a system variable, but I don't remember if there is any. I would like to compare the same strategy for each engine, and then to check if the change in performance for both engines is equally big. All this should allow us to reason how to modify the cost model of this access method to make it more stable. Statistics will always be imprecise, so we shouldn't try to fix that. Timour
[3 Feb 2010 15:15]
Manyi Lu
According to Evgeny, the fix isn't complex, 1 line added 4 removed. Due to changed cost model it easily could produce regressions like not choosing a better index. He'll check all changed plans in our test suite, but it's not a guarantee that regressions wouldn't occur. Thus risk is high (due to possible regressions), efforts are less than medium.
[27 Apr 2010 17:36]
Jan Steemann
As far as I can see the issue does not have anything to do with MyISAM vs. InnoDB but it due to a "worse" optimizer decision in 5.5.x version. When comparing the execution plan for a query on the same InnoDB table (same structure, same data) between 5.1.33 (built-in InnoDB), 5.5.0-m2 (InnoDB 1.0.5), and 5.5.3-m3 (InnoDB 1.0.6), the plan is the same for 5.1.33 and 5.5.0-m2 but it worsens significantly in 5.5.3-m3. Please see below. For 5.1.33, the execution plan looks ok: mysql> explain SELECT DISTINCTROW id,lang_id FROM page_hits_xxxxxxxx WHERE dt_when>=DATE_SUB(NOW(),INTERVAL 600 SECOND) AND lang_id IN (1); +----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+ | 1 | SIMPLE | page_hits_xxxxxxxx | range | dt_when,lang_id | dt_when | 9 | NULL | 1 | Using where; Using temporary | +----+-------------+--------------------+-------+-----------------+---------+---------+------+------+------------------------------+ 1 row in set (0.01 sec) The plan for 5.5.0-m2 is the same as in 5.1.33, range access is chosen and index "dt_when" is used. However, for 5.5.3-m3, the plan is much worse than for the previous versions, now evaluating 2.5 M rows instead of just 1. Type has changed to "ref" and a different index is used: mysql> explain SELECT DISTINCTROW id,lang_id FROM page_hits_xxxxxxxx WHERE dt_when>=DATE_SUB(NOW(),INTERVAL 600 SECOND) AND lang_id IN (1); +----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+ | 1 | SIMPLE | page_hits_xxxxxxxx | ref | lang_id | lang_id | 4 | const | 2582973 | Using where; Using index; Using temporary | +----+-------------+--------------------+------+---------------+---------+---------+-------+---------+-------------------------------------------+ 1 row in set (0.00 sec) As mentioned before, the underlying table is completely identical (===). To test the differences between the MySQL versions I used the 3 different mysqld versions on the same data directory one after the after. As it still works ok for 5.5.0-m2, the issue seems to have been introduced either between 5.5.0-m2 and 5.5.3-m3, or in InnoDB between 1.0.5 and 1.0.6.
[30 Aug 2012 14:32]
Paul DuBois
Noted in 5.6.7, 5.7.0 changelogs. There was a performance regression for queries that used GROUP BY and COUNT(DISTINCT).