Bug #29845 GROUP BY does not make best use of indicies when evaluating SUM aggregates
Submitted: 17 Jul 2007 13:10 Modified: 13 Nov 2007 17:22
Reporter: Chris Ricks Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:6.0.0-alpha OS:Any
Assigned to: CPU Architecture:Any
Tags: count, GROUP BY, INDEX

[17 Jul 2007 13:10] Chris Ricks
Description:
At present, MySQL is only able to utilise indicies to accelerate the completion of MIN and MAX aggregates over a single table, as referenced in http://dev.mysql.com/doc/refman/5.1/en/loose-index-scan.html. 

At present, I am doing some reporting and DB performance evaluations for a client involving a table containing 106807643 rows, with the following schema:

CREATE TABLE `acct_full` (
  `agent_id` int(2) unsigned NOT NULL,
  `mac_src` char(17) NOT NULL,
  `mac_dst` char(17) NOT NULL,
  `vlan` int(2) unsigned NOT NULL,
  `ip_src` char(15) NOT NULL,
  `ip_dst` char(15) NOT NULL,
  `src_port` int(2) unsigned NOT NULL,
  `dst_port` int(2) unsigned NOT NULL,
  `ip_proto` char(6) NOT NULL,
  `tos` int(4) unsigned NOT NULL,
  `packets` int(10) unsigned NOT NULL,
  `bytes` bigint(20) unsigned NOT NULL,
  `flows` int(10) unsigned NOT NULL,
  `stamp_inserted` datetime NOT NULL,
  `stamp_updated` datetime DEFAULT NULL,
  KEY `stamp_inserted` (`stamp_inserted`,`ip_dst`,`dst_port`,`ip_src`,`src_port`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

For my purposes, I'm looking to roll up some data so that the key referenced above can be a primary key. To do this, I need a list of duplicate values for the fields referenced in the key as produced by the following query:

create table repeated_values select `stamp_inserted`,`ip_dst`,`dst_port`,`ip_src` from acct_full group by `stamp_inserted`,`ip_dst`,`dst_port`,`ip_src`,`src_port` having count(*) > 1;

Although the index contains all required information to evaluate the result set quickly, MySQL does not currently do so, resorting to a full index scan:

mysql> explain select `stamp_inserted`,`ip_dst`,`dst_port`,`ip_src`,`src_port` from acct_full group by `stamp_inserted`,`ip_dst`,`dst_port`,`ip_src`,`src_port` having count(*) > 1;
+----+-------------+-----------+-------+---------------+----------------+---------+------+-----------+-------------+
| id | select_type | table     | type  | possible_keys | key            | key_len | ref  | rows      | Extra       |
+----+-------------+-----------+-------+---------------+----------------+---------+------+-----------+-------------+
|  1 | SIMPLE      | acct_full | index | NULL          | stamp_inserted | 46      | NULL | 106807643 | Using index |
+----+-------------+-----------+-------+---------------+----------------+---------+------+-----------+-------------+
1 row in set (0.01 sec)

The same result is observed by either including a smaller left-subset of the key for grouping (which should result in less of the index needing to be visited) or extending the key to include more fields.

How to repeat:
Create table as above, populate with some data and execute included queries.

Suggested fix:
Optimise index usage when evaluating COUNT aggregates.
[13 Nov 2007 17:22] Valeriy Kravchuk
Thank you for a reasonable feature request.