MySQL Bugs: #15198: Dropping then adding an index changes group by optimization behaviour

Bug #15198	Dropping then adding an index changes group by optimization behaviour
Submitted:	23 Nov 2005 19:58	Modified:	24 Mar 2006 7:52
Reporter:	Jaime Pinheiro	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.0.16-nt	OS:	Windows (Windows 2000)
Assigned to:	Georgi Kodinov	CPU Architecture:	Any

Description:
According to MySQL manual (section 7.2.13 How MySQL Optimizes GROUP BY) the following query should be handled optimally when there is an index on column "u":

select u from test_groupby group by u;

We have a very large table and we need to retrieve all distinct values for "u" and the query is not "Using index for group-by". But if we drop the index and then add it again, it works as stated in the manual and the query runs in the time we need.

The questions are
- How can that change the optimization strategy?
- Is that a bug or a feature?  
- If that's feature, how can we make sure MySQL will use index for group-by in such a situation?

Below is a sample table with only 4 rows to describe the odd behaviour. Notice "Using index for group-by" in the last explain output:

How to repeat:
mysql> select version();
+-----------+
| version() |
+-----------+
| 5.0.16-nt |
+-----------+
1 row in set (0.00 sec)

mysql> CREATE TABLE `test_groupby` (
    ->   `id` int(11) NOT NULL auto_increment,
    ->   `d` datetime NOT NULL default '0000-00-00 00:00:00',
    ->   `d2` date NOT NULL default '0000-00-00',
    ->   `h` time NOT NULL default '00:00:00',
    ->   `c` int(11) NOT NULL default '0',
    ->   `c2` varchar(150) default NULL,
    ->   `u` varchar(255) default NULL,
    ->   `u2` varchar(15) default NULL,
    ->   `u3` varchar(5) default NULL,
    ->   `d3` varchar(255) default NULL,
    ->   `d4` varchar(5) default NULL,
    ->   `t` float default NULL,
    ->   `p` text,
    ->   `b` float default NULL,
    ->   `p2` int(11) default NULL,
    ->   `c3` int(11) default NULL,
    ->   PRIMARY KEY  (`id`),
    ->   KEY `d2` (`d2`),
    ->   KEY `u` (`u`),
    ->   KEY `c` (`c`),
    ->   KEY `h` (`h`),
    ->   KEY `c3` (`c3`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=0 ;
Query OK, 0 rows affected (0.16 sec)

mysql> INSERT INTO `test_groupby` VALUES (333937, '2003-10-21 11:09:35', '2004-1
0-01', '08:15:40', 8, '68987', 'TTTTT', '10.0.0.1', '11467', 'XXXXX', NULL, NULL
, 'sample text', NULL, 0, 9216);
Query OK, 1 row affected (0.05 sec)

mysql> INSERT INTO `test_groupby` VALUES (333938, '2003-10-21 11:09:36', '2004-1
0-01', '08:15:47', 4, '0', 'TTTTT', '10.0.0.1', '11467', 'XXXXX', NULL, NULL, 's
ample text', NULL, 0, 9216);
Query OK, 1 row affected (0.06 sec)

mysql> INSERT INTO `test_groupby` VALUES (333939, '2003-10-21 11:09:36', '2004-1
0-01', '08:15:54', 4, '1', 'TTTTT', 'XXXXX', NULL, 'XXXXX', NULL, NULL, 'sample
text', NULL, NULL, 9216);
Query OK, 1 row affected (0.05 sec)

mysql> INSERT INTO `test_groupby` VALUES (333940, '2005-11-21 11:09:37', '2004-1
0-01', '08:21:31', 4, '0', 'TTTTT', '10.0.0.1', '30786', 'XXXXX', NULL, NULL, 's
ample text', NULL, 0, 9219);
Query OK, 1 row affected (0.03 sec)

mysql> explain select u from test_groupby group by u;
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+-------------+
| id | select_type | table        | type  | possible_keys | key  | key_len | ref
  | rows | Extra       |
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+-------------+
|  1 | SIMPLE      | test_groupby | index | NULL          | u    | 258     | NUL
L |    4 | Using index |
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+-------------+
1 row in set (0.00 sec)

mysql> alter table test_groupby drop index u;
Query OK, 4 rows affected (0.34 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> alter table test_groupby add index u ( u );
Query OK, 4 rows affected (0.33 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> explain select u from test_groupby group by u;
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+--------------------------+
| id | select_type | table        | type  | possible_keys | key  | key_len | ref
  | rows | Extra                    |
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+--------------------------+
|  1 | SIMPLE      | test_groupby | range | NULL          | u    | 258     | NUL
L |    3 | Using index for group-by |
+----+-------------+--------------+-------+---------------+------+---------+----
--+------+--------------------------+
1 row in set (0.00 sec)

Hi!

This may be due to the fact that MySQL updates index statistics in the ALTER.

But clearly the GROUP BY should always use the index, regardless of statistics. This is probably an optimizer bug.

Regards,

Heikki

Thanks for your bug report.  I was able to reproduce this under FC4 using a fresh 5.0 pull

While playing with the stuff in this [1] thread in the optimizer forum, I've noticed the same problem in a slightly different context, but I guess it belongs into this bug report.

In a subquery, the optimizer seems to never choose "Using index for group-by" but only "Using index", which slows down the optimizer a lot, while the query itself is as fast as expected (ie. time for the query and for the "explain" are the same). 

How to reproduce:
mysql> create table foo (a int, index a (a));
Query OK, 0 rows affected (0.01 sec)

mysql> insert into foo values (5),(10),(15),(15),(15),(15),(15),(15),(15),(15),(15),(15),(15),(15),(15),(15),(20),(20),(25);
Query OK, 19 rows affected (0.00 sec)
Records: 19  Duplicates: 0  Warnings: 0

mysql> explain select distinct a from foo;
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|  1 | SIMPLE      | foo   | range | NULL          | a    | 5       | NULL |   10 | Using index for group-by |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

mysql> explain select * from (select distinct a from foo) bar;
+----+-------------+------------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table      | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+------------+-------+---------------+------+---------+------+------+-------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL | NULL    | NULL |    5 |             |
|  2 | DERIVED     | foo        | index | NULL          | a    | 5       | NULL |   19 | Using index |
+----+-------------+------------+-------+---------------+------+---------+------+------+-------------+
2 rows in set (0.00 sec)

If the table becomes large, the second explain will take quite some time (over 14 seconds here for a 1GB test table with some additional fields). This subquery in particular is a quite interesting thing, as you can use it to "simulate" Oracle's index skip scan, if it works fast (see forum thread for some details on that).

While writing this comment, I noticed one thing being even worse:
mysql> explain select distinct a from foo limit 3;
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | foo   | index | NULL          | a    | 5       | NULL |   10 | Using index; Using temporary |
+----+-------------+-------+-------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)

I guess this could also use "Using index for group-by".

[1] http://forums.mysql.com/read.php?115,74686,77681#msg-77681

The problem looks like it is related to InnoDB statistics. 
I have performed several tests and have (somewhat) extended the original sql script.

First of all Heikki is right in saying that the "Using index for group-by" method is applicable and it's indeed tried in both cases, but it is ruled out because of difference in the statistics values coming from InnoDB.

I've compared the statistics that are passed to the cost_group_min_max() function (sql/opt_range.cc:7755), namely index_info->rec_per_key[group_key_parts - 1].
For the first select (the one without the dropped/recreated index) it is 1 whereas for the second select it is 2. This makes huge difference in the cost based on the fact that the record count (table->file->records) is 4 both times.
Note that if I restart the server and just reissue the selects I get the same number (2) for index_info->rec_per_key[group_key_parts - 1] in both cases.

I have then added some more rows to the table by doing :

delimiter |
create procedure a(start int, nrecs int)
begin
  while (nrecs > 0) do
    INSERT INTO t1 VALUES (start, '2003-10-21 11:09:35', '2004-10-01',
    '08:15:40', 8, '68987', 'TTTTT', '10.0.0.1', '11467', 'XXXXX', NULL, NULL,
    'sample text', NULL, 0, 9216);
    set nrecs = nrecs - 1;
    set start = start + 1;
  end while;
end|
delimiter ;
call a(333937, 10000);

instead of the four inserts that are done in the original report to get a better readout.

Now I get index_info->rec_per_key[group_key_parts - 1] = 1 and table->file->records = 9540 in the wrong case and index_info->rec_per_key[group_key_parts - 1] = 5118 and table->file->records = 10237 in the right case.
I also get two cases compile identically ("Using index for group-by") when I restart the server.

Then I've changed the storage engine for the table from InnoDB to MyISAM and the two selects compiled identically.

Instead of droping and recreating the index I've now tried 
ANALYZE TABLE test_groupby

This also fixes the problem. 
Please see the section 13.5.2.1. of the MySQL Reference manual (http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html) where this command is documented.