MySQL Bugs: #91386: Index for group-by is not used with primary key for SELECT COUNT(DISTINCT a)

Bug #91386	Index for group-by is not used with primary key for SELECT COUNT(DISTINCT a)
Submitted:	24 Jun 2018 0:30	Modified:	29 Jun 2018 12:23
Reporter:	monty solomon	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	5.7.18	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
InnoDB is unable to use the first column of the primary key for index for group-by. It needs a secondary index starting with the same column.

SELECT COUNT(DISTINCT a) does not perform well when a is the first column in the primary key and not the first column of any secondary index. Adding a secondary index on a causes the optimizer to choose index for group-by and the query performs much better.

mysql> select count(distinct a) from tt\G
*************************** 1. row ***************************
count(distinct a): 5
1 row in set (2.83 sec)

mysql> explain select count(distinct a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: PRIMARY,d
          key: d
      key_len: 4
          ref: NULL
         rows: 10462320
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

mysql> alter table tt add index (a);
Query OK, 0 rows affected (24.76 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql>  select count(distinct a) from tt\G
*************************** 1. row ***************************
count(distinct a): 5
1 row in set (0.00 sec)

mysql> explain select count(distinct a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: range
possible_keys: PRIMARY,d,a
          key: a
      key_len: 4
          ref: NULL
         rows: 2525
     filtered: 100.00
        Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)

How to repeat:
Here are the steps so you can copy/paste them

USE test;
CREATE TABLE tt (a int, b int, c int) ENGINE=InnoDB; 

insert into tt values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5);

insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;
insert into tt select * from tt;

alter table tt add d int unsigned auto_increment, add index(d), add primary key(a,b,c,d);
select count(distinct a) from tt\G
explain select count(distinct a) from tt\G

alter table tt add index (a);
select count(distinct a) from tt\G
explain select count(distinct a) from tt\G

Suggested fix:
Use index for group-by with PRIMARY KE for SELECT COUNT(DISTINCT a)

Here are the steps with their associated output (show warnings is enabled)

 
--------------
CREATE TABLE tt (a int, b int, c int) ENGINE=InnoDB
--------------

Query OK, 0 rows affected (0.03 sec)

--------------
insert into tt values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5)
--------------

Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 10 rows affected (0.00 sec)
Records: 10  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 20 rows affected (0.00 sec)
Records: 20  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 40 rows affected (0.00 sec)
Records: 40  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 80 rows affected (0.01 sec)
Records: 80  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 160 rows affected (0.00 sec)
Records: 160  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 320 rows affected (0.00 sec)
Records: 320  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 640 rows affected (0.01 sec)
Records: 640  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 1280 rows affected (0.01 sec)
Records: 1280  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 2560 rows affected (0.02 sec)
Records: 2560  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 5120 rows affected (0.03 sec)
Records: 5120  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 10240 rows affected (0.13 sec)
Records: 10240  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 20480 rows affected (0.12 sec)
Records: 20480  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 40960 rows affected (0.25 sec)
Records: 40960  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 81920 rows affected (0.50 sec)
Records: 81920  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 163840 rows affected (0.97 sec)
Records: 163840  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 327680 rows affected (1.95 sec)
Records: 327680  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 655360 rows affected (4.05 sec)
Records: 655360  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 1310720 rows affected (9.55 sec)
Records: 1310720  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 2621440 rows affected (18.22 sec)
Records: 2621440  Duplicates: 0  Warnings: 0

--------------
insert into tt select * from tt
--------------

Query OK, 5242880 rows affected (38.36 sec)
Records: 5242880  Duplicates: 0  Warnings: 0

--------------
alter table tt add d int unsigned auto_increment, add index(d), add primary key(a,b,c,d)
--------------

Query OK, 0 rows affected (46.96 sec)
Records: 0  Duplicates: 0  Warnings: 0

--------------
select count(distinct a) from tt
--------------

*************************** 1. row ***************************
count(distinct a): 5
1 row in set (2.60 sec)

--------------
explain select count(distinct a) from tt
--------------

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: PRIMARY,d
          key: d
      key_len: 4
          ref: NULL
         rows: 10462320
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select count(distinct `test`.`tt`.`a`) AS `count(distinct a)` from `test`.`tt`
--------------
alter table tt add index (a)
--------------

Query OK, 0 rows affected (23.81 sec)
Records: 0  Duplicates: 0  Warnings: 0

--------------
select count(distinct a) from tt
--------------

*************************** 1. row ***************************
count(distinct a): 5
1 row in set (0.00 sec)

--------------
explain select count(distinct a) from tt
--------------

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: range
possible_keys: PRIMARY,d,a
          key: a
      key_len: 4
          ref: NULL
         rows: 2525
     filtered: 100.00
        Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select count(distinct `test`.`tt`.`a`) AS `count(distinct a)` from `test`.`tt`

Index for group-by is used with the primary key for SELECT MAX ...

mysql> explain SELECT MAX(b) FROM tt GROUP by a\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: range
possible_keys: PRIMARY,d,a
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 2378
     filtered: 100.00
        Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)

It doesn't use index for group-by with the PRIMARY KEY for AVG(DISTINCT ...) and SUM(DISTINCT ...)

It seems to be related to the implementation of WL#3220 where it opts for direct index scan instead of loose index scan for the primary key even though the loose index scan is faster.

6.2 Use Primary Key indexes as source of distinct rows

  The current optimization can be extended to detect DISTINCT over a
  primary key and rewrite AGG_FUNC(DISTINCT ...) into AGG_FUNC(...).
  In this case the optimization will not use loose index scan. It will
  be sufficient to use direct index scan.

Hi Monty,

First, one simple question.

Is the index correctly used when DISTINCT is not used ???

Also, can you send optimiser traces for both examples ???

Without using DISTINCT it chooses index d.

mysql> explain select count(a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: NULL
          key: d
      key_len: 4
          ref: NULL
         rows: 10462320
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.04 sec)

After removing the index on d, and without using DISTINCT, it chooses the index on a.

mysql> alter table tt modify d int unsigned NOT NULL, drop index d;
Query OK, 10485760 rows affected (1 min 42.33 sec)
Records: 10485760  Duplicates: 0  Warnings: 0

mysql> explain select count(a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: NULL
          key: a
      key_len: 4
          ref: NULL
         rows: 9714110
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

After removing the index on a it uses the PRIMARY KEY since that is the only remaining index.

mysql> alter table tt drop index a;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> explain select count(a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 16
          ref: NULL
         rows: 9714110
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.01 sec)

mysql> explain select count(distinct a) from tt\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tt
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 16
          ref: NULL
         rows: 9714110
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

mysql> select count(a) from tt\G
*************************** 1. row ***************************
count(a): 10485760
1 row in set (3.53 sec)

mysql>select count(distinct a) from tt\G
*************************** 1. row ***************************
count(distinct a): 5
1 row in set (2.83 sec)

mysql> show create table tt\G
*************************** 1. row ***************************
       Table: tt
Create Table: CREATE TABLE `tt` (
  `a` int(11) NOT NULL,
  `b` int(11) NOT NULL,
  `c` int(11) NOT NULL,
  `d` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`a`,`b`,`c`,`d`),
  KEY `d` (`d`),
  KEY `a` (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=10485761 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
1 row in set (0.00 sec)

mysql> select count(distinct a) from tt\G
*************************** 1. row ***************************
count(distinct a): 5
1 row in set (0.00 sec)

mysql> SELECT trace FROM information_schema.optimizer_trace      INTO OUTFILE '/usr/share/hubspot/tmp/trace_with_index_a.txt' LINES TERMINATED BY '';
Query OK, 1 row affected (0.01 sec)

mysql> alter table tt drop index a;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select count(distinct a) from tt\G
*************************** 1. row ***************************
count(distinct a): 5
1 row in set (3.19 sec)

mysql> SELECT trace FROM information_schema.optimizer_trace      INTO OUTFILE '/usr/share/hubspot/tmp/trace_without_index_a.txt' LINES TERMINATED BY '';
Query OK, 1 row affected (0.00 sec)

optimizer trace with index a present

Attachment: trace_with_index_a.txt (text/plain), 5.71 KiB.

optimizer trace without index a

Attachment: trace_without_index_a.txt (text/plain), 4.12 KiB.

Hi,

I have  read your original , actually first, comment and I fail to see the problem. 

Index on `d` is much better to be used for that query, since PRIMARY index is much wider and what is much worse, it is totally clustered, which makes a cost of using it much higher then using a secondary index, even the one on `d`.

Best index to be used is of course index on 'a', so it is correct choice. Do not forget that DISTINCT clause within the aggregating functions causes huge complex trees to be used in order to weed out distinct values.

Last, but not least, have you been able to get MUCH better performance by using FORCE INDEX on the PRIMARY ???

Once the index on d is removed it uses the index on a to get the better performance but that index should not be needed since it is a left-prefix duplicate of the primary.

I created that example and data to demonstrate the problem.

I observed that the query performance for SELECT COUNT(DISTINCT cid) on the following table was poor because it was using the primary key but not using the loose scan index for group-by. The only fix was to add a secondary index on cid which is counterintuitive because it is a redundant left prefix of the primary key. Once the secondary index was added the optimizer used the loose scan index for group-by on the secondary index and dramatically improved query performance. The optimizer should use the loose scan index for group-by with the primary key and not require that a duplicate secondary key be added.

CREATE TABLE property_vs (
  v varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  cid int(10) unsigned NOT NULL,
  pid int(10) unsigned NOT NULL,
  ts bigint(20) unsigned NOT NULL,
  PRIMARY KEY (cid,pid),
  KEY pid (pid),
  KEY pvc (pid,v,cid),
  KEY pts (pid,ts)
) ENGINE=InnoDB;

Hi,

Once again ..... index on `a` will be faster then PRIMARY, for the reasons that I already explained.

Also, do let us know that, when all three indices are present, does FORCE INDEX on the index of your choice yields better result or not ??????

I am reporting a bug that requires that a duplicate index be added because MySQL doesn't use index for group-by with the primary key when the query uses COUNT(DISTINCT).

According to WL#3220: Loose index scan for aggregate functions, when it was implemented it was assumed that it would be sufficient to use direct index scan of the primary key as the source of the distinct rows.

As demonstrated in this bug report that is not the case and a duplicate index needs to be added to get the benefit of the loose index scan.

The duplicate index adds overhead on inserts/updates and consumes space both on disk and in the buffer pool.

This can be demonstrated by using a sample InnoDB table which has a single column and a primary key on that column.

Create Table: CREATE TABLE `uu` (
  `a` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=559874 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Fill the table with 500K rows.

Execute SELECT COUNT(DISTINCT a) FROM uu\G

mysql> SELECT COUNT(DISTINCT a) FROM uu\G
*************************** 1. row ***************************
COUNT(DISTINCT a): 559873
1 row in set (0.38 sec)

Add an index on column a

ALTER TABLE uu ADD INDEX(a);

mysql> SHOW CREATE TABLE uu\G
*************************** 1. row ***************************
       Table: uu
Create Table: CREATE TABLE `uu` (
  `a` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`a`),
  KEY `a` (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=559874 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
1 row in set (0.00 sec)

Execute SELECT COUNT(DISTINCT a) FROM uu and observe that it takes less than half the time to complete.

mysql> SELECT COUNT(DISTINCT a) FROM uu\G
*************************** 1. row ***************************
COUNT(DISTINCT a): 559873
1 row in set (0.13 sec)

Hi,

After reading the WorkLog entry, I must admit that this represents a proper and fully justified feature request.

Verified as the feature request.