Bug #91386 | Index for group-by is not used with primary key for SELECT COUNT(DISTINCT a) | ||
---|---|---|---|
Submitted: | 24 Jun 2018 0:30 | Modified: | 29 Jun 2018 12:23 |
Reporter: | monty solomon | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S4 (Feature request) |
Version: | 5.7.18 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[24 Jun 2018 0:30]
monty solomon
[24 Jun 2018 0:33]
monty solomon
Here are the steps with their associated output (show warnings is enabled) -------------- CREATE TABLE tt (a int, b int, c int) ENGINE=InnoDB -------------- Query OK, 0 rows affected (0.03 sec) -------------- insert into tt values (1,1,1), (2,2,2), (3,3,3), (4,4,4), (5,5,5) -------------- Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 10 rows affected (0.00 sec) Records: 10 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 20 rows affected (0.00 sec) Records: 20 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 40 rows affected (0.00 sec) Records: 40 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 80 rows affected (0.01 sec) Records: 80 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 160 rows affected (0.00 sec) Records: 160 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 320 rows affected (0.00 sec) Records: 320 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 640 rows affected (0.01 sec) Records: 640 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 1280 rows affected (0.01 sec) Records: 1280 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 2560 rows affected (0.02 sec) Records: 2560 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 5120 rows affected (0.03 sec) Records: 5120 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 10240 rows affected (0.13 sec) Records: 10240 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 20480 rows affected (0.12 sec) Records: 20480 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 40960 rows affected (0.25 sec) Records: 40960 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 81920 rows affected (0.50 sec) Records: 81920 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 163840 rows affected (0.97 sec) Records: 163840 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 327680 rows affected (1.95 sec) Records: 327680 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 655360 rows affected (4.05 sec) Records: 655360 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 1310720 rows affected (9.55 sec) Records: 1310720 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 2621440 rows affected (18.22 sec) Records: 2621440 Duplicates: 0 Warnings: 0 -------------- insert into tt select * from tt -------------- Query OK, 5242880 rows affected (38.36 sec) Records: 5242880 Duplicates: 0 Warnings: 0 -------------- alter table tt add d int unsigned auto_increment, add index(d), add primary key(a,b,c,d) -------------- Query OK, 0 rows affected (46.96 sec) Records: 0 Duplicates: 0 Warnings: 0 -------------- select count(distinct a) from tt -------------- *************************** 1. row *************************** count(distinct a): 5 1 row in set (2.60 sec) -------------- explain select count(distinct a) from tt -------------- *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: index possible_keys: PRIMARY,d key: d key_len: 4 ref: NULL rows: 10462320 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select count(distinct `test`.`tt`.`a`) AS `count(distinct a)` from `test`.`tt` -------------- alter table tt add index (a) -------------- Query OK, 0 rows affected (23.81 sec) Records: 0 Duplicates: 0 Warnings: 0 -------------- select count(distinct a) from tt -------------- *************************** 1. row *************************** count(distinct a): 5 1 row in set (0.00 sec) -------------- explain select count(distinct a) from tt -------------- *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: range possible_keys: PRIMARY,d,a key: a key_len: 4 ref: NULL rows: 2525 filtered: 100.00 Extra: Using index for group-by 1 row in set, 1 warning (0.00 sec) Note (Code 1003): /* select#1 */ select count(distinct `test`.`tt`.`a`) AS `count(distinct a)` from `test`.`tt`
[24 Jun 2018 0:39]
monty solomon
Index for group-by is used with the primary key for SELECT MAX ... mysql> explain SELECT MAX(b) FROM tt GROUP by a\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: range possible_keys: PRIMARY,d,a key: PRIMARY key_len: 4 ref: NULL rows: 2378 filtered: 100.00 Extra: Using index for group-by 1 row in set, 1 warning (0.00 sec)
[24 Jun 2018 0:46]
monty solomon
It doesn't use index for group-by with the PRIMARY KEY for AVG(DISTINCT ...) and SUM(DISTINCT ...)
[24 Jun 2018 1:43]
monty solomon
It seems to be related to the implementation of WL#3220 where it opts for direct index scan instead of loose index scan for the primary key even though the loose index scan is faster. 6.2 Use Primary Key indexes as source of distinct rows The current optimization can be extended to detect DISTINCT over a primary key and rewrite AGG_FUNC(DISTINCT ...) into AGG_FUNC(...). In this case the optimization will not use loose index scan. It will be sufficient to use direct index scan.
[25 Jun 2018 14:43]
MySQL Verification Team
Hi Monty, First, one simple question. Is the index correctly used when DISTINCT is not used ??? Also, can you send optimiser traces for both examples ???
[26 Jun 2018 3:46]
monty solomon
Without using DISTINCT it chooses index d. mysql> explain select count(a) from tt\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: index possible_keys: NULL key: d key_len: 4 ref: NULL rows: 10462320 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.04 sec) After removing the index on d, and without using DISTINCT, it chooses the index on a. mysql> alter table tt modify d int unsigned NOT NULL, drop index d; Query OK, 10485760 rows affected (1 min 42.33 sec) Records: 10485760 Duplicates: 0 Warnings: 0 mysql> explain select count(a) from tt\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: index possible_keys: NULL key: a key_len: 4 ref: NULL rows: 9714110 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.00 sec) After removing the index on a it uses the PRIMARY KEY since that is the only remaining index. mysql> alter table tt drop index a; Query OK, 0 rows affected (0.03 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> explain select count(a) from tt\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: index possible_keys: NULL key: PRIMARY key_len: 16 ref: NULL rows: 9714110 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.01 sec) mysql> explain select count(distinct a) from tt\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: tt partitions: NULL type: index possible_keys: PRIMARY key: PRIMARY key_len: 16 ref: NULL rows: 9714110 filtered: 100.00 Extra: Using index 1 row in set, 1 warning (0.00 sec) mysql> select count(a) from tt\G *************************** 1. row *************************** count(a): 10485760 1 row in set (3.53 sec) mysql>select count(distinct a) from tt\G *************************** 1. row *************************** count(distinct a): 5 1 row in set (2.83 sec)
[26 Jun 2018 3:56]
monty solomon
mysql> show create table tt\G *************************** 1. row *************************** Table: tt Create Table: CREATE TABLE `tt` ( `a` int(11) NOT NULL, `b` int(11) NOT NULL, `c` int(11) NOT NULL, `d` int(10) unsigned NOT NULL AUTO_INCREMENT, PRIMARY KEY (`a`,`b`,`c`,`d`), KEY `d` (`d`), KEY `a` (`a`) ) ENGINE=InnoDB AUTO_INCREMENT=10485761 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 1 row in set (0.00 sec) mysql> select count(distinct a) from tt\G *************************** 1. row *************************** count(distinct a): 5 1 row in set (0.00 sec) mysql> SELECT trace FROM information_schema.optimizer_trace INTO OUTFILE '/usr/share/hubspot/tmp/trace_with_index_a.txt' LINES TERMINATED BY ''; Query OK, 1 row affected (0.01 sec) mysql> alter table tt drop index a; Query OK, 0 rows affected (0.03 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> select count(distinct a) from tt\G *************************** 1. row *************************** count(distinct a): 5 1 row in set (3.19 sec) mysql> SELECT trace FROM information_schema.optimizer_trace INTO OUTFILE '/usr/share/hubspot/tmp/trace_without_index_a.txt' LINES TERMINATED BY ''; Query OK, 1 row affected (0.00 sec)
[26 Jun 2018 3:57]
monty solomon
optimizer trace with index a present
Attachment: trace_with_index_a.txt (text/plain), 5.71 KiB.
[26 Jun 2018 3:58]
monty solomon
optimizer trace without index a
Attachment: trace_without_index_a.txt (text/plain), 4.12 KiB.
[26 Jun 2018 16:44]
MySQL Verification Team
Hi, I have read your original , actually first, comment and I fail to see the problem. Index on `d` is much better to be used for that query, since PRIMARY index is much wider and what is much worse, it is totally clustered, which makes a cost of using it much higher then using a secondary index, even the one on `d`. Best index to be used is of course index on 'a', so it is correct choice. Do not forget that DISTINCT clause within the aggregating functions causes huge complex trees to be used in order to weed out distinct values. Last, but not least, have you been able to get MUCH better performance by using FORCE INDEX on the PRIMARY ???
[26 Jun 2018 23:25]
monty solomon
Once the index on d is removed it uses the index on a to get the better performance but that index should not be needed since it is a left-prefix duplicate of the primary. I created that example and data to demonstrate the problem.
[26 Jun 2018 23:27]
monty solomon
I observed that the query performance for SELECT COUNT(DISTINCT cid) on the following table was poor because it was using the primary key but not using the loose scan index for group-by. The only fix was to add a secondary index on cid which is counterintuitive because it is a redundant left prefix of the primary key. Once the secondary index was added the optimizer used the loose scan index for group-by on the secondary index and dramatically improved query performance. The optimizer should use the loose scan index for group-by with the primary key and not require that a duplicate secondary key be added. CREATE TABLE property_vs ( v varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL, cid int(10) unsigned NOT NULL, pid int(10) unsigned NOT NULL, ts bigint(20) unsigned NOT NULL, PRIMARY KEY (cid,pid), KEY pid (pid), KEY pvc (pid,v,cid), KEY pts (pid,ts) ) ENGINE=InnoDB;
[27 Jun 2018 13:05]
MySQL Verification Team
Hi, Once again ..... index on `a` will be faster then PRIMARY, for the reasons that I already explained. Also, do let us know that, when all three indices are present, does FORCE INDEX on the index of your choice yields better result or not ??????
[29 Jun 2018 3:10]
monty solomon
I am reporting a bug that requires that a duplicate index be added because MySQL doesn't use index for group-by with the primary key when the query uses COUNT(DISTINCT). According to WL#3220: Loose index scan for aggregate functions, when it was implemented it was assumed that it would be sufficient to use direct index scan of the primary key as the source of the distinct rows. As demonstrated in this bug report that is not the case and a duplicate index needs to be added to get the benefit of the loose index scan. The duplicate index adds overhead on inserts/updates and consumes space both on disk and in the buffer pool. This can be demonstrated by using a sample InnoDB table which has a single column and a primary key on that column. Create Table: CREATE TABLE `uu` ( `a` int(10) unsigned NOT NULL AUTO_INCREMENT, PRIMARY KEY (`a`) ) ENGINE=InnoDB AUTO_INCREMENT=559874 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci Fill the table with 500K rows. Execute SELECT COUNT(DISTINCT a) FROM uu\G mysql> SELECT COUNT(DISTINCT a) FROM uu\G *************************** 1. row *************************** COUNT(DISTINCT a): 559873 1 row in set (0.38 sec) Add an index on column a ALTER TABLE uu ADD INDEX(a); mysql> SHOW CREATE TABLE uu\G *************************** 1. row *************************** Table: uu Create Table: CREATE TABLE `uu` ( `a` int(10) unsigned NOT NULL AUTO_INCREMENT, PRIMARY KEY (`a`), KEY `a` (`a`) ) ENGINE=InnoDB AUTO_INCREMENT=559874 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 1 row in set (0.00 sec) Execute SELECT COUNT(DISTINCT a) FROM uu and observe that it takes less than half the time to complete. mysql> SELECT COUNT(DISTINCT a) FROM uu\G *************************** 1. row *************************** COUNT(DISTINCT a): 559873 1 row in set (0.13 sec)
[29 Jun 2018 12:23]
MySQL Verification Team
Hi, After reading the WorkLog entry, I must admit that this represents a proper and fully justified feature request. Verified as the feature request.