| Bug #28691 | GROUP BY optimization ignores functional dependency on unique/primary key | ||
|---|---|---|---|
| Submitted: | 25 May 2007 20:16 | Modified: | 23 Aug 2007 16:42 |
| Reporter: | Roland Bouman | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Server: Optimizer | Severity: | S3 (Non-critical) |
| Version: | 5.1.18 | OS: | Any |
| Assigned to: | Martin Hansson | CPU Architecture: | Any |
| Tags: | functional dependency, GROUP BY | ||
[30 May 2007 9:10]
Roland Bouman
Simplified test case: explain SELECT f.film_id , f.title , max(i.inventory_id) FROM film f LEFT JOIN inventory i ON f.film_id = i.film_id GROUP BY f.film_id , f.title
[30 May 2007 9:16]
Roland Bouman
At request of timour and joro, table definitions:
CREATE TABLE `film` (
`film_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`description` text,
`release_year` year(4) DEFAULT NULL,
`language_id` tinyint(3) unsigned NOT NULL,
`original_language_id` tinyint(3) unsigned DEFAULT NULL,
`rental_duration` tinyint(3) unsigned NOT NULL DEFAULT '3',
`rental_rate` decimal(4,2) NOT NULL DEFAULT '4.99',
`length` smallint(5) unsigned DEFAULT NULL,
`replacement_cost` decimal(5,2) NOT NULL DEFAULT '19.99',
`rating` enum('G','PG','PG-13','R','NC-17') DEFAULT 'G',
`special_features` set('Trailers','Commentaries','Deleted Scenes','Behind the Scenes') DEFAULT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`film_id`),
KEY `idx_fk_language_id` (`language_id`),
KEY `idx_fk_original_language_id` (`original_language_id`),
KEY `idx_title` (`title`),
CONSTRAINT `fk_film_language` FOREIGN KEY (`language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE,
CONSTRAINT `fk_film_language_original` FOREIGN KEY (`original_language_id`) REFERENCES `language` (`language_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1001 DEFAULT CHARSET=utf8
CREATE TABLE `inventory` (
`inventory_id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`film_id` smallint(5) unsigned NOT NULL,
`store_id` tinyint(3) unsigned NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`inventory_id`),
KEY `idx_fk_film_id` (`film_id`),
KEY `idx_store_id_film_id` (`store_id`,`film_id`),
CONSTRAINT `fk_inventory_film` FOREIGN KEY (`film_id`) REFERENCES `film` (`film_id`) ON UPDATE CASCADE,
CONSTRAINT `fk_inventory_store` FOREIGN KEY (`store_id`) REFERENCES `store` (`store_id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=4582 DEFAULT CHARSET=utf8
Note that the sakila database may be downloaded here: http://downloads.mysql.com/docs/sakila-db.tar.gz
[7 Aug 2007 10:33]
Martin Hansson
This was not a trivial problem. In fact, it consists of several problematic issues: 1) The fix for Bug#16458 introduced a bug. I give two examples. This bug displays itself for InnoDB in the following way: DROP TABLE IF EXISTS t1; CREATE TABLE t1( a INT NOT NULL PRIMARY KEY, b INT NOT NULL, KEY( b ) ) engine=innodb; INSERT INTO t1 VALUES ( 1, 1 ); INSERT INTO t1 VALUES ( 2, 3 ); INSERT INTO t1 VALUES ( 3, 2 ); SELECT t1.a, b FROM t1 GROUP BY t1.a, t1.b; -- +---+---+ -- | a | b | -- +---+---+ -- | 1 | 1 | -- | 3 | 2 | -- | 2 | 3 | -- +---+---+ -- The result is wrong and EXPLAIN tells us why (the wrong key is used): EXPLAIN SELECT t1.a, b FROM t1 GROUP BY t1.a, t1.b\g -- *************************** 1. row *************************** -- id: 1 -- select_type: SIMPLE -- table: t1 -- type: index -- possible_keys: NULL -- key: b -- key_len: 4 -- ref: NULL -- rows: 3 -- Extra: Using index But the bug is visible in MyISAM, too: CREATE TABLE t2( a INT, b INT NOT NULL, c INT NOT NULL, d INT, PRIMARY KEY (a,b), UNIQUE KEY (c,b) ); INSERT INTO t2 VALUES (1,1,1,50), (1,2,3,40), (2,1,3,4); EXPLAIN SELECT DISTINCT a,b,d FROM t2 GROUP BY c,b,d\G -- *************************** 1. row *************************** -- id: 1 -- select_type: SIMPLE -- table: t2 -- type: ALL -- possible_keys: NULL -- key: NULL -- key_len: NULL -- ref: NULL -- rows: 3 -- Extra: Here no key is used and the GROUP BY removed. Obviously, this plan is unrealistic, and accordingly the result is wrong: SELECT * FROM t2; -- +---+---+---+------+ -- | a | b | c | d | -- +---+---+---+------+ -- | 1 | 1 | 1 | 50 | -- | 1 | 2 | 3 | 40 | -- | 2 | 1 | 3 | 4 | -- +---+---+---+------+ The central idea of the fix for bug#16458 is valid but needs to used differently. It is based on the following assumption: Assumption 16458: If ORDER BY or GROUP BY matches all keyparts of a unique key, we can skip ORDER/GROUP BY. I will add to this: "...as long as that key is used for lookup" 2) The decision above should be placed within test_if_skip_sort_order instead of before it, since in there we have the code to actually use an index. 3) We also have to fix test_if_order_by_key since it doesn't take Assumption 16458 above into account either. 4) MyISAM and InnoDB have opposite attitudes towards index scan. ha_innodb::keys_to_use_for_scanning() is happy to return the set of all indexes, even such that don't exist. ha_myisam::keys_to_use_for_scanning() on the other hand invariably returns the empty set of keys. The problem here is that test_if_skip_sort_order listens too much to what MyISAM says in this matter (i.e. don't use any keys for scanning). It should use whatever index it finds appropriate. Note that this function alone is not responsible for hiding the problem in #1, but probably MyISAM gives a lower cost for table scan than index scan. 5) Primary keys in temporary tables, heap tables and tables from federated handlers are not supposed to get discovered by test_if_skip_sort_order, if they are the handler will core dump when instructed to use them. We will remove from consideration all keys from these tables.
[7 Aug 2007 16:52]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/32209 ChangeSet@1.2485, 2007-08-07 18:51:43+02:00, mhansson@linux-st28.site +15 -0 Bug #28691: GROUP BY optimization ignores functional dependency on unique/primary key In the general case, MySQL did not exploit ordered keys in order to skip GROUP BY and ORDER by, and when it did results were sometimes sorted wrong. The problem consisted of: 1) GROUP BY was removed without choosing the ordered index 2) test_if_skip_sort_order and test_if_order_by_key failed to take into account that a full match on one unique ordered key is enough to skip GROUP BY once (1) is fixed. (some limitations apply, see comment for sql_select.cc) 3) Some storage engines (e.g. MyISAM) may not recommend index scan over table scan, and test_if_skip_sort_order did not consider 'unrecommended' keys. Fixed by 1) moving the skipping of GROUP BY to the place where an index is chosen, so that not one is done without the other, 2) Adding to test_if_skip_sort_order and test_if_order_by_key the ability to recognize good indexes, and 3) by not paying attention to what the storage engine thinks is good use of indexes in this case.
[22 Aug 2007 10:54]
Timour Katchaounov
Increased priority to P2 as the bug results in wrong result (incorrect result order).
[23 Aug 2007 15:15]
Martin Hansson
The problem with wrong result order is a different bug: Bug#30596. Therefore, I change the priority back to low.
[23 Aug 2007 15:24]
Martin Hansson
According to Georgi Kodinov, he had a discussion with Igor Babaev, concluding that GROUP BY should not be computed using indexes in multi-table selects. This is obvious unless all tables have unique constraints on the joined field(s) *and* on the grouped fields. But for more than two tables, this gets pretty complex.
[23 Aug 2007 16:42]
Timour Katchaounov
According to the previous comment from Martin, this is not a bug. In the course of analyzing this bug Martin discovered another related problem, which will be fixed as BUG#30596.

Description: A GROUP BY query that includes the primary key of a table as well as other columns of the table does not always use the primary key to solve the GROUP BY, leading to decrease in performance. MySQL is committed to returning rows in order of the GROUP BY clause, this means that a query like: SELECT T.pkcols, T.other_cols FROM T GROUP BY T.pkcols, T.other_cols (For a more real world example of this type of query see how to repeat) can always ignore T.other_cols to make the groups. However this is not the case. It is true that the user could be required to simply include only T.pkcols in the GROUP BY clause but unfortunately, SQL 92 would not allow T.other_cols to be SELECT-ed if the GROUP BY clause does not include T.other_cols. It would be How to repeat: This case works as expected: mysql> EXPLAIN -> SELECT f.film_id -> , f.title -> , sum(p.amount) sum_amount -> FROM film f -> LEFT JOIN inventory i -> ON f.film_id = i.film_id -> LEFT JOIN rental r -> ON i.inventory_id = r.inventory_id -> LEFT JOIN payment p -> ON r.rental_id = p.rental_id -> GROUP BY f.film_id -> HAVING sum_amount > 300 -> \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: f type: index possible_keys: NULL key: PRIMARY key_len: 2 ref: NULL rows: 953 Extra: *************************** 2. row *************************** id: 1 select_type: SIMPLE table: i type: ref possible_keys: idx_fk_film_id key: idx_fk_film_id key_len: 2 ref: sakila.f.film_id rows: 2 Extra: Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: r type: ref possible_keys: idx_fk_inventory_id key: idx_fk_inventory_id key_len: 3 ref: sakila.i.inventory_id rows: 1 Extra: Using index *************************** 4. row *************************** id: 1 select_type: SIMPLE table: p type: ref possible_keys: fk_payment_rental key: fk_payment_rental key_len: 5 ref: sakila.r.rental_id rows: 1 Extra: 4 rows in set (0.00 sec) The following case does not work as expected, plan should have been the same as the previous one: mysql> EXPLAIN -> SELECT f.film_id -> , f.title -> , sum(p.amount) sum_amount -> FROM film f -> LEFT JOIN inventory i -> ON f.film_id = i.film_id -> LEFT JOIN rental r -> ON i.inventory_id = r.inventory_id -> LEFT JOIN payment p -> ON r.rental_id = p.rental_id -> GROUP BY f.film_id -> , f.title -> HAVING sum_amount > 300 -> \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: f type: index possible_keys: NULL key: idx_title key_len: 767 ref: NULL rows: 953 Extra: Using index; Using temporary; Using filesort *************************** 2. row *************************** id: 1 select_type: SIMPLE table: i type: ref possible_keys: idx_fk_film_id key: idx_fk_film_id key_len: 2 ref: sakila.f.film_id rows: 2 Extra: Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: r type: ref possible_keys: idx_fk_inventory_id key: idx_fk_inventory_id key_len: 3 ref: sakila.i.inventory_id rows: 1 Extra: Using index *************************** 4. row *************************** id: 1 select_type: SIMPLE table: p type: ref possible_keys: fk_payment_rental key: fk_payment_rental key_len: 5 ref: sakila.r.rental_id rows: 1 Extra: 4 rows in set (0.01 sec) In reality, the second query takes about 2 to 3 times as long to complete. Suggested fix: Please check if the GROUP BY includes a primary or unique key. If so, use that to perform the GROUP BY algorithm. In some cases this might not work: GROUP BY T.other_cols, T.pk_cols could be a problem as MySQL is committed to returning the rows in order of the GROUP BY clause. It would be a nice feature to allow the user to override this behaviour (by means of SQL_MODE) to allow the proposed optimization to be applied here too.