Description:
When filtering on two columns (with unique values), MySQL chooses an index on a single column when one is available, as it provides the optimal plan. However, if only a compound key is available, the optimizer seems unable to ignore the second index column when that column is also mentioned in the WHERE clause.
When using the second index column is not possible (because it is not referenced in the WHERE clause), MySQL successfully uses the first column alone and follows the optimal, less costly plan.
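The difference between the two access paths can be sketched with a toy model (this is NOT MySQL's real optimizer or cost model; the index is simulated with a sorted list and bisect, and only index dives are counted):

```python
import bisect

# Rows are (a, b) pairs; the compound index (a, b) is simply the rows kept sorted.

def ref_on_prefix(index, a, b_list):
    """ref access: one dive on a=<const>, then filter b IN (...) as an attached condition."""
    dives = 1
    lo = bisect.bisect_left(index, (a,))
    hi = bisect.bisect_right(index, (a, float("inf")))
    matching = [row for row in index[lo:hi] if row[1] in set(b_list)]
    return dives, matching

def range_on_both(index, a, b_list):
    """range access: one dive per (a, b) combination in the IN list."""
    dives = 0
    matching = []
    for b in b_list:
        dives += 1
        lo = bisect.bisect_left(index, (a, b))
        hi = bisect.bisect_right(index, (a, b))
        matching.extend(index[lo:hi])
    return dives, matching

index = sorted((i, i) for i in range(1, 1001))  # 1000 rows, like the test data below
print(ref_on_prefix(index, 1, [1, 2, 3, 4, 5, 6]))   # 1 dive, same matching rows
print(range_on_both(index, 1, [1, 2, 3, 4, 5, 6]))   # 6 dives
```

Both paths return the same rows, but the range path pays one index dive per IN value while the ref path pays a single dive on the prefix.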
How to repeat:
Tested below on 8.0.3; the same results are seen on 5.7.20 and 5.6.38.
mysql > show create table test5\G
*************************** 1. row ***************************
Table: test5
Create Table: CREATE TABLE `test5` (
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL,
`c` int(11) DEFAULT NULL,
`d` int(11) DEFAULT NULL,
KEY `a_d` (`a`,`d`),
KEY `a_b` (`a`,`b`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.11 sec)
$ for i in {1..1000};do mysql -S /tmp/mysql_sandbox12411.sock -umsandbox -pmsandbox test -e "insert into test5 values ($i,$i,$i,$i)"; done
mysql > EXPLAIN format=json select * FROM test5 WHERE a=1 AND b IN (1,2,3,4,5,6) \G
*************************** 1. row ***************************
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "0.35"
},
"table": {
"table_name": "test5",
"access_type": "ref",
"possible_keys": [
"a_d",
"a_b"
],
"key": "a_d",
"used_key_parts": [
"a"
],
"key_length": "5",
"ref": [
"const"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 0,
"filtered": "50.00",
"cost_info": {
"read_cost": "0.25",
"eval_cost": "0.05",
"prefix_cost": "0.35",
"data_read_per_join": "12"
},
"used_columns": [
"a",
"b",
"c",
"d"
],
"attached_condition": "(`test`.`test5`.`b` in (1,2,3,4,5,6))"
}
}
}
1 row in set, 1 warning (0.00 sec)
mysql > alter table test5 drop key a_d;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql > EXPLAIN format=json select * FROM test5 WHERE a=1 AND b IN (1,2,3,4,5,6) \G
*************************** 1. row ***************************
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "4.21"
},
"table": {
"table_name": "test5",
"access_type": "range",
"possible_keys": [
"a_b"
],
"key": "a_b",
"used_key_parts": [
"a",
"b"
],
"key_length": "10",
"rows_examined_per_scan": 6,
"rows_produced_per_join": 6,
"filtered": "100.00",
"index_condition": "((`test`.`test5`.`a` = 1) and (`test`.`test5`.`b` in (1,2,3,4,5,6)))",
"cost_info": {
"read_cost": "3.61",
"eval_cost": "0.60",
"prefix_cost": "4.21",
"data_read_per_join": "144"
},
"used_columns": [
"a",
"b",
"c",
"d"
]
}
}
}
1 row in set, 1 warning (0.00 sec)
master [localhost] {msandbox} (test) > alter table test5 add key(a);
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (test) > EXPLAIN format=json select * FROM test5 WHERE a=1 AND b IN (1,2,3,4,5,6) \G
*************************** 1. row ***************************
EXPLAIN: {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "0.35"
},
"table": {
"table_name": "test5",
"access_type": "ref",
"possible_keys": [
"a_b",
"a"
],
"key": "a",
"used_key_parts": [
"a"
],
"key_length": "5",
"ref": [
"const"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 0,
"filtered": "50.00",
"cost_info": {
"read_cost": "0.25",
"eval_cost": "0.05",
"prefix_cost": "0.35",
"data_read_per_join": "12"
},
"used_columns": [
"a",
"b",
"c",
"d"
],
"attached_condition": "(`test`.`test5`.`b` in (1,2,3,4,5,6))"
}
}
}
1 row in set, 1 warning (0.00 sec)
When only the (a,b) index is defined, query execution results in a 'Handler_read_key' counter equal to the number of IN values. The situation is the same even with 1000 values in the IN list.
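The scaling can be sketched as follows (assumption, matching the observed counters above: ref access on the 'a' prefix costs one key read regardless of the IN-list size, while range access on (a,b) costs one key read per IN value):

```python
# Simplified model of Handler_read_key growth under the two plans.
def handler_read_key(plan, in_values):
    # ref: single dive on a=const; range: one dive per IN value.
    return 1 if plan == "ref" else len(in_values)

for n in (6, 1000):
    vals = list(range(1, n + 1))
    print(n, handler_read_key("ref", vals), handler_read_key("range", vals))
```

With 1000 IN values the range plan performs 1000 key reads to fetch the same rows a single prefix lookup would return.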
Suggested fix:
If, in the above example, using the key on 'a' alone provides a better plan than using the (a,b) pair - "access_type": "ref", "query_cost": "0.35" vs. "access_type": "range", "query_cost": "4.21" - then why can't the optimizer simply ignore the 'b' key part of index (a,b) in this case?
When both the (a) and (a,b) keys are defined, the optimizer chooses (a) anyway.