Bug #67980 MySQL not utilising full index because of possible bad cardinality stats
Submitted: 27 Dec 2012 14:54 Modified: 19 Jan 2013 10:52
Reporter: Ovais Tariq Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.5.28 OS:Any
Assigned to: CPU Architecture:Any

[27 Dec 2012 14:54] Ovais Tariq
Description:
I am running MySQL version 5.5.28

I have the following table:

CREATE TABLE `t1` (
  `i1` int(11) NOT NULL AUTO_INCREMENT,
  `i2` int(11) DEFAULT NULL,
  `i3` int(11) DEFAULT NULL,
  `c1` varchar(1) DEFAULT NULL,
  KEY `i1_1` (`i1`),
  KEY `c1` (`c1`,`i2`,`i3`)
) ENGINE=InnoDB AUTO_INCREMENT=3062 DEFAULT CHARSET=latin1;

And the following query:
select * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10) and i3 in (9,11,15,20);

MySQL does not seem to use the full index for the above query. That seems to come from the fact that MySQL is generating wrong no. of rows estimation. MySQL only uses all the columns of the index when I force it.

Here is the explain plan of the query:
mysql [localhost] {msandbox} (test) > explain select * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10) and i3 in (9,11,15,20);
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | t1    | ref  | c1            | c1   | 4       | const |  733 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
1 row in set (0.00 sec)

MySQL is clearly only using the column `c1` from the index. 

Now if I use FORCE INDEX:
mysql [localhost] {msandbox} (test) > explain select * from t1 force index (c1) where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10) and i3 in (9,11,15,20);
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t1    | range | c1            | c1   | 14      | NULL |  733 | Using where |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

You can see key_len has changed and its now using the entire index. Note the difference in the column 'type', without force_index type=ref, and with force_index type=range.

Now if I remove the filering on i2 and i3 from the query and only let the condition on the column `c1` remain, following is how the explain output looks:
mysql [localhost] {msandbox} (test) > explain select * from t1 where c1='d';
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | t1    | ref  | c1            | c1   | 4       | const | 1647 | Using where |
+----+-------------+-------+------+---------------+------+---------+-------+------+-------------+
1 row in set (0.00 sec)

This is the plan that MySQL takes for the original query as well, but note the difference in no. of rows estimation. There is something wrong with MySQL estimation that is causing the no. of rows estimation to be off. This causes a real pain on real workload, when sometimes the query is fast because its utilising the full index and sometimes its slow. Increasing the value of innodb_stats_sample_pages has no affect.

How to repeat:
Load the dump file 'test_t1.dump' attached to the bug report and execute the queries as shown above.
[27 Dec 2012 14:54] Ovais Tariq
Dump file to test the queries with

Attachment: test_t1.dump (application/octet-stream, text), 33.31 KiB.

[27 Dec 2012 15:32] Valeriy Kravchuk
Looks like a kind of good old bug #12113 (that nobody actually is going to fix it seems, even though fix for post 5.1 was once planned).
[28 Dec 2012 11:04] Ovais Tariq
Valerii,

Yes it looks similar, but there is no order by and limit involved. Anyhow it looks very wrong for the optimizer to not use the index completely in this kind of query, by default.
[29 Dec 2012 12:39] Valeriy Kravchuk
It's easy to verify that the same problem happens with recent 5.5.29 and with MyISAM table instead of InnoDB, even after ANALYZE. So, this is more fundamental than just incorrect statistics estimation at InnoDB level.

For current simplified test case the problem may not be that visible or serious:

C:\Program Files\MySQL\MySQL Server 5.5\bin>mysqlslap -uroot -proot -P3312 --con
currency=1 --create-schema=test --no-drop --number-of-queries=1000 --iterations=
10 --query="select * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10) and i
3 in (9,11,15,20)"
Benchmark
        Average number of seconds to run all queries: 3.807 seconds
        Minimum number of seconds to run all queries: 2.841 seconds
        Maximum number of seconds to run all queries: 4.091 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1000

C:\Program Files\MySQL\MySQL Server 5.5\bin>mysqlslap -uroot -proot -P3312 --con
currency=1 --create-schema=test --no-drop --number-of-queries=1000 --iterations=
10 --query="select * from t1 force index(c1) where c1='d' and i2 in (1,2,3,4,5,6
,7,8,9,10) and i3 in (9,11,15,20)"
Benchmark
        Average number of seconds to run all queries: 3.615 seconds
        Minimum number of seconds to run all queries: 2.810 seconds
        Maximum number of seconds to run all queries: 4.075 seconds
        Number of clients running queries: 1
        Average number of queries per client: 1000

but in general case it may become serious.
[29 Dec 2012 17:04] MySQL Verification Team
Looks fixed in mysql-trunk, have to find out which bugfix it was.

mysql> explain select * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10) and i3 in (9,11,15,20);
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
|  1 | SIMPLE      | t1    | range | c1            | c1   | 14      | NULL |  120 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
1 row in set (0.00 sec)

mysql> select version();
+-----------------+
| version()       |
+-----------------+
| 5.7.1-m11-debug |
+-----------------+
1 row in set (0.00 sec)
[8 Jan 2013 8:47] Jørgen Løland
Thank you for the bug report. Verified in 5.1, 5.5 and 5.6 (using eq_range_index_dive_limit > 40).
[19 Jan 2013 10:00] Ovais Tariq
Hi Jørgen,

I used optimizer_tracing to understand why Optimizer is choosing the plan it is choosing. I set the variable eq_range_index_dive_limit to both 0 and 1 to test what happens when index dives are used to compute cost or when statistics are used to compute cost. In both cases the optimizer is computing the cost of ref access (using only 1 column of the index) to be 6x to 8x lower as compared to range access (when all the columns of the index are used). Furthermore, the cost of range access is calculated to be 1.5x to 2x more as compared to table_scan.

-- use index statistics when estimating the number of rows in the ranges of the query
            "rows_estimation": [
              {
                "table": "`t1`",
                "range_analysis": {
                  "table_scan": {
                    "rows": 2048,
                    "cost": 418.7
                  },
…
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      { 
                        "index": "c1",
                        "ranges": [
                          "d <= c1 <= d AND 1 <= i2 <= 1",
                          "d <= c1 <= d AND 2 <= i2 <= 2",
                          "d <= c1 <= d AND 3 <= i2 <= 3",
                          "d <= c1 <= d AND 4 <= i2 <= 4",
                          "d <= c1 <= d AND 5 <= i2 <= 5",
                          "d <= c1 <= d AND 6 <= i2 <= 6",
                          "d <= c1 <= d AND 7 <= i2 <= 7",
                          "d <= c1 <= d AND 8 <= i2 <= 8",
                          "d <= c1 <= d AND 9 <= i2 <= 9",
                          "d <= c1 <= d AND 10 <= i2 <= 10"
                        ],
                        "index_dives_for_eq_ranges": false,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "rows": 510,
                        "cost": 622.01,
                        "chosen": false,
                        "cause": "cost"
                      }
…
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "c1",
                      "rows": 510,
                      "cost": 123,
                      "chosen": true
                    },
                    {
                      "access_type": "scan",
                      "rows": 510,
                      "cost": 416.6,
                      "chosen": false
                    }
                  ]

-- do index dives
            "rows_estimation": [
              {
                "table": "`t1`",
                "range_analysis": {
                  "table_scan": {
                    "rows": 2048,
                    "cost": 418.7
                  },
…
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      { 
                        "index": "c1",
                        "ranges": [
                          "d <= c1 <= d AND 1 <= i2 <= 1",
                          "d <= c1 <= d AND 2 <= i2 <= 2",
                          "d <= c1 <= d AND 3 <= i2 <= 3",
                          "d <= c1 <= d AND 4 <= i2 <= 4",
                          "d <= c1 <= d AND 5 <= i2 <= 5",
                          "d <= c1 <= d AND 6 <= i2 <= 6",
                          "d <= c1 <= d AND 7 <= i2 <= 7",
                          "d <= c1 <= d AND 8 <= i2 <= 8",
                          "d <= c1 <= d AND 9 <= i2 <= 9",
                          "d <= c1 <= d AND 10 <= i2 <= 10"
                        ],
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "rows": 703,
                        "cost": 853.61,
                        "chosen": false,
                        "cause": "cost"
                      }
                    ]
…
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "c1",
                      "rows": 703,
                      "cost": 161.6,
                      "chosen": true
                    },
                    {
                      "access_type": "scan",
                      "rows": 703,
                      "cost": 416.6,
                      "chosen": false
                    }
                  ]

So it appears to me that there is something wrong with the cost calculation. The cost of doing range scan is not 6x to 8x times more, certainly that is not shown in the query execution times.

+----------+------------+---------------------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                               |
+----------+------------+---------------------------------------------------------------------------------------------------------------------+
|       26 | 0.00205925 | select sql_no_cache * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10)                                         |
|       27 | 0.00193075 | select sql_no_cache * from t1 force index(c1) where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10)                         |
|       28 | 0.00203975 | select sql_no_cache * from t1 ignore index(c1) where c1='d' and i2 in (1,2,3,4,5,6,7,8,9,10)                        |
+----------+------------+---------------------------------------------------------------------------------------------------------------------+

Btw thanks for the informative blog post: http://jorgenloland.blogspot.com/2012/04/on-queries-with-many-values-in-in.html
[19 Jan 2013 10:52] Ovais Tariq
Though I must say that the variable eq_range_index_dive_limit does have affect on the query execution plan:

mysql [localhost] {msandbox} (test) > set session eq_range_index_dive_limit=10;
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (test) > explain select sql_no_cache * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7);
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
|  1 | SIMPLE      | t1    | range | c1            | c1   | 9       | NULL |  304 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
1 row in set (0.00 sec)

mysql [localhost] {msandbox} (test) > set session eq_range_index_dive_limit=1;
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (test) > explain select sql_no_cache * from t1 where c1='d' and i2 in (1,2,3,4,5,6,7);
+----+-------------+-------+------+---------------+------+---------+-------+------+-----------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra                 |
+----+-------------+-------+------+---------------+------+---------+-------+------+-----------------------+
|  1 | SIMPLE      | t1    | ref  | c1            | c1   | 4       | const |  357 | Using index condition |
+----+-------------+-------+------+---------------+------+---------+-------+------+-----------------------+
1 row in set (0.00 sec)

As you can see that when index dives are used the full index is used, however when pre-calculated index_statistics are used then a single column is used from the index.