MySQL Bugs: #25068: MyISAM-Optimizer prefers full table scan than index for IN with >15 elements

Bug #25068	MyISAM-Optimizer prefers full table scan than index for IN with >15 elements
Submitted:	14 Dec 2006 11:56	Modified:	16 Oct 2008 18:08
Reporter:	Anders Henke	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	4.0.27, 5.0.19, 5.0.36-BK	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any

Description:
I've noticed that SELECT statements with IN(..)-lists larger than 15 elements, the optimizer simply prefers a table scan rather than still using the index. This behaviour isn't documented in the optimizer the docs chapter 7, so I'm not sure wether this is a real bug or merely some undocumented issue.

At least, it's still surprising me a lot and many developers out there might not be aware that large IN-lists won't be using their indexes.

How to repeat:
I'll uploaded some sample data as bug_$bugid_dump.gz to the "secret" ftp incoming directory as soon as the bug has its own id. The dump originates on a 4.0.27-based server, but I've verified the issue on a server using 5.0.19.

mysql> explain SELECT count(*) as count FROM ibf_topics WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > '1166009794';
+------------+------+---------------+------+---------+------+------+-------------+
| table      | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------------+------+---------------+------+---------+------+------+-------------+
| ibf_topics | ALL  | forum_id      | NULL |    NULL | NULL | 9483 | Using where |
+------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain SELECT count(*) as count FROM ibf_topics USE INDEX(forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > '1166009794';
+------------+------+---------------+------+---------+------+------+-------------+
| table      | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------------+------+---------------+------+---------+------+------+-------------+
| ibf_topics | ALL  | forum_id      | NULL |    NULL | NULL | 9483 | Using where |
+------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql> explain SELECT count(*) as count FROM ibf_topics FORCE INDEX(forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > '1166009794';
+------------+-------+---------------+----------+---------+------+------+-------------+
| table      | type  | possible_keys | key      | key_len | ref  | rows | Extra       |
+------------+-------+---------------+----------+---------+------+------+-------------+
| ibf_topics | range | forum_id      | forum_id |       2 | NULL | 2115 | Using where |
+------------+-------+---------------+----------+---------+------+------+-------------+
1 row in set (0.00 sec)

In this case, the optimizer scans four times the number of rows when not FORCING it to use the index.

Suggested fix:
Optimize optimizer :)

As a workaround, "FORCE INDEX" really does work; but this issue is still
surprising me a lot and many developers out there might not be aware that large (auto-generated) IN-lists won't be using their indexes.

As some rule of thumb, I'd say that the optimizer should still prefer an range index scan rather than a full table scan if the amount of rows to be scanned is no larger than twice the amount of elements in the "IN"-statement.

Some timing measurements on a lightly loaded server:

mysql> SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics USE INDEX(forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > 1166009794;
+-------+
| count |
+-------+
|     7 |
+-------+
1 row in set (1.48 sec)

db122223253 mysql> SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics FORCE INDEX(forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > 1166009794;
+-------+
| count |
+-------+
|     7 |
+-------+
1 row in set (0.02 sec)

mysql> SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics USE INDEX(forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,189,8) AND last_post > 1166009794;
+-------+
| count |
+-------+
|     7 |
+-------+
1 row in set (0.87 sec)

I guess that the third query (identical to the first one) gives more suitable timings, as the first query spent much time of loading data into filesystem buffers and mysql caches, but in the end it's still a very dramatic increase in query time (more than 40 times as long).

Thank you for a problem report, and sorry for a delay with its processing. Please, try to repeat with a newer version of server, 5.0.27. In case of the same problem, please, send CREATE TABLE statement for that ibf_topics table, and some test data to demonstrate the behaviour described.

On Dec 14th, I've already sent a dump to the "secret" ftp upload (ftp.mysql.com:/pub/mysql/secret/), so that you can check this issue
on as many software releases as you'd like to do :-)

Verified just as described (on test data uploaded) with latest 5.0.36-BK on Linux. No big difference in execution time, with a single session, though :). 

Anyway:

mysql> analyze table ibf_topics;
+-----------------+---------+----------+-----------------------------+
| Table           | Op      | Msg_type | Msg_text                    |
+-----------------+---------+----------+-----------------------------+
| test.ibf_topics | analyze | status   | Table is already up to date |
+-----------------+---------+----------+-----------------------------+
1 row in set (0.00 sec)

mysql> SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics WHERE approved=1 A
ND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,10
0,193,8) AND last_post > 1166009794;
+-------+
| count |
+-------+
|     7 |
+-------+
1 row in set (0.02 sec)

mysql> EXPLAIN SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics WHERE appr
oved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,17
5,188,100,193,8) AND last_post > 1166009794;
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
| id | select_type | table      | type | possible_keys      | key  | key_len | r
ef  | rows | Extra       |
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
|  1 | SIMPLE      | ibf_topics | ALL  | forum_id,last_post | NULL | NULL    | N
ULL | 9483 | Using where |
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
1 row in set (0.00 sec)

mysql> SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics WHERE approved=1 A
ND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,10
0,193,8);
+-------+
| count |
+-------+
|  2081 |
+-------+
1 row in set (0.03 sec)

mysql> EXPLAIN SELECT SQL_NO_CACHE count(*) as count FROM ibf_topics WHERE appr
oved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,17
5,188,100,193,8);
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
| id | select_type | table      | type | possible_keys      | key  | key_len | r
ef  | rows | Extra       |
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
|  1 | SIMPLE      | ibf_topics | ALL  | forum_id,last_post | NULL | NULL    | N
ULL | 9483 | Using where |
+----+-------------+------------+------+--------------------+------+---------+--
----+------+-------------+
1 row in set (0.00 sec)

Rows estimation above are surely incorrect, and may lead to bad optimizer decisions.

1. In the case of the reported queries the optimizer just compares the costs of two possible execution plans and chooses the plan with the least one.
The cost of the plan with the full table scan:

mysql> EXPLAIN SELECT count(*) as count FROM ibf_topics WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,8) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ibf_topics
         type: ALL
possible_keys: forum_id,last_post
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9483
        Extra: Using where
1 row in set (0.00 sec)

is: 

mysql> show status like '%cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 2120.934937 |
+-----------------+-------------+

The cost of the plan with range scan:

mysql> EXPLAIN SELECT count(*) as count FROM ibf_topics FORCE INDEX (forum_id) WHERE approved=1 AND state != 'link' AND forum_id IN(65,63,252,69,67,6,218,242,159,158,175,188,100,193,8) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ibf_topics
         type: range
possible_keys: forum_id
          key: forum_id
      key_len: 2
          ref: NULL
         rows: 2152
        Extra: Using where
1 row in set (0.00 sec)

is:

mysql> show status like '%cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 3027.809000 |
+-----------------+-------------+
1 row in set (0.01 sec)

2. The number of records in an EXPLAIN output is the number of processed records, not the number of records in the result set.
Our manual says:
"The rows column indicates the number of rows MySQL believes it must examine to execute the query". 

For a query over one table this statement must be understood literally.

3. In the next comment I'll explain how we get those two costs.

3.1 The cost of the plan with full table scan is calculated as follows:

- the number of disk I/O to read all rows of the table: 214 (= number of blocks in the table)
- the total number of condition evaluations: 9483 (number of rows in the table)
the cost of the plan: 214 + 9483/5 = 2120.

Additional notes:
The cost of 1 condition evaluation equal to 1/5 of the cost of a disk I/O for this query should be considered as too high.
The cost of reading for all pages equal to 214 random I/O should be considered as too high as we know that sequential reading is much faster than random one.

3.2 The cost of the plan with range scan is calculated as this:

- the number of ranges : 15
- the number of records in the ranges (estimate): 2152
- the cost of index operations: 2152/5
- the cost of row fetching (15+2152)
- the cost of condition evaluation: 2152/5

the cost of the plan 2152/5 + (15+2152) + 2152/5 = 3027

Additional notes:
The cost of the row access should be considered as too high for this query as each range actually contains the same key value. If we take into account that index entries with the same key value are sorted by their ROWIDs then the average cost of a row fetch should be definitely less than the cost of one random disk I/O operation. The cost estimate misses the real cost even more because the last range contains more than 1000 rows:
mysql> SELECT count(*) as count FROM ibf_topics FORCE INDEX (forum_id) WHERE approved=1 AND state != 'link' AND forum_id = 8;
+-------+
| count |
+-------+
| 1105 |
+-------+
1 row in set (0.01 sec)

Thus to read the rows from this range we need at most 214 disk I/O operations. Nevertheless the cost od reading rows from this range is estimated as 1105.

Conclusions:
1. The choice of the plan for the reported query is done in full accordance with the primitive cost model adopted in 4.0/5.0 There is no bug in the optimizer that resulted in this decision.
2. This problem cannot be fixed in 4.0/5.0 as the required changes will affect plan selection for other queries.

Finally I can say that the cost model in 5.2 will be refined and there we can hope for a better automatic plan selection for the reported queries. Until that the only possible solution is the usage of hints.

$goods_count = $db->GetOne("Select count(distinct(goods_id)) from sd_goods_attr") ;

I use it .It cost 17.6M mem and CPU cost 100%. This is full table scan??