MySQL Bugs: #77504: Optimizer choosing bad access path with uneven distribution of keys

Bug #77504	Optimizer choosing bad access path with uneven distribution of keys
Submitted:	26 Jun 2015 12:55	Modified:	9 Jul 2015 12:07
Reporter:	Aaron Craven	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
As suggested in the forum (http://forums.mysql.com/read.php?115,632170,632290#msg-632290), I am submitting this test case as a bug. This is slightly different than the test case presented there -- I have more records in my data, I have made the unevenly distributed key a bit less extreme, and I have evenly distributed some values that previously were unevenly distributed because they do not seem to contribute to the issue.:

With a multi-table join that involves a combination of relatively large tables and a relatively small table (as well as an uneven distribution of values in an indexed column on the larger tables), MySQL appears to be choosing an access path that doesn't make sense. 

This behavior is present in 5.6, but does not appear to be present in 5.0 (though I do not have access to create this particular scenario in our 5.0 server at the moment, because it is a production machine -- the test case is intended as a simplification of the production issue)

How to repeat:
My scenario consists of three tables, an "entries" table, an "entries group" table, and a "category" table (basically just a value lookup table) as follows: 

CREATE TABLE `entry` ( 
`id` int(11) NOT NULL AUTO_INCREMENT, 
`group_id` int(11) NOT NULL, 
`entry_val` smallint(6) NOT NULL, 
`uneven_val` smallint(6) NOT NULL, 
PRIMARY KEY (`id`), 
KEY `idx_entry_on_group_id` (`group_id`), 
KEY `idx_entry_on_uneven_val` (`uneven_val`) 
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT=''This is a test table.'' 

CREATE TABLE `entry_group` ( 
`id` int(11) NOT NULL AUTO_INCREMENT, 
`category_id` smallint(6) NOT NULL, 
`grp_val1` smallint(6) NOT NULL, 
`grp_val2` smallint(6) NOT NULL, 
PRIMARY KEY (`id`), 
KEY `idx_entry_group_on_category_id` (`category_id`) 
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='This is a test table.' 

CREATE TABLE `category` ( 
`id` smallint(6) NOT NULL AUTO_INCREMENT, 
`name` varchar(10) NOT NULL, 
PRIMARY KEY (`id`), 
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='This is a test table.' 

entry: 
250K rows were inserted into entries, randomly choosing when to create a new associated group record (see below), creating an average group size of 10 entries. For each entry, entry_val is chosen randomly from the entire range of smallint values. For uneven_val, about 80% of the records were given a single value (212), and the rest are assigned a random smallint value. 

entry_group: 
This table is populated during creation of entry records (see above), which results in approximately 25K records (25089 in my test). grp_val1 and grp_val2 are randomly assigned a smallint value (and are not used in this test). category_id is also randomly assigned one of the 7 ids in the category table. 

So everything is evenly distributed except the value of entry.uneven_val. 

If you would like the source of the Java program I used to populate entry and entry_group, I can provide that.

I then experimented with this query:

select e.id, g.id, c.id
from   entry e inner join
       entry_group g on
          g.id = e.group_id inner join
       category c on
          c.id = g.category_id
where  e.uneven_val between 300 and 10000;

Since there are no conditions limiting values in any table except entry, I expected the query to start with the entry table, utilizing the idx_entry_on_uneven_val index. Instead, it starts with category and works its way across, reading (presumably) every record (or at least an index entry for every record) in all three tables unless the values in the range predicate represent a very small range (the break even point seems to be "e.uneven_val between 300 and 5900" for my data... this represents approximately 8600 or 3% of the 250000 records in the entry table).

table, index, explain, and profile results

Attachment: explain_and_profile.txt (text/plain), 214.49 KiB.

optimizer trace

Attachment: optimizer_trace.txt (text/plain), 25.53 KiB.

I have studied your bug report. I have compared it with bugs and features that were implemented in versions 5.0 to 5.6.

This is an imperfection and flaw that is truly a consequence of our current optimizer design. However, there are plans to further improve the optimizer and to introduce more modules that rely on cost-based optimizer then on current heuristics. That work will solve problems like this one.

There is one workaround, that is not elegant, which is to add USE or FORCE keywords into the query.

Fully verified as the feature request.

Understood and thanks for your reply. I will look forward to seeing this resolved in later versions.