MySQL Bugs: #14940: MySQL choose wrong index

Bug #14940	MySQL choose wrong index
Submitted:	15 Nov 2005 10:12	Modified:	20 Jun 2010 22:33
Reporter:	Victoria Reznichenko	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.0.17	OS:	Any (any)
Assigned to:		CPU Architecture:	Any

Description:
MySQL picks up wrong index incidents$c_id and it takes 16 secs to complete the query. If you add use index (incidents$queue_id) to the query it takes only 2 secs to complete the query.

How to repeat:
1. restore table from the dump.
2. execute query with USE INDEX clause and without.
3. check EXPLAIN output.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/32652

The fix is not complete yet

I've ported the fix up into 5.0.
However 5.0 is already GA, so according to Igor we're going to make changes in 5.1. 
In 5.1:
 * it will work
 * EXPLAIN output will have additional column next to "rows" which will show an estimate on what percentage of records will be filtered out when applying table condition. (i.e. rows column will always show #rows examined from the table, and (#rows * #filtered_out/100) will show number of records that will be joined with previous tables)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9211

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9212

I'm now working on making the following changes with the fix: 

Changes in EXPLAIN columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
The "rows" column will be made to always contain #1 (ie. number of rows to
be scanned). Currently this is violated in some cases.

A new "fanout" column will be introduced. The column will be displayed when
specified by EXPLAIN options (see below on what the options are).

 Place:    right after the "rows" column
 Name:     "fanout"
 Domain:   float with 2 decimal digits, 00.00 <= fanout <= 100.00
 Meaning:  percentage of scanned records left after the table condition is
           applied.

That is, "rows" and "fanout" columns together are to be interpreted as
follows:
 * table access method will scan (and we assume return) #rows rows.
 * the returned rows will be filtered through the table condition(*) After
   the filter there will be (#rows * fanout/100) rows left.

Introduction of EXPLAIN options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have a growing number of columns/information that we may need to show in
EXPLAIN. On the other hand we need to keep EXPLAIN output compatible with 
previous versions and human-readable.

The solution to the above is to have a generic way to add optional parts into
EXPLAIN output. This will be achieved as follows:

We will introduce "explain options": a comma-separated list of option names. 
The order of options will not matter. Initially we'll have the following 
options:
  
  partitions - Show the "partitions" column. This will replace the EXPLAIN
               PARTITIONS syntax.

  fanout     - Show the "fanout" column. This is what is being introduced
               for BUG#14940.

  rewrites   - Produce a warning with query after the rewrites were made.
               This is what currently EXPLAIN EXTENDED does.

There will be two ways to specify explain options:

 1. Put them into per-session @@explain_opts system variable.
 2. Use "EXPLAIN WITH comma-separated-options-list SELECT" syntax. "WITH" is 
    added so there are fewer conflicts with "EXPLAIN tablename column" syntax
    and more freedom with option names.

Then

  "EXPLAIN SELECT ..." will use the @@explain_opts value.

  "EXPLAIN WITH opts SELECT ..." will use options specified with opts.

  "EXPLAIN EXTENDED SELECT ..." do the same as EXPLAIN but "rewrites" option
  will be assumed to be turned on.

The default value for @@explain_opts will be 'fanout'. Those who need EXPLAIN
output to be compatible with 5.0 will have to set @@explain_opts=''.

Next version of fix proposal, currently in discussion

Changes
~~~~~~~
Version 3: addressed input from Igor, PeterZ, PeterG, Timour:
 - EXPLAIN EXTENDED means "show all available information"
 - Renamed "fanout" to "out%", added explanation why percentage is preferred
   over "out_rows".
 - Added a note about feasible comma-separated form
     "EXPLAIN option1, option2 ..."

Version 2: addressed input from Igor, PeterZ, PeterG:
 - "rows" will show #rows scanned, not #rows returned.
 - the second column was renamed and now holds percentage instead of #rows.
 - EXPLAIN options are introduced.

Reasons for changing EXPLAIN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While working on BUG#14940 we've came to need for EXPLAIN to provide more
information: instead of a single value in the "rows" column we need two:

  1. number of rows that will be scanned by choosen access method
  2. number of rows that will be left after table condition is applied

Changes in EXPLAIN columns
~~~~~~~~~~~~~~~~~~~~~~~~~~
The "rows" column will be made to always contain #1 (ie. number of rows to
be scanned). Currently this is violated in some cases.

A new "out%" column will be introduced. The column will be displayed when
specified by EXPLAIN options (see below on what the options are).

 Place:    right after the "rows" column
 Name:     "out%"
 Domain:   float with 2 decimal digits, 00.00 <= out% <= 100.00
 Meaning:  percentage of scanned records left after the table condition is
           applied.

That is, "rows" and "out%" columns together are to be interpreted as
follows:
 * table access method will scan (and we assume return) #rows rows.
 * the returned rows will be filtered through the table condition (a
   condition that refers to current table and previous tables). After
   the filter there will be (#rows * 0.01*out%) rows left.

We will use percentage rather than "out_rows". This is because the intent is
to produce a selectivity for condition cond(t1, t2, ..., tK). For some row
of table tK, the value of condition may be true in one row combination, and
false in another, so the value of "out_rows" is not too meaningful.

Introduction of EXPLAIN options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have a growing number of columns/information that we may need to show in
EXPLAIN. On the other hand we need to keep EXPLAIN output compatible with
previous versions and human-readable.

The solution to the above is to have a generic way to add optional parts into
EXPLAIN output. This will be achieved as follows:

We will introduce "explain options": a comma-separated list of option names.
The order of options will not matter. Initially we'll have the following
options:

  partitions  - Show the "partitions" column. This will replace the EXPLAIN
                PARTITIONS syntax.

  out_fraction- Show the "out%" column. This is what is being introduced for
                BUG#14940.

  rewrites    - Produce a warning with query after the rewrites were made.
                This is what currently EXPLAIN EXTENDED does.

There will be two ways to specify explain options:

 1. Put them into per-session @@explain_options system variable.
 2. Use "EXPLAIN [WITH] comma-separated-options-list SELECT" syntax.
    One form, either with or without 'WITH' will be allowed. Whether WITH
    will be present remains to be determined.

Then

  "EXPLAIN SELECT ..." will use the @@explain_options value.

  "EXPLAIN {WITH} option[,...] SELECT ..." will use the specified options.

  "EXPLAIN EXTENDED SELECT ..." will display produce all possible information.

The default value for @@explain_options will be 'out_fraction'. Those who need
EXPLAIN output to be compatible with 5.0 will have to set @@explain_options=''

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9727

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9769

ChangeSet@1.2231, 2006-07-30 00:02:09+04:00, sergefp@mysql.com +1 -0
  BUG#14940: post-review fixes

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9795

ChangeSet@1.2233, 2006-07-30 18:25:57+04:00, sergefp@mysql.com +4 -0
  BUG#14940: Post-merge fixes

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9846

ChangeSet@1.2234, 2006-07-31 22:22:01+04:00, sergefp@mysql.com +1 -0
  BUG#14940: Post-merge fixes: update test results

- Make the range-et-al optimizer produce E(#table records after table 
                                           condition is applied),
- Make the join optimizer use this value,
- Add "filtered" column to EXPLAIN EXTENDED to show 
  fraction of records left after table condition is applied

Fixed in 5.1.12

Added explanation of the "filtered" column for EXPLAIN
EXTENDED to the EXPLAIN section in the 5.1 manual.

I confirm the bug is still present in latest 5.1, despite the fix being
there.

This was caused by last-minute changes in the patch. The fact that those
changes had such effect is an indication that we'll need to investigate more.

I confirm the bug is still present in latest 5.1, despite the fix being
there.

This was caused by last-minute changes in the patch. The fact that those
changes had such effect is an indication that we'll need to investigate more.

What happened with first bugfix
===============================
The original fix was made on assumption that the cause of the problem was
an incorrect value of E(#rows matching table condition) for `incidents` table.

1. The estimate (see above EXPLAINs)  = 53824
2. Actual #rows matching condition on index incidents$queue_id = 53958
3. #rows that match entire condition for `incidents` table = 42181

#2 was not a perfect estimate for #3 but it was reasonable.

Nevertheless, the problem of not using E(#rows that match the table
condition) existed (it was just not very dramatic in this scenario). The fix
addressed it and did this:

E(#matching records for `incidents` table) = 
  = #rows(`incidents`) * 
     PROD{over each index}( E(fraction of rows to be scanned by 
                              'range' access on that index) )

For `incidents` table the obtained value was:
E(#matching records for `incidents` table) = 
 =  241705 * (53824/241705) * 208750/241705 * 120101/241705 = 23098
 
Which is too optimistic, and even less correct than the initial value, but
it affected the choice of query plan so the better plan was actually chosen.

Then right before pushing we've changed PROD to MIN. That gives pessimistic and 
will avoid overly optimistic estimates for cases with several indexes
covering the same fields.

However, the s/PROD/MIN/ change had bad effect on this particular case: bad plan
got choosen again.

The outcome: we still think the fix is ok and should stay in, but for this
 particular bug we need another fix.

Attempt #2
==========

Look at above EXPLAINS and costs. The EXPLAIN for the slow plan has slightly
lower cost. The costs are: 

Slow plan:  {contacts, incidents, accounts, o__state}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
join->positions= {
  {records_read = 189344, read_time = 325.5859375, table = 0x9373eb8, key = 0x0}
  {records_read = 1, read_time = 189344, table = 0x9373bd8, key = 0x9362aa0}
  {records_read = 1, read_time = 1, table = 0x9373d48, key = 0x9362acc}
  {records_read = 1, read_time = 1, table = 0x9374028, key = 0x9362b24}
}

cost = 
#  count,  cost 
  1      * 325.5859 + // contacts,  189344 records out
  189344 * 1.0  +     // incidents  1 records out
  1      * 1.0  +     // accounts   1 records out  // count-is-small
  1      * 1.0        // o_state    1 records out  // count-is-small
= 189671.5859375

Fast plan, {incidents, o__state, accounts, contacts}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

join->positions= {
  {records_read = 53824, read_time = 64597.809999999998, table = 0x9373bd8, key = 0x0}
  {records_read = 1, read_time = 53824, table = 0x9374028, key = 0x9362b24}
  {records_read = 1, read_time = 53824, table = 0x9373d48, key = 0x9362acc}
  {records_read = 1, read_time = 53824, table = 0x9373eb8, key = 0x9362af8}
}

cost =
# count,  cost 
  1     * 64597.8099 +  // incidents
  53824 * 1.0 +         // o__state  // count-is-big
  53824 * 1.0 +         // accounts  // count-is-big
  53824 * 1.0           // contacts
= 226069.8098

Why is cost of fast plan estimated to be slower than the cost of slow plan? 

Two suspects:

Suspect #1
~~~~~~~~~~
Search the above plans for count-is-small and count-is-big. In Slow plan, we
assume that total cost (for entire join execution) of access to `accounts` 
and `o__state` tables will take 1 disk seek for each table. 

In Fast plan, we assume that the same table accesses will take 53824*1.0 disk
seeks (53824 is number of ref lookups to be performed, each lookup costs 1 
disk seek).

This is rather odd. The two query plans use the same ref access (here are
EXPLAIN output excepts:

SLOW:
----------+--------+------------------+-------------------------+-------
table     | type   | key              | ref                     | rows  
----------+--------+------------------+-------------------------+-------
contacts  | index  | contacts$c_id    | NULL                    | 189344 
incidents | ref    | incidents$c_id   | contacts.c_id           |      1 
accounts  | eq_ref | accounts$acct_id | incidents.assgn_acct_id |      1 SMAL
o__state  | eq_ref | orgs$org_id      | incidents.org_id        |      1 SMAL
----------+--------+------------------+-------------------------+-------

FAST: 
----------+--------+--------------------+-------------------------+------
table     | type   | key                | ref                     | rows 
----------+--------+--------------------+-------------------------+------
incidents | range  | incidents$queue_id | NULL                    | 53824 
o__state  | eq_ref | orgs$org_id        | incidents.org_id        |     1 BIG
accounts  | eq_ref | accounts$acct_id   | incidents.assgn_acct_id |     1 BIG
contacts  | eq_ref | contacts$c_id      | incidents.c_id          |     1
----------+--------+--------------------+-------------------------+------

) but the multiplier varies so much. 

<note>
After discussion with Igor we concluded the problem is in prev_record_reads()
function. We haven't yet found a proper way to fix it.
</note>

This difference is big enough. Really, 
1. If SLOW query plan calculations had the multipliers analogous to FAST
   query plan, its cost would be:

  1*325.5859 + 189344*1.0 + 189344*1.0 + 189344*1.0 = 568357.59
  
  i.e. it would have been more expensive than the FAST the plan.

2. Other way around: cost of FAST query plan would have been:

  1 * 64597.8099 + 1 * 1.0 + 1 * 1.0 + 53824 * 1.0 = 118423.80
 
  i.e. less expensive than the cost SLOW query plan.

Suspect #2
~~~~~~~~~~
In slow query plan, the cost of access table `contacts` is:

  {records_read = 189344, read_time = 325.5859375, table = 0x9373eb8, key = 0x0}

The EXPLAIN shows "index" access method to be used, and that is true.
However, the cost value is a cost of full table scan (which is lower then
cost of index scan, both in our model and in reality)

the cost of full index scan for `contacts` table would be (according to
calculations in get_index_only_read_time(ontacts$c_id, total_table_records)) = 3642.21

Then the total cost of slow query plan will be:
  
   1*3642.5859 + 189344*1.0 +  1*1.0 + 1*1.0= 192988.59

still less than the cost of the fast plan, so this won't make enough
difference.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12645

ChangeSet@1.2279, 2006-09-27 22:56:50+04:00, sergefp@mysql.com +4 -0
  BUG#14940: Slow join order is chosen: 
  - Re-worked the prev_record_reads() function to return the lower bound of
    number of different table access scans that will be performed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12835

ChangeSet@1.2279, 2006-09-29 15:58:47+04:00, sergefp@mysql.com +4 -0
  BUG#14940: Slow join order is chosen: [2nd commit with post-review fixes]
  - Re-worked the prev_record_reads() function to return the lower bound of
    number of different table access scans that will be performed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12856

ChangeSet@1.2334, 2006-09-29 20:42:37+04:00, sergefp@mysql.com +2 -0
  BUG#14940: Update test results

Pushed in 5.1.12

Noted in 5.1.12 changelog.

EXPLAIN EXTENDED now shows a filtered column that is an estimated
percentage of the examined rows that will be joined with the previous
tables. This was added while dealing with a problem of MySQL choosing
the wrong index for some queries.

Also see bug #41740.

Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug. Re-closing.

Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (pib:16)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Push resulted from incorporation of InnoDB tree. No changes pertinent to this bug.
Re-closing.

Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)

Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)