MySQL Bugs: #38049: incorrect rows estimations with references from preceding table

Bug #38049	incorrect rows estimations with references from preceding table
Submitted:	11 Jul 2008 14:50	Modified:	16 Nov 2010 3:52
Reporter:	Ruslan Zakirov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0.60, 5.0.68-bzr, 5.1.28-bzr	OS:	Any
Assigned to:	Guilhem Bichot	CPU Architecture:	Any
Tags:	INDEX, multiple-column index, preceding table, ref, rows

Description:
For the following class of queries "WHERE ... t.col1 = 'x' and t.col2 = preceding_table.some_col ..." mysql make wrong estimation of rows to be fetched from table 't' for multiple-column indexes.

For single-column index (col1) on t estimation is pretty correct, for example it's X. For multiple-column index (col1, col2) on t estimation is wrong, for example it's Y. The most important thing is that Y is greater than X when it's known that Y must be equal or lesser than X.

How to repeat:
In the attachment you'll find perl scripts to create required data, run a test and a simple benchmark.

Suggested fix:
Rows estimation for ref access with (const, table.col) must be always lesser than estimation for ref access with (const) for the same index. This is minimum requirement.

scripts to reproduce the problem

Attachment: ref_scan_with_preceding_table.tar.gz (application/gzip, text), 1.11 KiB.

Many thanks for reporting a bug.

I did your test but I can't see what exactly you meant.

Where do you think is a bug?

Hi

Oops, found that create_data.pl creates wrong index. Anyway results are pretty close, I'll attach new create_data.pl.

Sorry, may be my description was cryptic. Here is partial output from test.pl:

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX(groups_d)
        WHERE g.domain = 'queue' AND g.type = a.type
$VAR1 = {
          'Extra' => 'Using where',
          'key_len' => '35',
          'ref' => 'const',
          'table' => 'g',
          'rows' => '11',
          'key' => 'groups_d',
          'select_type' => 'SIMPLE',
          'possible_keys' => 'groups_d',
          'id' => '1',
          'type' => 'ref'
        };
EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX(groups_dt)
        WHERE g.domain = 'queue' AND g.type = a.type
$VAR1 = {
          'Extra' => 'Using where',
          'key_len' => '70',
          'ref' => 'const,test.a.type',
          'table' => 'g',
          'rows' => '335',
          'key' => 'groups_dt',
          'select_type' => 'SIMPLE',
          'possible_keys' => 'groups_dt',
          'id' => '1',
          'type' => 'ref'
        };

These are second rows from EXPLAINs when mysql forced to use different indexes for the same query. Condition "g.domain = 'queue' AND g.type = a.type" couldn't return more rows than "g.domain = 'queue'" when mysql's rows estimator says different, so server chooses shorter index.

That is major problem my bug report is about.

However, the following results are interesting for comparison as well:

EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX(groups_dt)
        WHERE g.domain = 'queue' AND g.type = a.type
$VAR1 = {
          'Extra' => 'Using where',
          'key_len' => '70',
          'ref' => 'const,test.a.type',
          'table' => 'g',
          'rows' => '335',
          'key' => 'groups_dt',
          'select_type' => 'SIMPLE',
          'possible_keys' => 'groups_dt',
          'id' => '1',
          'type' => 'ref'
        };
EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
        FROM acl a, groups g USE INDEX(groups_td)
        WHERE g.domain = 'queue' AND g.type = a.type
$VAR1 = {
          'Extra' => 'Using where',
          'key_len' => '70',
          'ref' => 'test.a.type,const',
          'table' => 'g',
          'rows' => '335',
          'key' => 'groups_td',
          'select_type' => 'SIMPLE',
          'possible_keys' => 'groups_td',
          'id' => '1',
          'type' => 'ref'
        };

This is cool that estimated rows for both are equal, but as I said above values are bigger than 11 (estimations for index "groups_d") what is wrong.

updated create_data.pl

Attachment: create_data.pl (text/x-perl), 1.26 KiB.

Verified just as described with 5.0.68:

openxs@suse:/home2/openxs/dbs/5.0> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.68-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show create table groups\G
*************************** 1. row ***************************
       Table: groups
Create Table: CREATE TABLE `groups` (
  `id` int(11) NOT NULL,
  `domain` varchar(32) default NULL,
  `type` varchar(32) default NULL,
  PRIMARY KEY  (`id`),
  KEY `groups_d` (`domain`),
  KEY `groups_t` (`type`),
  KEY `groups_dt` (`domain`,`type`),
  KEY `groups_td` (`type`,`domain`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> analyze table groups;
+-------------+---------+----------+-----------------------------+
| Table       | Op      | Msg_type | Msg_text                    |
+-------------+---------+----------+-----------------------------+
| test.groups | analyze | status   | Table is already up to date |
+-------------+---------+----------+-----------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id
    ->         FROM acl a, groups g USE INDEX(groups_dt)
    ->         WHERE g.domain = 'queue' AND g.type = a.type\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2
        Extra: Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: ref
possible_keys: groups_dt
          key: groups_dt
      key_len: 70
          ref: const,test.a.type
         rows: 335
        Extra: Using where
2 rows in set (0.03 sec)

mysql> EXPLAIN SELECT STRAIGHT_JOIN DISTINCT g.id FROM acl a, groups g USE INDEX(groups_td) WHERE g.domain = 'queue' AND g.type = a.type\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2
        Extra: Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: ref
possible_keys: groups_td
          key: groups_td
      key_len: 70
          ref: test.a.type,const
         rows: 335
        Extra: Using where
2 rows in set (0.00 sec)

Note the number of rows with domain='queue' though:

mysql> select count(*) from groups where domain='queue';
+----------+
| count(*) |
+----------+
|       12 |
+----------+
1 row in set (0.01 sec)

And note this estimation:

mysql> explain select count(*) from groups where domain='queue'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: groups
         type: ref
possible_keys: groups_d,groups_dt
          key: groups_d
      key_len: 35
          ref: const
         rows: 11
        Extra: Using where; Using index
1 row in set (0.00 sec)

and this one, when same groups_dt index is used:

mysql> explain select count(*) from groups force index(groups_dt) where domain='queue'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: groups
         type: ref
possible_keys: groups_dt
          key: groups_dt
      key_len: 35
          ref: const
         rows: 13
        Extra: Using where; Using index
1 row in set (0.00 sec)

The posted queries are a manifestation of two problems:

Problem1: ReuseRangeEstimateForRef bug
--------------------------------

mysql> explain SELECT STRAIGHT_JOIN g.id FROM acl a, groups g USE INDEX(groups_td) WHERE g.domain = 'queue' AND g.type = a.type;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: ref
possible_keys: groups_dt
          key: groups_dt
      key_len: 70
          ref: const,j1.a.type
         rows: 335
        Extra: Using where

For this query:
 - Range optimizer obtains that records_in_range(groups_dt, g.domain = 'queue') = 13.  
 - Index statistics tell that rec_per_key[g.domain = const1 AND g.type = const2] = 335, which apparently a contradiction
 - ReuseRangeEstimateForRef-2 code is supposed to handle this but it does not.

Problem2: Picking ref access over a shorter index
-------------------------------------------------

Demonstrated by this query:

mysql> explain SELECT STRAIGHT_JOIN g.id FROM acl a, groups g  WHERE g.domain = 'queue' AND g.type = a.type;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: g
         type: ref
possible_keys: groups_d,groups_t,groups_dt,groups_td
          key: groups_d
      key_len: 35
          ref: const
         rows: 11
        Extra: Using where

This uses index "groups_d" while there are more restrictive indexes groups_dt and groups_td.

Using a longer index can be more expensive because we'll need to walk through more index pages if we need to scan the same number of index entries. 

On the other hand, additional condition may limit the number of index entires and table records that we will need to scan.

One may argue that using more restrictions and a longer index almost always pays off.

Problem3: ReuseRangeEstimateForRef doesn't do complete job
----------------------------------------------------------
Tracing the query gets you this:
check_quick_select(): range access over g.domain = 'queue':
  groups_d:  11 rows
  groups_dt: 13 rows

best_access_path():
  index groups_d
    records = 11 (ReuseRangeEstmateForRef-1)

  index groups_t
    records = 1004 (from index stats)

  index groups_dt 
    records = 13 (ReuseRangeEstmateForRef-2)

  index groups_td
    records = 335 (from index statistics)

Here we see that 
1. MyISAM returns different estimates for the same range for different
   indexes. ReuseRangeEstmateForRef code propagates the difference to ref
   access estimates.

2. We use a highly-incorrect estimate for index groups_td. This probably
   doesn't matter much for this particular case because we have an equally
   good index grooups_dt which is processed by ReuseRangeEstmateForRef

During IRC discussion with Ruslan, noted the following approaches to resolve problems in 2-3:

1. Make ReuseRangeEstimateFoRef operate across indexes (i.e. record  and 
    estimates basing on column restrictions). One big question here will be
    whether we should consider restrictions on different columns independent
    or not.
  
2. Whenever ReuseRangeEstimateForRef re-uses an estimate for a prefix of the 
     used ref access, assume that restrictions on uncovered key parts have some 
    selectivity. The only numbers we could use to calculate the additional selectivity are 
    index cardinality values.  One might argue though, that ReuseRangeEstimateForRef
    makes difference in cases when index cardinality statistics is hugely skewed, so 
    we should not use it?

2. Introduce a heuristics/cost adjustment to use longer indexes

3. Make the range optimizer to cross-check estimates it has got from
   different indexes for the same ranges and produce a consistent estimates.
   This seems to be quite complicated.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/54002

2686 Sergey Petrunia	2008-09-12
      BUG#38049: incorrect rows estimations with references from preceding table
      - Fix condition in ReuseRangeEstimateForRef-3. The intended meaning of the changed
        part is: "And ref access candidate has key=const restrictions for every keypart that 
        was used by range access".

This the fix that was developed during irc discussion. It only fixes  Problem#1. The rest will be branched off to separate bug entries.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64080

2814 Sergey Petrunia	2009-01-26
      BUG#38049: incorrect rows estimations with references from preceding table
      - Fix condition in ReuseRangeEstimateForRef-3. The intended meaning of the changed
        part is: "And ref access candidate has key=const restrictions for every keypart
        that was used by range access".

Pushed into 6.0.10-alpha (revid:sergefp@mysql.com-20090202090240-dlkxhmc1asrar5rl) (version source revid:sergefp@mysql.com-20090126194259-ue20il3qro529l4d) (merge vers: 6.0.10-alpha) (pib:6)

I ported the fix to the next-mr-bugfixing tree

Somehow the system removed the fact that Sergey P did the fix, and that Igor reviewed it, and I cannot correct this.
Looks like the bug was never documented.
Fix is in 6.0 and queued to next-mr-bugfixing.
Changelog: "For certain SELECTs using 'ref' access, MySQL estimated a wrong number of rows, this could lead to inefficient query plans".

Pushed into 6.0.14-alpha (revid:alik@sun.com-20091031081410-qkxmjsdzjmj840aq) (version source revid:guilhem@mysql.com-20091019192733-v4nfabu8milak2ql) (merge vers: 6.0.14-alpha) (pib:13)

Noted in 6.0.14 changelog.

For certain SELECT statements using ref access, MySQL estimated an
incorrect number of rows, which could lead to inefficient query 
plans.  

Setting report to NDI pending push to 5.5.x.

Pushed into 5.5.0-beta (revid:alik@sun.com-20091110093229-0bh5hix780cyeicl) (version source revid:mikael@mysql.com-20091102100915-a2nbfxaqprpgptfw) (merge vers: 5.5.0-beta) (pib:13)

Noted in 5.5.0 changelog.

Pushed into mysql-next-mr (revid:alik@sun.com-20100816062819-bluwgdq8q4xysmlg) (version source revid:alik@sun.com-20100816062612-enatdwnv809iw3s9) (pib:20)

Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)