MySQL Bugs: #8113: Select with between optimization

Bug #8113	Select with between optimization
Submitted:	24 Jan 2005 15:43	Modified:	18 Aug 2009 10:09
Reporter:	Emmanuel van der Meulen	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.0, 5.1, azalea bzr	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I brought this point up on the list, once before but without success; here is my final attempt logging this as a feature request.

It concerns possibly optimizing selects when using BETWEEN.

Some background regarding a suggestion to optimize select when using between.

My query uses two tables. The query selects one row on table 'b' for each row on table 'a', and uses between in the select. 

It can only ever return one row from table 'b', for each row on table 'a', due to the contents that is stored in table 'b'. Table 'b' contains in excess of a million records. What happens as a result of the between is that for the query, several rows 'seem' to be candidates on table 'b', when only looking at the first column of the between, but once the query evaluates and sifts through the candidate rows on table 'b', only one row will ever match, once the second row of the between is evaluated.

As an experiment I took one example and used limit and the query reduced from 4 secs to .01 sec. 

It seems that with BETWEEN every possible row that qualifies in the first column is 'extracted' as candidates before the second column which is part of BETWEEN is evaluated.

My suggestion is that when BETWEEN is used both columns be evaluated simultaneously.

Both table 'b' columns used in the BETWEEN are indexed separately and together, thus indexes exist for 'b1' and 'b2' and 'b1b2'; same data types and lengths; cache is large enough.

Table 'a' contains a property which does not match any property on table 'b' directly, but matches within a range.

Example; 

table 'a' property 'num' 

num 
--- 
1000 
2000 
3000 

table 'b' properties 'fromNum' and 'toNum' 
            fromNum  toNum 
row 1.    0001       1500 
row 2.    1501       2000 
row 3.    2001       4000 

select... where 'num' between 'fromNum' and 'toNum'. 

As seen in this example, using between, only one row actually matches, however, the query engine seemingly would at first see more than one matching row i.e., with 'fromNum' (between is same as num >= fromNum), value 2000 from table 'a' would match row 1 & 2 on table 'b'; and with 'toNum' (between is same as num <= toNum), value 2000 from table 'a' would match row 2 & 3 on table 'b'; only once the range is taken together the result matches one row, viz., row 2 on table 'b'. 

Now please remember table 'b' has 1.4 million rows as in this example, with 'fromNum' and 'toNum' running consecutively, so the query takes 4 seconds to find the correct row in table 'b'. With say 200 rows in table 'a', that means the query runs for a long time relatively speaking.

What I did was to use one row in table 'a' and using limit 1, and ran the query;  this took .01 seconds. This example was merely an attempt to show how quickly it would be if with between both columns are evaluated simultaneously.

How to repeat:
Example; 

table 'a' property 'num' 

num 
--- 
1000 
2000 
3000 

table 'b' properties 'fromNum' and 'toNum' 
            fromNum  toNum 
row 1.    0001       1500 
row 2.    1501       2000 
row 3.    2001       4000 

select... where 'num' between 'fromNum' and 'toNum'. 

Suggested fix:
My suggestion is that when BETWEEN is used both columns be evaluated simultaneously.

Thank you for the report.

Please provide repeatable test case: CREATE TABLE for every table used, exact query you use. Also please check newer versions 5.0.84 and 5.1.37 as a lot of improvement in optimizer were made.

Thank you Sveta Smirnova,

>Please provide repeatable test case: CREATE TABLE for every table used
The tables are as simple as in the original examples. No need for more.

>exact query you use.
The query is as simple as in the original example. No need for more.

>Also please check newer versions 5.0.84 and 5.1.37 as a lot of improvement in optimizer were made.
Can't say for sure, haven't used these versions but to reproduce this on your side would be pretty easy and straight forward.

NB1. Once the issue I describe is understood--or as I like saying, once you get it--it would be easy to reproduce.
NB2. If you do not have large volumes the performance won't be noticeable.
NB3. This is a simple issue of the between algorithm which shouldn't be difficult to reproduce and to confirm that an improvement is or isn't possible.

Kind regards
Emmanuel

Thank you for the feedback.

But in the example provided:

----
table 'a' property 'num' 

num 
--- 
1000 
2000 
3000 

table 'b' properties 'fromNum' and 'toNum' 
            fromNum  toNum 
row 1.    0001       1500 
row 2.    1501       2000 
row 3.    2001       4000 
---

I can guess table 'a' contains single field and 3 rows and table 'b' contains 2 fields and 3 rows. Also I can guess all fields are of INT type.

But

> select... where 'num' between 'fromNum' and 'toNum'. 

Here I need WHERE clause, because it is not clear how you join these 2 tables. Please provide.

Thank you Sveta Smirnova,

> Here I need WHERE clause, because it is not clear how you join these 2 tables. Please provide.

No join.

The requirement:
- num is an ipAddress. 
- fromNum and toNum are ipAddress range blocks
- find in which ipAddress range block num existed (once found obtain required properties from 'b' - as I recall the info required from 'b' was country)

The between would have been a great way to do it. But processing took too long so I used 'limit 1'. This was after working for days to determine why the query was so slow.

Eventually used:
WHERE toNum >= num limit 1

Would have liked to use:
WHERE num between fromNum and toNum

Does this help?

Emmanuel

Thank you for the feedback.

Please provide output of SHOW CREATE TABLE a, SHOW CREATE TABLE b and full SELECT statement.

Test case showing everything works fast in my environment:

mysql> create temporary table a(num int);
Query OK, 0 rows affected (0.07 sec)

mysql> insert into a values(1000),(2000),(3000);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> create temporary table b(fromNum int, toNum int);
Query OK, 0 rows affected (0.10 sec)

mysql> insert into b values(0001, 1500),(1501, 2000),(2001, 4000);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select * from a, b where 'num' between 'fromNum' and 'toNum'; 
+------+---------+-------+
| num  | fromNum | toNum |
+------+---------+-------+
| 1000 |       1 |  1500 | 
| 2000 |       1 |  1500 | 
| 3000 |       1 |  1500 | 
| 1000 |    1501 |  2000 | 
| 2000 |    1501 |  2000 | 
| 3000 |    1501 |  2000 | 
| 1000 |    2001 |  4000 | 
| 2000 |    2001 |  4000 | 
| 3000 |    2001 |  4000 | 
+------+---------+-------+
9 rows in set (0.00 sec)

The moment table 'b' has say 500,000 rows, the query will drag its feet. If I recall it was around 4 seconds.

Also 'b' didn't have duplicates. ipAddress ranges cannot be duplicated.
Also toNum and fromNum do not overlap. ipAddress ranges cannot overlap.

Therefore, if the query realized the table consists of non-duplicates and non-overlapping ranges--thus the columns are distinct and the ranges are distinct, then it does not need to traverse the full table.

Thereby shortening the query time.

Thank you for the feedback.

MySQL did a lot of changes in optimizer code. This problem can be solved already. But to find what is wrong or correct in your case we need repeatable test case including CREATE TABLE definition. This is why I ask this information. You can add this privately as hidden comment if you don't want other people to see field names.

Thank you Sveta Smirnova,

I feel if you 'get' what I'm saying it's real easy to reproduce.

Below is further assistance to reproduce the slow query.

Here is the query--in the WHERE I originally used BETWEEN then the same query runs 4 secs--as it now stands it's much faster:
(here is a note I added in my Java program: with high volume; 'limit' over 'between' reduces query time from 4 secs to .01 sec:)

SELECT IPRange.ipRangeId, IPRange.addedOn, IPRange.ipFrom, IPRange.ipTo, IPRange.ipRangeIP2CityLink, 
IP2City.ip2CityId, IP2City.addedOn, IP2City.ipRangeIP2CityLink, IP2City.countryShort, IP2City.regionId, IP2City.cityId, 
City.cityId, City.addedOn, City.countryShort, City.regionShort, City.cityLong, 
Region.regionId, Region.addedOn, Region.countryShort, Region.regionShort, Region.regionLong,
locality.Country.countryId, locality.Country.addedOn, locality.Country.countryShort, locality.Country.countryLong
FROM IPRange
LEFT JOIN IP2City ON IPRange.ipRangeIP2CityLink = IP2City.ipRangeIP2CityLink
LEFT JOIN locality.Country ON IP2City.countryShort = locality.Country.countryShort
LEFT JOIN Region ON IP2City.regionId = Region.regionId
LEFT JOIN City ON IP2City.cityId = City.cityId
WHERE ipTo >= 2057920169 limit 1

Here is the link to the particular database backup/restore scripts (zipped);
https://rcpt.yousendit.com/726269967/1ac248f7cb7e579d6e52c5b767cd1a7a

Kind regards
Emmanuel

Thank you for the feedback.

I have similar results with both BETWEEN and original queries with current 5.0 version in my environment. Explain shows same result as well.

Please see file attached with results.

Please try current version 5.0.85 in your environment and if problem still exists provide output of SELECT queries with time and EXPLAIN like I did.

results of queries in viersion 5.0.85

Attachment: bug8113.txt (text/plain), 9.87 KiB.

Thank you, Sveta Smirnova,

Okay, but wait, please run the between without the 'limit 1'.

Then you'll see the query is 4 seconds.

By adding limit 1 to between, you're achieving the same as the >= query with limit 1 and that's not what I'm asking for.

I feel the SQL engine should be clever enough without the limit 1.

Kind regards
Emmanuel

Thank you for the feedback.

I finally got your point: you want query which uses BETWEEN condition which selects single row works as fast as with LIMIT 1.

Feature request verified as described, although this is repeatable only with ~2000000 rows with tables from your database. If I cut a table to 1000000 rows there is no performance issue. Also FORCE INDEX doesn't make query any faster. Also adding [unique] index on both columns doesn't make query any faster as well.

So feature request should sound like "implement some smart algorithm for such cases".

How to repeat.

1. Setup table using following PHP script:

<?php
$db = new mysqli('127.0.0.1', 'root', '', 'test', 3360);
$db->query("CREATE TABLE IF NOT EXISTS `bug8113` (
  `ipFrom` bigint(20) NOT NULL DEFAULT '0',
  `ipTo` bigint(20) NOT NULL DEFAULT '0',
  UNIQUE KEY `ipFrom` (`ipFrom`),
  UNIQUE KEY `ipTo` (`ipTo`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1");

$j = 0;

for ($i = 0; $i < 4000000; $i ++) {
	$db->query('insert into bug8113 values(' . ($j + 1) . ', ' . ($j += 100) . ')');
}
?>

2. From mysql client:

mysql> explain select * from bug8113 where 205792019 between ipFrom and ipTo ;
+----+-------------+---------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+---------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | bug8113 | ALL  | ipFrom,ipTo   | NULL | NULL    | NULL | 4000000 | Using where |
+----+-------------+---------+------+---------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)

mysql> explain select * from bug8113 where 205792019 between ipFrom and ipTo limit 1;
+----+-------------+---------+-------+---------------+------+---------+------+---------+------------------------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref  | rows    | Extra                  |
+----+-------------+---------+-------+---------------+------+---------+------+---------+------------------------+
|  1 | SIMPLE      | bug8113 | range | ipFrom,ipTo   | ipTo | 8       | NULL | 1940783 | Using where; Using MRR |
+----+-------------+---------+-------+---------------+------+---------+------+---------+------------------------+
1 row in set (0.01 sec)

mysql>  select * from bug8113 where 205792019 between ipFrom and ipTo ;
+-----------+-----------+
| ipFrom    | ipTo      |
+-----------+-----------+
| 205792001 | 205792100 |
+-----------+-----------+
1 row in set (3.74 sec)

mysql>  select * from bug8113 where 205792019 between ipFrom and ipTo limit 1;
+-----------+-----------+
| ipFrom    | ipTo      |
+-----------+-----------+
| 205792001 | 205792100 |
+-----------+-----------+
1 row in set (0.17 sec)

Performance degradation is noticeable, altough it is easy to guess there would be single row in result set, because both indexes are unique.

Sveta Smirnova, thank you for your patience and perseverance.

Kind regards
Emmanuel

How "there would be single row in result set" follows from "both indexes are unique"?

create table t1 (a int, b int, unique(a), unique(b));
insert t1 (1,100),(2,101),(3,102);
select * from t1 where 10 between a and b;

Sergei,

I believe in time when I wrote this comment I was thinking about feature original reporter asks for: if dataset is such what row is single, run fast.

Assume we have a query with condition WHERE x BETWEEN idx1 AND idx2

I'd search for values in idx1 which are nearer to x, then do same for idx2. Now we have idx1.min, idx1.max, idx2.min, idx2.max, so we can get min = min(idx1.min, idx2.min) and max=max(idx1.max,idx2.max), then get query like WHERE idx1 BETWEEN min AND max AND idx2 BETWEEN min AND max

I am running into this issue as well.  I was able to mitigate it by using a STRAIGHT_JOIN.

CREATE TABLE `t1` (
   `num` int(10) unsigned NOT NULL,
   PRIMARY KEY (`num`) USING BTREE
 ) ENGINE=InnoDB DEFAULT CHARSET=latin1
 
 CREATE TABLE `t2` (
    `numLow` int(10) unsigned NOT NULL,
    `numHigh` int(10) unsigned NOT NULL,
    PRIMARY KEY (`numHigh`,`numLow`) USING BTREE,
    UNIQUE KEY `unique` (`numLow`,`numHigh`) USING BTREE
 ) ENGINE=InnoDB DEFAULT CHARSET=latin1 MIN_ROWS=9000000 PACK_KEYS=1 ROW_FORMAT=FIXED
 
explain SELECT count(*) FROM t2 INNER JOIN t1 ON t1.num BETWEEN t2.numLow AND t2.numHigh;
# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
'1', 'SIMPLE', 't1', 'index', 'PRIMARY', 'PRIMARY', '4', NULL, '50015', 'Using index'
'1', 'SIMPLE', 't2', 'ALL', 'PRIMARY,unique', NULL, NULL, NULL, '9597177', 'Range checked for each record (index map: 0x3)'

explain SELECT count(*) FROM t2 STRAIGHT_JOIN t1 ON t1.num BETWEEN t2.numLow AND t2.numHigh;
+------+-------------+-------+-------+----------------+---------+---------+------+---------+------------------------------------------------+
| id   | select_type | table | type  | possible_keys  | key     | key_len | ref  | rows    | Extra                                          |
+------+-------------+-------+-------+----------------+---------+---------+------+---------+------------------------------------------------+
|    1 | SIMPLE      | t2    | index | PRIMARY,unique | PRIMARY | 8       | NULL | 9597177 | Using index                                    |
|    1 | SIMPLE      | t1    | ALL   | PRIMARY        | NULL    | NULL    | NULL |   50015 | Range checked for each record (index map: 0x1) |
+------+-------------+-------+-------+----------------+---------+---------+------+---------+------------------------------------------------+
2 rows in set (0.00 sec)

The STRAIGHT_JOIN takes about 100s.  The inner join never completes.  

Postgres was able to do the same query in under 12 seconds.

Here is the postgres query plan:

"Aggregate  (cost=1771305342.61..1771305342.62 rows=1 width=0)"
"  ->  Nested Loop  (cost=0.29..1635274519.70 rows=54412329164 width=0)"
"        ->  Seq Scan on t2  (cost=0.00..148096.97 rows=9613297 width=16)"
"        ->  Index Only Scan using t1_pkey on t1  (cost=0.29..113.49 rows=5660 width=8)"
"              Index Cond: ((num >= t2.numlow) AND (num <= t2.numhigh))"

Seems to be solve in (or before) 8.0.3:

mysql> explain select * from bug8113 where 205792019 between ipFrom and ipTo \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bug8113
   partitions: NULL
         type: range
possible_keys: ipFrom,ipTo
          key: ipTo
      key_len: 8
          ref: NULL
         rows: 7
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

mysql> explain select * from bug8113 where 205792019 between ipFrom and ipTo limit 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bug8113
   partitions: NULL
         type: range
possible_keys: ipFrom,ipTo
          key: ipTo
      key_len: 8
          ref: NULL
         rows: 7
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Posted by developer:
 
I'm still seeing the old plan when I run Sveta's repro with 4 million rows on head of trunk:

mysql> explain select * from bug8113 where 205792019 between ipFrom and ipTo\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bug8113
   partitions: NULL
         type: ALL
possible_keys: ipFrom,ipTo
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4000000
     filtered: 24.98
        Extra: Using where
1 row in set, 1 warning (0,00 sec)