Bug #85126 Delete by range in presence of partitioning and no PK always picks wrong index
Submitted: 22 Feb 10:42 Modified: 28 Feb 7:15
Reporter: Riccardo Pizzi Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S1 (Critical)
Version:5.6.35 OS:Any
Assigned to:
Tags: partitioning

[22 Feb 10:42] Riccardo Pizzi
Description:
Please consider the following table:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID` (`HASH_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p652 VALUES LESS THAN (736739) ENGINE = InnoDB,
 PARTITION p653 VALUES LESS THAN (736740) ENGINE = InnoDB,
 PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 .....

And the following DELETE:

DELETE FROM testdb1.BIG_STORAGE_INNO WHERE HASH_ID = '6fpsrcwcccjigdek2851400rxuafibyepqny' AND EXPIRE_DATE > NOW() ;

We would expect the optimizer to use the HASH_ID index and do partition pruning using EXPIRE_DATE.
Instead, the optimizer always uses the EXPIRE_DATE index, causing a locking full table scan every time.

It does not happen in absence of partitioning, so seems a partitioning related bug.

This can be tested by filling a table with couple million random rows (see below), where HASH_ID is unique.  Cardinality from statistics is optimal, still the optimizer wants to use the date.

db11>explain DELETE FROM testdb1.BIG_STORAGE_INNO WHERE HASH_ID = '6fpsrcwcccjigdek2851400rxuafibyepqny' AND EXPIRE_DATE > NOW() ; +----+-------------+------------------+-------+------------------------+----------------+---------+-------+------+-------------+
| id | select_type | table            | type  | possible_keys          | key            | key_len | ref   | rows | Extra       |
+----+-------------+------------------+-------+------------------------+----------------+---------+-------+------+-------------+
|  1 | SIMPLE      | BIG_STORAGE_INNO | range | EXPIRE_DATE_IX,HASH_ID | EXPIRE_DATE_IX | 5       | const |    3 | Using where |
+----+-------------+------------------+-------+------------------------+----------------+---------+-------+------+-------------+
1 row in set (0.02 sec)

db11>select INDEX_NAME, CARDINALITY  FROM information_schema.STATISTICS s where s.table_schema =  'testdb1' AND s.table_name = 'BIG_STORAGE_INNO';
+----------------+-------------+
| INDEX_NAME     | CARDINALITY |
+----------------+-------------+
| EXPIRE_DATE_IX |           8 |
| HASH_ID        |     2657212 |
+----------------+-------------+
2 rows in set (0.00 sec)

How to repeat:

CREATE TABLE `BIG_STORAGE_INNO` (
  `HASH_ID` char(64) NOT NULL,
  `SERIALIZATION_TYPE` enum('JAVA','PROTOSTUFF') NOT NULL,
  `COMPRESSION_TYPE` enum('DEFLATE','LZ4','NONE','GZIP') NOT NULL,
  `RAW_DATA` mediumblob NOT NULL,
  `LAST_UPDATE` datetime NOT NULL,
  `EXPIRE_DATE` datetime NOT NULL,
  KEY `EXPIRE_DATE_IX` (`EXPIRE_DATE`),
  KEY `HASH_ID` (`HASH_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (TO_DAYS(EXPIRE_DATE))
(PARTITION p652 VALUES LESS THAN (736739) ENGINE = InnoDB,
 PARTITION p653 VALUES LESS THAN (736740) ENGINE = InnoDB,
 PARTITION p654 VALUES LESS THAN (736741) ENGINE = InnoDB,
 PARTITION p655 VALUES LESS THAN (736742) ENGINE = InnoDB,
 PARTITION p656 VALUES LESS THAN (736743) ENGINE = InnoDB,
 PARTITION p657 VALUES LESS THAN (736744) ENGINE = InnoDB,
 PARTITION p658 VALUES LESS THAN (736745) ENGINE = InnoDB,
 PARTITION p659 VALUES LESS THAN (736746) ENGINE = InnoDB,
 PARTITION p660 VALUES LESS THAN (736747) ENGINE = InnoDB,
 PARTITION p661 VALUES LESS THAN (736748) ENGINE = InnoDB,
 PARTITION p662 VALUES LESS THAN (736749) ENGINE = InnoDB,
 PARTITION p663 VALUES LESS THAN (736750) ENGINE = InnoDB,
 PARTITION p664 VALUES LESS THAN (736751) ENGINE = InnoDB,
 PARTITION p665 VALUES LESS THAN (736752) ENGINE = InnoDB,
 PARTITION p666 VALUES LESS THAN (736753) ENGINE = InnoDB,
 PARTITION p667 VALUES LESS THAN (736754) ENGINE = InnoDB,
 PARTITION p668 VALUES LESS THAN (736755) ENGINE = InnoDB,
 PARTITION p669 VALUES LESS THAN (736756) ENGINE = InnoDB,
 PARTITION p670 VALUES LESS THAN (736757) ENGINE = InnoDB,
 PARTITION p671 VALUES LESS THAN (736758) ENGINE = InnoDB,
 PARTITION p672 VALUES LESS THAN (736759) ENGINE = InnoDB,
 PARTITION p673 VALUES LESS THAN (736760) ENGINE = InnoDB,
 PARTITION p674 VALUES LESS THAN (736761) ENGINE = InnoDB,
 PARTITION p675 VALUES LESS THAN (736762) ENGINE = InnoDB,
 PARTITION p676 VALUES LESS THAN (736763) ENGINE = InnoDB,
 PARTITION p677 VALUES LESS THAN (736764) ENGINE = InnoDB,
 PARTITION p678 VALUES LESS THAN (736765) ENGINE = InnoDB,
 PARTITION p679 VALUES LESS THAN (736766) ENGINE = InnoDB,
 PARTITION p680 VALUES LESS THAN (736767) ENGINE = InnoDB,
 PARTITION p681 VALUES LESS THAN (736768) ENGINE = InnoDB,
 PARTITION p682 VALUES LESS THAN (736769) ENGINE = InnoDB,
 PARTITION p683 VALUES LESS THAN (736770) ENGINE = InnoDB) */

Use sysbench and insert approx 2M rows using following lua code (this creates a variable payload between 0 and 16k and also scatters rows across ten partitions):

function event(thread_id)
   local t1s1 = sb_rand_str("#@@@@@@@@@@@@@@")
   local t1s2 = sb_rand_str("@######")
   local t1s3 = sb_rand_str("#@@@@@@@@@@@@@")
      db_query("INSERT INTO testdb1.BIG_STORAGE_INNO  VALUES(SHA2('" .. t1s1 .. t1s2 .. t1s3 .. "', 256), 'PROTOSTUFF', 'GZIP', REPEAT(CHAR(FLOOR(RAND()* 96)+32), FLOOR(RAND()*16384)), NOW(), CONCAT('2017-02-', FLOOR(RAND()*10)+22, ' 23:59:59'))")
end

Then run or explain following DELETE:

DELETE FROM testdb1.BIG_STORAGE_INNO WHERE HASH_ID = '6fpsrcwcccjigdek2851400rxuafibyepqny' AND EXPIRE_DATE > NOW() ;
[22 Feb 11:09] Riccardo Pizzi
This will only happen when issuing a DML (DELETE or UPDATE), while SELECT will always use the right index
[22 Feb 13:10] Riccardo Pizzi
Further testing seems to indicate that the problem happens when the first partition of the table, which contains zero datetime values ('0000-00-00 00:00:00'), becomes the bigger partition in town.

This seem to trigger the bug (analyze table may be required after load to start it). 
Probably due to fact that statistics are only computed on largest partition.
See also https://bugs.mysql.com/bug.php?id=44059.

Actually, it seems to happen if/when  the largest partition is outside the pruning range, no matter the date value (tried with 2000-01-01 and wrong index is still used).
[22 Feb 14:22] Riccardo Pizzi
Further demonstration from one of our production machines.
Here, there is a partition that contains rows with EXPIRE_DATE < NOW(), this is the largest partition in the table, it is called p358.   If we run explain without including  this partition (first example), the correct index is used. If we include this partition (second example), it uses the wrong one.

>explain partitions DELETE FROM BIG_STORAGE_00  partition (p359,p360,p361,p362,p363,p364,p365,p366,p367,p368,p369,p370,p371,p372,p373,p374,p375,p376,p377,p378,p379,p380,p381,p382,p383,p384,p385,p386,p387,p388,p389,p390) WHERE HASH_ID = SHA2('6fpsrcwcccjigdek2851400rxuafibyepqny', 256) AND EXPIRE_DATE >  NOW() ;
+----+-------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+---------+---------+-------+------+-------------+
| id | select_type | table          | partitions                                                                                                                                                      | type  | possible_keys          | key     | key_len | ref   | rows | Extra       |
+----+-------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | BIG_STORAGE_00 | p359,p360,p361,p362,p363,p364,p365,p366,p367,p368,p369,p370,p371,p372,p373,p374,p375,p376,p377,p378,p379,p380,p381,p382,p383,p384,p385,p386,p387,p388,p389,p390 | range | EXPIRE_DATE_IX,HASH_ID | HASH_ID | 64      | const |    1 | Using where |
+----+-------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

>explain partitions DELETE FROM BIG_STORAGE_00  partition (p358,p359,p360,p361,p362,p363,p364,p365,p366,p367,p368,p369,p370,p371,p372,p373,p374,p375,p376,p377,p378,p379,p380,p381,p382,p383,p384,p385,p386,p387,p388,p389,p390) WHERE HASH_ID = SHA2('6fpsrcwcccjigdek2851400rxuafibyepqny', 256) AND EXPIRE_DATE >  NOW() ;
+----+-------------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+----------------+---------+-------+------+-------------+
| id | select_type | table          | partitions                                                                                                                                                           | type  | possible_keys          | key            | key_len | ref   | rows | Extra       |
+----+-------------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+----------------+---------+-------+------+-------------+
|  1 | SIMPLE      | BIG_STORAGE_00 | p358,p359,p360,p361,p362,p363,p364,p365,p366,p367,p368,p369,p370,p371,p372,p373,p374,p375,p376,p377,p378,p379,p380,p381,p382,p383,p384,p385,p386,p387,p388,p389,p390 | range | EXPIRE_DATE_IX,HASH_ID | EXPIRE_DATE_IX | 5       | const |    1 | Using where |
+----+-------------+----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------+----------------+---------+-------+------+-------------+
1 row in set (0.00 sec)
[22 Feb 17:26] Riccardo Pizzi
Looks like the optimizer has given the exact same cost to both indexes: 5.81 (based on the largest partition which happens to have rows < EXPIRE_DATE).
Since they have the exact same cost, the optimizer decides to take the first out of the two.
Guess what, rebuilding the table with order of indexes inverted fixes the problem!!

I wonder why can't the optimizer know that EXPIRE_DATE is a partitioning key and all costs being equal, decide to use the other index, and reserve EXPIRE_DATE for pruning as intended?

EXPIRE_DATE_IX first, HASH_ID_IX last:

  opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 17:28:51 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f58c00b1fd8 (ROR scans)  scans: HASH_ID
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key EXPIRE_DATE_IX, cost 5.81, records 4

HASH_ID_IX first, EXPIRE_DATE_IX last:

opt_range.cc:   264: get_key_scans_params: opt: range_scan_alternatives: starting struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "HASH_ID_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9 <= HASH_ID <= 925445622fe6c9da0a432b3e686b807557c215f04233b59a94b7eff1cb6ef3d9"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 1
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 1
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc:   264: get_key_scans_params: opt: (null): starting struct
  opt_range.cc:   299: get_key_scans_params: opt: index: "EXPIRE_DATE_IX"
  opt_range.cc:  9541: check_quick_select: exit: Records: 4
  opt_range.cc:   264: get_key_scans_params: opt: ranges: starting struct
  opt_range.cc:   299: get_key_scans_params: opt: (null): "2017-02-22 18:20:05 < EXPIRE_DATE"
  opt_range.cc:   279: get_key_scans_params: opt: ranges: ending struct
  opt_range.cc:   313: get_key_scans_params: opt: index_dives_for_eq_ranges: 1
  opt_range.cc:   313: get_key_scans_params: opt: rowid_ordered: 0
  opt_range.cc:   313: get_key_scans_params: opt: using_mrr: 0
  opt_range.cc:   313: get_key_scans_params: opt: index_only: 0
  opt_range.cc:   336: get_key_scans_params: opt: rows: 4
  opt_range.cc:   347: get_key_scans_params: opt: cost: 5.81
  opt_range.cc:   313: get_key_scans_params: opt: chosen: 0
  opt_range.cc:   299: get_key_scans_params: opt: cause: "cost"
  opt_range.cc:   279: get_key_scans_params: opt: (null): ending struct
  opt_range.cc: 13830: print_sel_tree: info: SEL_TREE: 0x7f7da4095838 (ROR scans)  scans: HASH_ID_IX
  opt_range.cc:  5676: get_key_scans_params: info: Returning range plan for key HASH_ID_IX, cost 5.81, records 4
[28 Feb 7:15] Umesh Shastry
Hello Riccardo,

Thank you for the report and test case.

Thanks,
Umesh