Bug #36604 Partition pruning not pruning correctly
Submitted: 8 May 2008 22:11 Modified: 9 May 2008 17:49
Reporter: Gabriel Harriman Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Partitions Severity:S2 (Serious)
Version:5.1.22 OS:Linux
Assigned to: CPU Architecture:Any
Tags: hash, partition, pruning

[8 May 2008 22:11] Gabriel Harriman
Description:
I have this table, partitioned with a hash on "hour(date_time)", with 24 partitions, one per hour:

| oa_data_raw_ad_impression | CREATE TABLE `oa_data_raw_ad_impression` (
  `viewer_id` varchar(32) DEFAULT NULL,
  `viewer_session_id` varchar(32) DEFAULT NULL,
  `date_time` datetime NOT NULL,
  `ad_id` int(10) unsigned NOT NULL,
  `creative_id` int(10) unsigned NOT NULL,
  `zone_id` int(10) unsigned NOT NULL,
  `channel` varchar(255) DEFAULT NULL,
  `channel_ids` varchar(64) DEFAULT NULL,
  `language` varchar(32) DEFAULT NULL,
  `ip_address` varchar(16) DEFAULT NULL,
  `host_name` varchar(255) DEFAULT NULL,
  `country` char(2) DEFAULT NULL,
  `https` tinyint(1) DEFAULT NULL,
  `domain` varchar(255) DEFAULT NULL,
  `page` varchar(255) DEFAULT NULL,
  `query` varchar(255) DEFAULT NULL,
  `referer` varchar(255) DEFAULT NULL,
  `search_term` varchar(255) DEFAULT NULL,
  `user_agent` varchar(255) DEFAULT NULL,
  `os` varchar(32) DEFAULT NULL,
  `browser` varchar(32) DEFAULT NULL,
  `max_https` tinyint(1) DEFAULT NULL,
  `geo_region` varchar(50) DEFAULT NULL,
  `geo_city` varchar(50) DEFAULT NULL,
  `geo_postal_code` varchar(10) DEFAULT NULL,
  `geo_latitude` decimal(8,4) DEFAULT NULL,
  `geo_longitude` decimal(8,4) DEFAULT NULL,
  `geo_dma_code` varchar(50) DEFAULT NULL,
  `geo_area_code` varchar(50) DEFAULT NULL,
  `geo_organisation` varchar(50) DEFAULT NULL,
  `geo_netspeed` varchar(20) DEFAULT NULL,
  `geo_continent` varchar(13) DEFAULT NULL,
  KEY `data_raw_ad_impression_viewer_id` (`viewer_id`),
  KEY `data_raw_ad_impression_date_time` (`date_time`),
  KEY `data_raw_ad_impression_ad_id` (`ad_id`),
  KEY `data_raw_ad_impression_zone_id` (`zone_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 /*!50100 PARTITION BY HASH ( HOUR(date_time)) PARTITIONS 24  */ |

when I do an explain partitions to check that pruning is working i get this:

mysql> explain partitions select count(*) from oa_data_raw_ad_impression where HOUR(date_time) = 2;

+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+---------------+----------------------------------+---------+------+--------+--------------------------+
| id | select_type | table                     | partitions                                                                            | type  | possible_keys | key                              | key_len | ref  | rows   | Extra                    |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+---------------+----------------------------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | oa_data_raw_ad_impression | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23 | index | NULL          | data_raw_ad_impression_date_time | 8       | NULL | 951804 | Using where; Using index |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+---------------+----------------------------------+---------+------+--------+--------------------------+

I'm pretty sure this query should only be accessing one partition. Is this assumption correct?

How to repeat:
CREATE TABLE `oa_data_raw_ad_impression` (
  `viewer_id` varchar(32) DEFAULT NULL,
  `viewer_session_id` varchar(32) DEFAULT NULL,
  `date_time` datetime NOT NULL,
  `ad_id` int(10) unsigned NOT NULL,
  `creative_id` int(10) unsigned NOT NULL,
  `zone_id` int(10) unsigned NOT NULL,
  `channel` varchar(255) DEFAULT NULL,
  `channel_ids` varchar(64) DEFAULT NULL,
  `language` varchar(32) DEFAULT NULL,
  `ip_address` varchar(16) DEFAULT NULL,
  `host_name` varchar(255) DEFAULT NULL,
  `country` char(2) DEFAULT NULL,
  `https` tinyint(1) DEFAULT NULL,
  `domain` varchar(255) DEFAULT NULL,
  `page` varchar(255) DEFAULT NULL,
  `query` varchar(255) DEFAULT NULL,
  `referer` varchar(255) DEFAULT NULL,
  `search_term` varchar(255) DEFAULT NULL,
  `user_agent` varchar(255) DEFAULT NULL,
  `os` varchar(32) DEFAULT NULL,
  `browser` varchar(32) DEFAULT NULL,
  `max_https` tinyint(1) DEFAULT NULL,
  `geo_region` varchar(50) DEFAULT NULL,
  `geo_city` varchar(50) DEFAULT NULL,
  `geo_postal_code` varchar(10) DEFAULT NULL,
  `geo_latitude` decimal(8,4) DEFAULT NULL,
  `geo_longitude` decimal(8,4) DEFAULT NULL,
  `geo_dma_code` varchar(50) DEFAULT NULL,
  `geo_area_code` varchar(50) DEFAULT NULL,
  `geo_organisation` varchar(50) DEFAULT NULL,
  `geo_netspeed` varchar(20) DEFAULT NULL,
  `geo_continent` varchar(13) DEFAULT NULL,
  KEY `data_raw_ad_impression_viewer_id` (`viewer_id`),
  KEY `data_raw_ad_impression_date_time` (`date_time`),
  KEY `data_raw_ad_impression_ad_id` (`ad_id`),
  KEY `data_raw_ad_impression_zone_id` (`zone_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 /*!50100 PARTITION BY HASH ( HOUR(date_time)) PARTITIONS 24  */

explain partitions select count(*) from oa_data_raw_ad_impression where HOUR(date_time) = 2;
[8 May 2008 22:12] Gabriel Harriman
this non-aggregated query accesses the same partitions:

mysql> explain partitions select * from oa_data_raw_ad_impression where HOUR(date_time) = 2;
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table                     | partitions                                                                            | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | oa_data_raw_ad_impression | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23 | ALL  | NULL          | NULL | NULL    | NULL | 969250 | Using where |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+------+---------------+------+---------+------+--------+-------------+
[9 May 2008 8:11] Mattias Jonsson
Please look at http://s.petrunia.net/blog/?p=21

The problem is using a function in where clause, if using date_time ='<specific date>' it would be pruning properly.
[9 May 2008 15:00] Susanne Ebrecht
Many thanks for open a bug report.

I will close this bug report now because I think you problem is solved with the documentation Mattias gave you.

If you still have problems here, please feel free to open this bug report again.
[9 May 2008 17:09] Gabriel Harriman
OK - I understand the function factor and I confirmed that it only scans one partition if I search for an exact time, 2008-05-09 10:00:00 for example.  If I remove the function and do a date_time BETWEEN X AND Y, where X AND Y would be found in the same partition, it still scans all partitions. For instance:

mysql> explain partitions select * from oa_data_raw_ad_impression where date_time between '2008-05-09 10:00:00' and '2008-05-09 10:59:59';
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+----------------------------------+----------------------------------+---------+------+-------+-------------+
| id | select_type | table                     | partitions                                                                            | type  | possible_keys                    | key                              | key_len | ref  | rows  | Extra       |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+----------------------------------+----------------------------------+---------+------+-------+-------------+
|  1 | SIMPLE      | oa_data_raw_ad_impression | p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23 | range | data_raw_ad_impression_date_time | data_raw_ad_impression_date_time | 8       | NULL | 39393 | Using where |
+----+-------------+---------------------------+---------------------------------------------------------------------------------------+-------+----------------------------------+----------------------------------+---------+------+-------+-------------+
[9 May 2008 17:11] Gabriel Harriman
Hi Susanne - 

I'm not dead yet!  =)  I still think there's something wrong with partition pruning.  Please see my last comment.
[9 May 2008 17:42] Susanne Ebrecht
Gabriel,

Statements with BETWEEN 'a' AND 'b' always occur that the system has to search through all.

There is no artificial intelligence. 
Computers are silly. When you make a between then this means that the computer will look to every value and will test if it is bigger/equal 'a' or smaller/equal 'b'.

That is an expected behaviour.

I will set this bug back to not a bug.
[9 May 2008 17:49] Gabriel Harriman
Alright, so maybe it isn't a bug, but it's definitely a feature that would be handy. =)

If you have a list and a range query, the range should be checked to see if it fits into one partition.  Otherwise it's almost pointless to partition based on a function of the datetime field. I can't think of any useful query that would request a record based on the exact second.
[12 May 2008 22:55] Mattias Jonsson
Gabriel, please try to use partitioning with range if you want to optimize range searches (it will then prune partitions correctly if the partitioning function is strictly monotonic increasing). The problem here is that you want it to prune on range for a hash partitioned table and that has not been implemented.

(the server does not know if f(X) is in the same partition for all X between two different date, M<=X<=N.)

What you could do is maybe to combine range partition with hash subpartition.
[16 May 2008 19:14] Sveta Smirnova
Related bug #36758