Bug #68640 Optimizer does not handle group by properly in insert...select
Submitted: 11 Mar 2013 20:01 Modified: 14 Mar 2013 13:44
Reporter: Trey Raymond Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.5.29 OS:Linux (RHEL 6.3)
Assigned to: CPU Architecture:Any
Tags: GROUP BY, insert select

[11 Mar 2013 20:01] Trey Raymond
Description:
The query (it's long, but just an insert ignore...select with a group by on the primary key):

insert ignore into audience_aggregate_hourly_age_gdr_city (hourID,campaignMap_campaignID,creativeMap_creativeID,age,gender,geoMap_WOE_ID,impressions,clicks,interactions,fallback_imps,fallback_clicks,richmedia_imps,richmedia_clicks,post_clicks_conv,post_imps_conv)

select t1.hourID,t1.campaignMap_campaignID,t1.creativeMap_creativeID,t1.age,t1.gender, 0 as geoMap_WOE_ID, t1.impressions - COALESCE(sum(t2.impressions),0) as impressions, t1.clicks - COALESCE(sum(t2.clicks),0) as clicks, t1.interactions - COALESCE(sum(t2.interactions),0) as interactions, t1.fallback_imps - COALESCE(sum(t2.fallback_imps),0) as fallback_imps, t1.fallback_clicks - COALESCE(sum(t2.fallback_clicks),0) as fallback_clicks, t1.richmedia_imps - COALESCE(sum(t2.richmedia_imps),0) as richmedia_imps, t1.richmedia_clicks - COALESCE(sum(t2.richmedia_clicks),0) as richmedia_clicks, t1.post_clicks_conv - COALESCE(sum(t2.post_clicks_conv),0) as post_clicks_conv, t1.post_imps_conv - COALESCE(sum(t2.post_imps_conv),0) as post_imps_conv

from audience_aggregate_hourly_age_gdr_city t1
left join audience_aggregate_hourly_age_gdr_city t2 force index (primary) on t1.campaignMap_campaignID=t2.campaignMap_campaignID and t1.age=t2.age and t1.gender=t2.gender and t1.hourID=t2.hourID and t1.creativeMap_creativeID=t2.creativeMap_creativeID and t2.geoMap_WOE_ID>1 and t2.hourID between 375288 and 375311
where t1.hourID between 375288 and 375311 and t1.geoMap_WOE_ID=1
group by t1.campaignMap_campaignID,t1.geoMap_WOE_ID,t1.age,t1.gender,t1.creativeMap_creativeID,t1.hourID
having impressions > 0 or clicks > 0 or post_clicks_conv > 0 or post_imps_conv > 0 or interactions > 0 or fallback_imps > 0 or fallback_clicks > 0 or richmedia_imps > 0 or richmedia_clicks > 0

"explain partitions" output on the select:
1	SIMPLE	t1	p_375312	index	NULL	PRIMARY	15	NULL	2	Using where
1	SIMPLE	t2	p_375312	ref	PRIMARY	PRIMARY	3	zestdb.t1.campaignMap_campaignID	118210
-- This looks good, using the key for the grouping as intended

But from processlist:
63873	wraymond	localhost	zestdb	Query	74	Copying to tmp table	[QUERY]	0	354849	354849

Now, why would it be copying to tmp table?  It's a straight insert ignore...select, no temp table is needed to solve this query.  It looks like some very basic query optimizations are not being applied in the case of insert...select.  It takes almost no time at all to run the select alone, and it doesn't insert many rows, but the insert...select takes ages.

FYI, the create table:
CREATE TABLE `audience_aggregate_hourly_age_gdr_city` (
  `age` tinyint(2) unsigned NOT NULL,
  `gender` tinyint(2) unsigned NOT NULL,
  `campaignMap_campaignID` mediumint(9) unsigned NOT NULL,
  `creativeMap_creativeID` mediumint(9) unsigned NOT NULL,
  `hourID` mediumint(9) unsigned NOT NULL,
  `geoMap_WOE_ID` int(10) unsigned NOT NULL,
  `clicks` mediumint(8) unsigned DEFAULT '0',
  `clicks_contributors` mediumint(8) unsigned DEFAULT '0',
  `impressions` int(10) unsigned DEFAULT '0',
  `impression_contributors` int(10) unsigned DEFAULT '0',
  `interaction_contributors` int(10) unsigned DEFAULT '0',
  `interactions` int(10) unsigned DEFAULT '0',
  `fallback_imps` int(10) unsigned DEFAULT '0',
  `fallback_clicks` mediumint(8) unsigned DEFAULT '0',
  `richmedia_imps` int(10) unsigned DEFAULT '0',
  `richmedia_clicks` mediumint(8) unsigned DEFAULT '0',
  `last_updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `post_clicks_conv` mediumint(8) unsigned DEFAULT '0',
  `post_imps_conv` int(10) unsigned DEFAULT '0',
  PRIMARY KEY (`campaignMap_campaignID`,`geoMap_WOE_ID`,`age`,`gender`,`creativeMap_creativeID`,`hourID`),
  KEY `camp_crtv` (`campaignMap_campaignID`,`creativeMap_creativeID`,`geoMap_WOE_ID`,`hourID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY RANGE (hourID)
(PARTITION p_START VALUES LESS THAN (0) ENGINE = InnoDB,
...
 PARTITION p_375192 VALUES LESS THAN (375192) ENGINE = InnoDB,
 PARTITION p_375216 VALUES LESS THAN (375216) ENGINE = InnoDB,
 PARTITION p_375240 VALUES LESS THAN (375240) ENGINE = InnoDB,
 PARTITION p_375264 VALUES LESS THAN (375264) ENGINE = InnoDB,
 PARTITION p_375288 VALUES LESS THAN (375288) ENGINE = InnoDB,
 PARTITION p_375312 VALUES LESS THAN (375312) ENGINE = InnoDB,
 PARTITION p_375336 VALUES LESS THAN (375336) ENGINE = InnoDB,
 PARTITION p_375360 VALUES LESS THAN (375360) ENGINE = InnoDB,
 PARTITION p_375384 VALUES LESS THAN (375384) ENGINE = InnoDB,
 PARTITION p_375408 VALUES LESS THAN (375408) ENGINE = InnoDB,
 ...
 PARTITION p_END VALUES LESS THAN MAXVALUE ENGINE = InnoDB);

How to repeat:
Create the table as described, populate with some garbage data, run the query and watch the processlist, compare to output for explain on the select query by itself.

Suggested fix:
Ensure the optimizer runs through the select inside of an insert the same way it would treat a normal select.
[12 Mar 2013 20:11] Sveta Smirnova
Thank you for the report.

SELECT really does not use temporary tables (at least for dummy data I used). You can run simple test:

FLUSH STATUS;
SELECT ...
show status like '%tmp%';

But INSERT uses. I think it is necessary, because you insert into same table from which you select.

If I use another table (create table audience_aggregate_hourly_age_gdr_city2 like audience_aggregate_hourly_age_gdr_city;) - no temporary table created.

But EXPLAIN in version 5.6 shows temporary table is used for INSERT:

explain insert ignore into audience_aggregate_hourly_age_gdr_city (hourID,campaignMap_campaignID,creativeMap_creativeID,age,gender,geoMap_WOE_ID,impressions,clicks,interactions,fallback_imps,fallback_clicks,richmedia_imps,richmedia_clicks,post_clicks_conv,post_imps_conv)
select t1.hourID,t1.campaignMap_campaignID,t1.creativeMap_creativeID,t1.age,t1.gender, 0 as geoMap_WOE_ID, t1.impressions - COALESCE(sum(t2.impressions),0) as impressions, t1.clicks - COALESCE(sum(t2.clicks),0) as clicks, t1.interactions - COALESCE(sum(t2.interactions),0) as interactions, t1.fallback_imps - COALESCE(sum(t2.fallback_imps),0) as fallback_imps, t1.fallback_clicks - COALESCE(sum(t2.fallback_clicks),0) as fallback_clicks, t1.richmedia_imps - COALESCE(sum(t2.richmedia_imps),0) as richmedia_imps, t1.richmedia_clicks - COALESCE(sum(t2.richmedia_clicks),0) as richmedia_clicks, t1.post_clicks_conv - COALESCE(sum(t2.post_clicks_conv),0) as post_clicks_conv, t1.post_imps_conv - COALESCE(sum(t2.post_imps_conv),0) as post_imps_conv
from audience_aggregate_hourly_age_gdr_city t1
left join audience_aggregate_hourly_age_gdr_city t2 force index (primary) on t1.campaignMap_campaignID=t2.campaignMap_campaignID and t1.age=t2.age and t1.gender=t2.gender and t1.hourID=t2.hourID and t1.creativeMap_creativeID=t2.creativeMap_creativeID and t2.geoMap_WOE_ID>1 and t2.hourID between 375288 and 375311
where t1.hourID between 375288 and 375311 and t1.geoMap_WOE_ID=1
group by t1.campaignMap_campaignID,t1.geoMap_WOE_ID,t1.age,t1.gender,t1.creativeMap_creativeID,t1.hourID
having impressions > 0 or clicks > 0 or post_clicks_conv > 0 or post_imps_conv > 0 or interactions > 0 or fallback_imps > 0 or fallback_clicks > 0 or richmedia_imps > 0 or richmedia_clicks > 0
;
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	index	PRIMARY,camp_crtv	PRIMARY	15	NULL	2	Using where; Using temporary
1	SIMPLE	t2	ref	PRIMARY	PRIMARY	3	test.t1.campaignMap_campaignID	1	Using where

So this looks like not a bug for me.
[12 Mar 2013 20:49] Trey Raymond
I could imagine a query needing one if we didn't have technologies like tx isolation/mvcc, but in this case I can't see how one is necessary, same table or not.  Can you perhaps explain how writing the rows progressively rather than in a temp table would be beneficial here?  Otherwise, looks like a bug.
[13 Mar 2013 18:35] Sveta Smirnova
easier test case for MTR

Attachment: bug68640.test (application/octet-stream, text), 691 bytes.

[13 Mar 2013 18:55] Trey Raymond
Yes, that's much simpler.  However, in this case as well, that temp table is not necessary, but is created anyway, so it still looks like a bug.  Any use case of a concurrent transaction doing something, at least any that I can think of, would be handled gracefully, temp table or not.

Posting output here for people's reference:

mysql> flush status;

mysql> show status like 'Created_tmp_tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_tables      | 0     |
+-------------------------+-------+

mysql> select t11.f2, t11.f3, t12.f1 from t1 as t11 join t1 as t12;
...
25 rows in set (0.00 sec)

mysql> show status like 'Created_tmp_tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_tables      | 0     |
+-------------------------+-------+

mysql> insert ignore into t1 select t11.f2, t11.f3, t12.f1 from t1 as t11 join t1 as t12;
Query OK, 11 rows affected (0.00 sec)
Records: 25  Duplicates: 14  Warnings: 0

mysql> show status like 'Created_tmp_tables';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_tables      | 1     |
+-------------------------+-------+
[14 Mar 2013 12:46] Sveta Smirnova
Thank you for the feedback.

Indeed, shorter test displays exactly same issue.

We discussed this report internally with InnoDB and Optimizer teams and agreed this is not a bug.

This is part of the discussion:

----<q>----
We do indeed have tx isoloation and mvcc, but in this case the whole query takes place in one transaction, meaning that the read cursor sees insertions immediately. It is indeed possible to have static cursors within a transaction (sort of having a sub-transaction that does not see the actions of the rest of the transaction). This could be implemented by copying to a temporary table (so no gain), or with some mvcc sub-transaction.
----</q>-----

So it is not reasonable to implement this in different way.
[14 Mar 2013 13:44] Trey Raymond
Okay, understood.