MySQL Bugs: #64557: Correlated subquery performance is degraded in releases alfter 5.0

Bug #64557	Correlated subquery performance is degraded in releases alfter 5.0
Submitted:	6 Mar 2012 0:12	Modified:	8 Oct 2012 18:33
Reporter:	David Berg	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.1.56, 5.5.20, 5.6.4	OS:	Windows (XP Pro SP3)
Assigned to:		CPU Architecture:	Any
Tags:	correlated, Optimizer, performance, regression, subquery

Description:
The following correlated subquery executes in under 5 seconds in Version 5.0.51a, but takes ~101 minutes in 5.1.56, 5.5.20, and 5.6.4.

SELECT project_id, project_name, indicator_name, ind_meas_date,
       ind_meas_measure, ind_meas_comment
  FROM projects p, indicators i, indicator_measures m
 WHERE m.indicator_id = i.indicator_id
   AND i.internal_project_id =  p.internal_project_id
   AND i.indicator_id IN (SELECT indicator_id
                            FROM indicator_measures
                           WHERE internal_project_id = p.internal_project_id
                           GROUP BY 1 HAVING COUNT(*) > 1
                         ) 
 ORDER BY project_name, indicator_name, ind_meas_date DESC;

This symptom resembles characteristics - and may be a duplicate - of Bugs 9021, 9090, 14253, 15135, 18465, 19895, 32665. However, suggestions that fix(es) have been backported to 5.6 and even 5.5 seem to not be borne out.

Attached (or uploaded to the bug server) are table schemas, row counts, and query plans for all three versions as well as table dumps for replication. 

How to repeat:
Create and populate the three tables from the attached (or uploaded) table dumps and execute the above query.

Suggested fix:
Restore whatever worked in 5.0 but is broken in 5.1 and later.

I do not see any attachments in this bug report.

The attachments were uploaded to ftp://ftp.oracle.com/support/incoming/bug-data-64557.zip immediately after posting the bug report.

Thank you for the report.

This is fixed in version 5.6.5, but I still set it to "Verified", because it is regression and fix was because improvements in Optimizer code. So devs can consider if they want to fix it in 5.5 too.

Results in my environment:

5.0 - 24.38 sec
5.1 - 56 min 28.07 sec
5.5 - 37 min 1.14 sec
5.6 - 14.16sec

In 5.0.97, this query uses a ref access on the IND_MEAS_IDX1 index. In 5.1.67 and 5.5.29, the ref access is replaced with an index access, which causes an excessively high evaluation cost.
In 5.6.8, the same ref access is used as in 5.0. I have not gone into any detail about what causes these changes, but I consider this bug fixed in 5.6.

Notice that use of GROUP BY and HAVING excludes semi-join transformation and use of outer reference in WHERE clause excludes subquery materialization.

Workaround without GROUP BY.
---------------------------

The query can be rewritten without a GROUP BY clause. It may then look like this:

SELECT project_id, project_name, indicator_name, ind_meas_date,
       ind_meas_measure, ind_meas_comment
FROM PROJECTS p, INDICATORS i, INDICATOR_MEASURES m
WHERE m.indicator_id = i.indicator_id
  AND i.internal_project_id =  p.internal_project_id
  AND 1 IN
   (SELECT 1
    FROM INDICATOR_MEASURES AS im
    WHERE i.indicator_id=indicator_id AND p.internal_project_id=internal_project_id
    HAVING COUNT(*) > 1
   )
 ORDER BY project_name, indicator_name, ind_meas_date DESC;

This query is faster in all releases (0.29 seconds in 5.0.97 and 0.36 seconds in 5.6.8), and uses ref access on the INDICATOR_ID index.

Workaround for materialization.
------------------------------

MySQL 5.6 supports materialization of IN subqueries, which may be taken advantage of by rewriting the query like this:

SELECT project_id, project_name, indicator_name, ind_meas_date,
       ind_meas_measure, ind_meas_comment
FROM PROJECTS p, INDICATORS i, INDICATOR_MEASURES m
WHERE m.indicator_id = i.indicator_id
  AND i.internal_project_id =  p.internal_project_id
  AND (i.indicator_id, p.internal_project_id) IN
   (SELECT indicator_id, internal_project_id
    FROM INDICATOR_MEASURES AS im
    GROUP BY indicator_id, internal_project_id
    HAVING COUNT(*) > 1
   )
 ORDER BY project_name, indicator_name, ind_meas_date DESC;

Notice that there are no outer references in the rewritten subquery. This is a pre-requisite for materialization to be used. In MySQL 5.6, this query uses 0.27 seconds, which is the fastest of all execution times.

Roy's suggested workaround without GROUP BY executes in 0.499 seconds on 5.5.24 on my now-Windows 7 platform.

Thank you!