MySQL Bugs: #46717: Unnecessary joins are not optimized away

Bug #46717	Unnecessary joins are not optimized away
Submitted:	14 Aug 2009 11:14	Modified:	24 Aug 2009 4:50
Reporter:	Robin Sillem	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	5.1.32	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
We have a use case where a reporting component presents a list of columns to the user. The list is, in effect, the set of columns available from a pre-defined join of one or more tables, and the user may choose which he wants to appear on his report.

In some cases the user may choose columns exclusively from a single table, thus rendering some or all of the joins redundant. However, the execution plan still uses the joins, resulting in a longer query time, even though the joins could be optimized away.

As we have a pre-defined FROM clause, which the end user is not encouraged to tinker with, and from which a number of related reports/queries may be defined by the end user, the option of hand optimizing each query is not an acceptable solution.

How to repeat:
This is a simplified example of the problem:

Create table Maternity.Babies
(
 Id INT PRIMARY KEY,
 ForeName varchar(20) NULL
) ENGINE=INNODB;

create table Maternity.Weights
(
 BabyId integer not null,
 AgeInDays integer not null,
 Weight float not null,
 CONSTRAINT Baby_Fk FOREIGN KEY (BabyId) REFERENCES Babies (Id)
) ENGINE=INNODB;

In my test case with 30000 rows in weights

select sql_no_cache w.* from weights w inner join babies b on w.babyid = b.id

takes roughly twice as long as the equivalent

select sql_no_cache w.* from weights w

Note that the join to babies could be optimized away in this case because b.id is a primary key and w.babyid is 'not null' (so there will be exactly 1 matching row in babies for each one in weights). A different set of restrictions would apply if it were an outer join.

Suggested fix:
Enable the MySQL query optimizer to detect situations where joins may be dropped altogether, and construct an execution planm accordingly

Many thanks for writing a bug report.

This is not a bug in my eyes.

Consider, there is no artificial intelligence ....

Create table Maternity.Babies
(
 Id INT PRIMARY KEY,
 ForeName varchar(20) NULL
) ENGINE=INNODB;

create table Maternity.Weights
(
 BabyId integer not null,
 AgeInDays integer not null,
 Weight float not null,
 CONSTRAINT Baby_Fk FOREIGN KEY (BabyId) REFERENCES Babies (Id)
) ENGINE=INNODB;

No primary key on table "weights".

Look to the constraint:
FOREIGN KEY (BabyId) REFERENCES Babies (Id)

For the system this is 1:N relation.

From where should the system know that you want 1:1 here?

Also let us do an example:

Baby id = 5

It has the following weights:

after birth: 3.5 lb
after 7 days: 7 lb
after a year: 20 lb

so, when you do:
SELECT * FROM weight;

Result is:

id | AgeInDays | Weight
________________________

5 | 0 | 3.5
5 | 7 | 7.0
5 | 365 | 20.0

3 rows for Baby 5.

So system just is not able to optimise away the join.

Susanne, 
Thanks for the swift response.

Perhaps I wasn't clear enough in what I was requesting.

It's not a question of whether it's a 1:1 or 1:N relationship so much as the fact that the relationship is not used at all. In the example you cite:

SELECT * FROM weights WHERE BabyId = 5;

Result is:

id | AgeInDays | Weight
________________________

5 | 0 | 3.5
5 | 7 | 7.0
5 | 365 | 20.0

all the returned data comes from weights and there is no need to reference babies in the execution plan, even if the query was written with the unnecessary join to babies, as:

SELECT w.* FROM weights w JOIN babies b ON w.BabyId = b.Id WHERE w.BabyId = 5

The fact that the schema constraints mandate that each row of weights must refer to exactly 1 baby means that the formulation with the join will return the same number of rows as the formulation without, means that the join can be dropped. 

This requires no domain knowledge or artificial intelligence, as all the conditions for dropping the unnecessary join can be discovered from the query and the schema.

My request is that the query optimizer should detect these situations, and construct an execution plan that does not use tables which do not affect the result set in any way.

I would submit that, in the context of the query optimizer, this is either a bug (because the execution plan is clearly sub-optimal), or at the least a feature request.

To confirm that this optimization is possible, I've set up the same tests using MS SQL Server 2005, which removes the reference to Babies from the execution plan as I have described.

Ahhh now I understand you.

Many thanks for writing a bug report.

Verified as described.

I'd like to add that this is not strictly related to foreign keys as in the previous example.

For example:

create table tbl1 (fld1 varchar(10) not null primary key);
create table tbl2 (fld1 varchar(10) not null primary key);
explain select tbl1.* from tbl1 left outer join tbl2 on tbl1.fld1 = tbl2.fld1;

mysql should be able to optimize away tbl2 and doesn't.  Either a primary key or a unique constraint on the secondary side of an outer join should qualify.