Bug #36138 SELECT with LEFT JOIN and aggregate function returns incorrect results
Submitted: 16 Apr 2008 13:21 Modified: 25 Apr 2008 12:06
Reporter: Rafal Rosiak Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: DML Severity:S1 (Critical)
Version:5.0.45 OS:Any
Assigned to: CPU Architecture:Any
Tags: join, sum

[16 Apr 2008 13:21] Rafal Rosiak
Description:
Table a
id, val1
0, 1
1, 2
2, 3

Table b
id, val2
1, 10
1, 10
1, 10

Table c
id, val3
0, 5
1, 5
2, 5

select a.id, val1, sum(val2), sum(val3) from a
left join b on a.id=b.id
left join c on a.id=c.id
group by val1

returns:
0 1 NULL 5
1 2 30 15
2 3 NULL 5

How to repeat:
run select mentioned above - always wrong

Suggested fix:
result shuld be:
0 1 NULL 5
1 2 30 5
2 3 NULL 5
[16 Apr 2008 19:44] Rafal Rosiak
I've forgot about a.id in GROUP BY, should be:

select a.id, val1, sum(val2), sum(val3) from a
left join b on a.id=b.id
left join c on a.id=c.id
group by a.id, val1
[16 Apr 2008 22:22] Peter Laursen
same on 5.1.24/windows

select a.id, val1, sum(val2), sum(val3) from a
left join b on a.id=b.id
left join c on a.id=c.id
group by id, val1;

/*
    id    val1  sum(val2)  sum(val3)
------  ------  ---------  ---------
     0       1     (NULL)          5
     1       2         30         15
     2       3     (NULL)          5
*/

-- but
select a.id, val1, sum(val3) from a
left join c on a.id=c.id
group by id, val1;

/*
    id    val1  sum(val3)
------  ------  ---------
     0       1          5
     1       2          5
     2       3          5
*/
[18 Apr 2008 14:12] Rafal Rosiak
I think it is very serious bug. In my case this bug eliminates MySQL from further using.
[22 Apr 2008 9:15] Peter Laursen
I think this is not a bug!

you have to do like this:'

select  a1.id, a1.val1, sumval2, sumval3 from (select a.id, a.val1, sum(val2) as sumval2 from a left join b on a.id=b.id group by val1) as a1
 left join 
(select c.id, sum(val3) as sumval3 from a left join c on a.id=c.id group by val1) as a2 on a1.id = a2.id;

First create a derived table with 3 columns and join again to get four columns.

The way you are doing means the result of first join is again joined to the thired table here like (A join B Join C) = (A join( B join C)).  There is not an independent join for every column in the result (as you expect).
[22 Apr 2008 10:11] Peter Laursen
I also think it gets so complicated because of your overall Schema design!

Why not simply have ONE table (id, val1, val2, val3) ??
[22 Apr 2008 11:22] Rafal Rosiak
The same on windows and I belive it is OS independent.
[22 Apr 2008 11:30] Rafal Rosiak
"I also think it gets so complicated because of your overall Schema design!
Why not simply have ONE table (id, val1, val2, val3) ??"

It's just a simple case to show the problem. Project has 11 tables and many tables are needed because of speed requirements and it is a natural effect of database normalisation.
[22 Apr 2008 11:31] Peter Laursen
it is 'not a bug'  :-)

this is the meaning of 

"left join b on a.id=b.id
left join c on a.id=c.id"

in SQL.  

You seem to assume that "left join b on a.id=b.id" has effect for one column of the result and "left join c on a.id=c.id" for another.

But they BOTH have effect for ALL columns returned! With this Schema you need to generate 'derived tables' and join again for every new column you want to add in the result.
[23 Apr 2008 14:11] Rafal Rosiak
Hi Peter,
Lets do that on the first two tables a and b:

select a.id, sum(val1), sum(val2)
from a
left join b on a.id=b.id
group by a.id

We'll see:
+----+-----------+-----------+
| id | sum(val1) | sum(val2) |
+----+-----------+-----------+
|  0 |         1 |      NULL | 
|  1 |         6 |        30 | 
|  2 |         3 |      NULL | 
+----+-----------+-----------+
and now lets take that query result as a answer for a question:
"how much money I spend in Chicago (val1) and in LA (val2) in each month (id)?"
I had $36 of all (select sum(val1) from a plus select sum(val2) from b) and I've spend $40 :) - month 1, sum(val1) - WRONG!!!
Magic?
Another way:
How it is possible to get 6 by sum rows with id=1 when it is only one row with value val1=2?

What is the meaning of relation after JOIN if it is not satisfied?

I'he tested query on mssql and the results are similar. Maybe it touches many other engines, but it doesn't mean, that's correct. Is that wrong fundamentals? Codd was wrong or it was a misunderstanding or relational algebra is so complicated?
Or I am wrong, because affecting on all attributes by every join statement doesn't appeals to me. It makes that kind of query useless.

Maybe it is worth of think about.
[23 Apr 2008 14:26] Peter Laursen
"I'he tested query on mssql and the results are similar". 
So I think you should realize and try to understand what such JOIN means in SQL!
(I first thouhgt too you were right, but a colleague of mine explained to me!) 

Did you try this:

select  a1.id, a1.val1, sumval2, sumval3 
  from (select a.id, a.val1, sum(val2) as sumval2 
        from a 
        left join b on a.id=b.id group by val1) 
        as a1
  left join (select c.id, sum(val3) as sumval3 
             from a 
             left join c on a.id=c.id group by val1) 
             as a2 
  on a1.id = a2.id;

The problem is that your Schema Design is bad and that is why you will a complicated JOIN like this one to get the desired result.  What does not exclude the possibility that someone can find a slightly simpler solution ...

I cannot express myself better than what I already did.

"You seem to assume that "left join b on a.id=b.id" has effect for one column of the result and "left join c on a.id=c.id" for another." and "There is not an independent join for every column in the result (as you expect)."

But for some weird reason MySQL people are very slow to respond here ...
[23 Apr 2008 14:46] Rafal Rosiak
I've used subqueries because it was only way to get right results. But look at this:
select a.id, sum(val1), sum(val2)
from a
left join b on a.id=b.id
where a.id=1 and b.id=1
group by a.id

and say what results You expect before You'll see them.
And, are You suprised? Because I am.
Tell me, how people did that before subqueries was introduced in MySQL?
I understand results of that query:
select a.id, val1, val2
from a
left join b on a.id=b.id
where a.id=1 and b.id=1

and that:

select a.id, val1, val2
from a
left join b on a.id=b.id

but it still doesn't convince me that join with aggregate is right.
[23 Apr 2008 14:49] Rafal Rosiak
..by the way, my query looks now like that:
select date(ut.data_sp), kf, komp, zwrtkf+zwrtkomp,
(select sum(wp.kwota) from wplaty wp where date(ut.data_sp)=wp.dday and ut.sklep_id=wp.sklep_id) as wpkw,
(select sum(wp.prowizja) from wplaty wp where date(ut.data_sp)=wp.dday and ut.sklep_id=wp.sklep_id) as wppr,
(select sum(pr.kwota) from przelew pr where date(ut.data_sp)=pr.dday and ut.sklep_id=pr.sklep_id) as prkw,
(select sum(ra.pwplata) from raty ra where date(ut.data_sp)=ra.dday and ut.sklep_id=ra.sklep_id) as rawp,
(select sum(ra.kredyt) from raty ra where date(ut.data_sp)=ra.dday and ut.sklep_id=ra.sklep_id) as rakr,
(select sum(ka.kwota) from karty ka where date(ut.data_sp)=ka.dday and ut.sklep_id=ka.sklep_id) as kakw,
(select sum(bo.kwota) from bony bo where date(ut.data_sp)=bo.dday and ut.sklep_id=bo.sklep_id) as bokw
from utarg ut
where month(ut.data_sp)=10 and year(ut.data_sp)=2007 and day(ut.data_sp)=16 and ut.sklep_id=65
group by date(ut.data_sp), kf, komp, zwrtkf+zwrtkomp
order by ut.data_sp

and before was:
select date(ut.data_sp), sum(wp.kwota), sum(wp.prowizja), sum(pr.kwota), sum(ra.pwplata), sum(ra.kredyt), sum(ka.kwota), sum(bo.kwota)
from utarg ut
left join wplaty wp on (date(ut.data_sp)=wp.dday and ut.sklep_id=wp.sklep_id)
left join przelew pr on (date(ut.data_sp)=pr.dday and ut.sklep_id=pr.sklep_id)
left join raty ra on (date(ut.data_sp)=ra.dday and ut.sklep_id=ra.sklep_id)
left join karty ka on (date(ut.data_sp)=ka.dday and ut.sklep_id=ka.sklep_id)
left join bony bo on (date(ut.data_sp)=bo.dday and ut.sklep_id=bo.sklep_id)
where month(ut.data_sp)=10 and year(ut.data_sp)=2007 and day(ut.data_sp)=16 and ut.sklep_id=65
group by date(ut.data_sp)
order by ut.data_sp
[23 Apr 2008 15:04] Rafal Rosiak
And a few words about schema. Three tables (but You can add more if it is too simple):
-shop takings - one per day,
-credit card payment - many in one day,
-hire-purchase - many in one day,
-delayed payment invoice - many in one day.

What schema do You propose (simplify: just date and value are needed)? And how to report month income by one day?
This is the reason of my consternation.
[23 Apr 2008 15:09] Peter Laursen
What example I provided is not a 'subquery' - it is JOIN of 'derived tables' and that was ALWAYS supported. And btw: Also SELECT ... WHERE .. IN (SELECT .. ) and SELECT .. FROM (SELECT ..) are not 'subqueries' and they were all always supported.  Lots of confusion here ..

The true meaning of the term 'subquery' is described here:
http://dev.mysql.com/doc/refman/5.0/en/subqueries.html
.. and in that EXACT meaning of the term 'subquery' support is only from MySQL 4.1.

This is a place for reporting bugs to MySQL.  I think discussion has evolved behind that now!  So I will not post again here.  But I think I have explained why I think this is 'not a bug'.

Now let the MySQL people reply ... what they should have done last week!
[25 Apr 2008 12:06] Sveta Smirnova
Thank you for the report.

Try query select * from a left join b on a.id=b.id left join c on a.id=c.id; Results would be:

id      val1    id      val2    id      val3
0       1       NULL    NULL    0       5
1       2       1       10      1       5
1       2       1       10      1       5
1       2       1       10      1       5
2       3       NULL    NULL    2       5

You can notice 3 rows with id=1. So it is expected to get 15 as sum(val3) when running original query.

I close the report as "Not a Bug" because this.

Peter, thank you for provided workarounds.