Bug #4064 group by with having produces wrong results
Submitted: 8 Jun 2004 21:19 Modified: 9 Jun 2004 0:33
Reporter: Gökhan Demir Email Updates:
Status: Not a Bug Impact on me:
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.2 OS:Linux (Linux)
Assigned to: Dean Ellis CPU Architecture:Any

[8 Jun 2004 21:19] Gökhan Demir
I want to select the latest records in their own groups (latest entered, or most important, etc.). For this, I am trying to use a select with group by and a having clause but am never able to get the desired results. 

How to repeat:
drop table T;
create table T (id int, val int, importance int, primary key(id) );
insert into T (id, val, importance) values (1, 1, 1);
insert into T (id, val, importance) values (2, 1, 2);
insert into T (id, val, importance) values (3, 1, 3);
insert into T (id, val, importance) values (4, 2, 1);
insert into T (id, val, importance) values (5, 2, 2);
insert into T (id, val, importance) values (6, 3, 1);
insert into T (id, val, importance) values (7, 4, 1);
insert into T (id, val, importance) values (8, 4, 7);
select * from T order by id
select val, importance from T group by val having importance = max(importance)

The query wrongly returns only 1 row:
   val  importance
------  ----------
     3         1  

Suggested fix:
Make the necessary bugfixes to make the above query return this result set:
   val  importance
------  ----------
     1         3  
     2         2  
     3         1  
     4         7  

Instead of using the having clause, I currently use subqueries for such selects as a workaround, such that:

select val, importance from T t1
  where importance = (select max(importance) from T where val = t1.val)
  group by val 

Also, I don't understand why I should mention the column name in select in order to use it in the having clause. Indeed, I don't need the importance column in my result set.
[8 Jun 2004 22:41] Dean Ellis
This is expected behavior.  Columns that do not appear in the GROUP BY clause or in aggregate functions basically take the first value the database finds when processing the query, and you cannot control which row is used to produce values for such columns without a subquery.  It is a MySQL extension to the standard that you can even reference the column in the SELECT without grouping by it or using it in aggregate functions.

With HAVING effectively being an output filter, you encounter all of this: the value must be available at this point in the query in order to make a comparison, and you have no control over which row is used to provide the value.
[8 Jun 2004 23:06] Gökhan Demir
All major db vendors implement having clause as I have described in this case. For me, it is very disappointing to see the MySQL team does not consider this as a bug whereas I think this is a serious bug.

The subquery attempt causes the query size to be twice as the subquery to get the max value on a specific group is very similar to the main query. Besides, as an EJB developer, I cannot have a unique select clause for different db vendors on that type of selects. The subquery attempt would require me to specify more ?s in the prepared statement, so, it would be really more difficult to have a unique Java code either (since there are different numbers of ?s in the prepared statement select).

I, as a developer, think that it shouldn't be that difficult to permit the use of a column in the having clause and also any aggregate (min or max would be the most used duo) on that column. Remember, all major db vendors allow that. (At the time of this writing, I have tried it with Sybase ASE, for example).

I hope, you as a MySQL developer would get the other team members' opinions on this and would decide to treat this case as a bug.

Best regards,
Gokhan Demir
[8 Jun 2004 23:40] Dean Ellis
All major vendors do not, in fact, implement such an extension as you have described; as you already have a feature request for this I will leave it at that, which is basically what you have here.

Even if the HAVING allowed you to do this, however, you still have the same problem caused by SELECTing columns which are not used in the GROUP BY or in aggregate functions; you would have no control over which row is used to provide the value it compares, so that if the values for that column are not unique within the particular GROUP BY set, you would have unpredictable results as you see here.

ie: changing your query to check MIN(importance) instead of MAX(importance) would likely demonstrate this unpredictability.
[9 Jun 2004 0:33] Sergei Golubchik
Just to clarify Dean's answer a little bit...

You do a GROUP BY val, but also consider a value of importance column - this is non-standard, the value of this column is not defined. E.g. for val=1 there are three possible importance values (1, 2, and 3) and MySQL is free to use any one from those:

mysql> select val, importance from T group by val;
| val  | importance |
|    1 |          1 |
|    2 |          1 |
|    3 |          1 |
|    4 |          1 |

So when you compare to max(importance) you can see that, indeed, only one row matches. But even that is not guaranteed - whan value of importance MySQL will use for each value of val depends on how the query is precessed internally - you cannot rely on it.

What you actually want to do, you can achieve without subqueries of having, just simply

mysql> select val, max(importance) from T group by val;
| val  | max(importance) |
|    1 |               3 |
|    2 |               2 |
|    3 |               1 |
|    4 |               7 |

but if you'd like to know the value of id for each row, it won't be that easy
[9 Jun 2004 0:34] Gökhan Demir
Mr Ellis,

You were right. I have tried with Oracle 9i, MS-SQL 2000, Sybase ASA (old Watcom SQL) and Sybase ASE 12.5. Only Sybase ASE 12.5 accepts and runs the query (in the way I think). So, sorry for pre-assuming the other major vendors would implement this feature without doing tests on my side.

My thinking is simply that: the comparison with a column and an aggregate on that column should (and would) be applied to all rows of the current group to distinguish which row (or rows) to be returned. Therefore, a query like this becomes reasonable:

select id from T group by val having importance=max(importance)

I read it like: "Give me the id numbers of the rows for each unique value with the highest importance.". 

Again, sorry for taking your time; and thank you for your comments.

Best Regards,
Gokhan Demir
[9 Jun 2004 1:13] Gökhan Demir
Mr Golubchik, Mr Ellis

I have further studied on the case. Thank you for your help. You are right. The subquery attempt seems to be the most reasonable choice. I probably need a select like this to get the id numbers of the rows that are the most important:

select val, max(id) from T t1
where importance = (select max(importance) from T where val = t1.val )
group by val

I have inserted two more rows to the test table:
insert into T (id, val, importance) values (9, 1, 2)
insert into T (id, val, importance) values (10, 4, 7)

So now, for the value 4, there are two "most important" rows. Sybase ASE returns these rows for this query:

select val, id from T group by val having importance = max(importance)
val         id          
----------- ----------- 
          1           3 
          2           5 
          3           6 
          4           8 
          4          10 

Which is different than what I have expected. Like in MySQL extension of invisible fields, I would expect Sybase to return a total of 4 rows with 4th row (with val = 4) to reflect either 8 or 10 as id. I was wrong. The case is over in my head. Sorry for taking your times.

Best wishes,
Gokhan Demir