Bug #815 SELECT with DISTINCT and JOIN ignores the ORDER BY
Submitted: 8 Jul 2003 15:37 Modified: 27 Aug 2003 2:20
Reporter: Stepan Riha Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.0 OS:x86 Linux
Assigned to: Paul DuBois CPU Architecture:Any

[8 Jul 2003 15:37] Stepan Riha
Description:
This is likely a duplicate of Bug #275 which was closed and supposedly fixed in 4.0.13 (but it still occurs in this case).  I am attaching a SQL dump, hence a new request.

I suspect that this is an optimizer bug.  Note that the DISTINCT field is in one table while the ORDER BY field is in another table connected via two JOINs.

The following query returns an incorrectly sorted result:

SELECT DISTINCT category_id 
FROM mt_entry
JOIN mt_placement ON placement_entry_id = entry_id 
JOIN mt_category ON placement_category_id = category_id 
ORDER BY entry_modified_on DESC

Omiting the DISTINCT directive returns properly sorted results (with duplicates, of course).

How to repeat:
Create tables using the attached SQL dump.

Run the following query:

SELECT DISTINCT category_id 
FROM mt_entry
JOIN mt_placement ON placement_entry_id = entry_id 
JOIN mt_category ON placement_category_id = category_id 
ORDER BY entry_modified_on DESC

The correct results should be:
9
19
9
4
1
[8 Jul 2003 15:38] Stepan Riha
SQL dump of database showing problem

Attachment: nonplus_test.sql (text/plain), 7.74 KiB.

[10 Jul 2003 14:36] MySQL Verification Team
I was not able to repeat it.

I tried with tables that I have.

Can you upload your tables to:

ftp://support.mysql.com:/pub/mysql/secret

and let us know a filename.
[10 Jul 2003 14:40] Stepan Riha
I've already attached the SQL dump to the bug report (nonplus_test.sql).  You should be able to get it via http://bugs.mysql.com/file.php?id=3

Let me know if for some reason you can't get to it.
[11 Jul 2003 3:52] MySQL Verification Team
Yes, it is a bug.
[19 Jul 2003 11:34] Sergei Golubchik
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

This is not a bug. For each of distinct values of the column category_id, there are many values of entry_modified_on. Which value will be picked up (and used in ORDER BY) is undefined. To make it more obvious let's rewrite your query with GROUP BY:

SELECT category_id FROM mt_entry
JOIN mt_placement ON placement_entry_id = entry_id 
JOIN mt_category ON placement_category_id = category_id 
GROUP BY category_id
ORDER BY entry_modified_on DESC;

This query will return the same result. Now, you can add entry_modified_on column to SELECT list and see:

mysql> SELECT category_id,entry_modified_on FROM mt_entry
   --> JOIN mt_placement ON placement_entry_id = entry_id
   --> JOIN mt_category ON placement_category_id = category_id
   --> GROUP BY category_id ORDER BY entry_modified_on DESC;
+-------------+-------------------+
| category_id | entry_modified_on |
+-------------+-------------------+
|          19 |    20030623165609 |
|           4 |    20030426103241 |
|           8 |    20030416204920 |
|           9 |    20030317100950 |
|           1 |    20030120134513 |
+-------------+-------------------+

As you see, the result is correct - column entry_modified_on is, indeed, has decreasing ordering.
[22 Jul 2003 15:09] Stepan Riha
As far as I can tell from the documentation, your rewritten query using GROUP BY is not semantically equivalent to the original query using DISTINCT.

According to the MySQL documentation "DISTINCT [... specifies] that duplicate rows in the result set should be removed".  It doesn't say anything about reordering of the result set.  If you look at the result set produced without using DISTINCT, you will see that applying DISTINCT does not do what the documentation says it should do.

Maybe the documentation needs to be updated?
[23 Jul 2003 5:11] Sergei Golubchik
GROUP BY also does not mean reordering. And some DBMS'es does not return rows from GROUP BY in any particular order (if, e.g. GROUP BY is implemented with hash tables).

And, yes, as "duplicate rows in the result set should be removed" it is not defined which row from the table will go into result set, so technically one cannot use in WHERE or ORDER BY fields that are not part of result set. According to SQL standard (I think any version of it will do) one is not allowed to use columns in the query with DISTINCT that are not part of result set.
[23 Jul 2003 6:55] Stepan Riha
You're right about the requirement of having the ORDER BY columns in a DISTINCT select in other DBMSes. I tried the same setup using Microsoft's SQLServer and it complained about this very problem.

I was (naively?) expecting that since MySQL did NOT return an error in my SQL, and since it has expanded standard SQL in other places (like SELECTing columns that are not specified as GROUP BY), that this was the behavior it would take.  Plus I *think* it used to work in previous versions of MySQL.

Oh well, I guess c'est la vie.  You may want to change the documentation to reflect this, though.
[29 Jul 2003 3:53] Lenz Grimmer
Assigned to documentation. Paul, could you please update the manual accordingly? Thanks!
[27 Aug 2003 2:20] Michael Widenius
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

I am a little confused about what you think is the problem as I think
MySQL does exactly what you told it to do (I may be a bit biased of
course but...:)

I uploaded the tables and tested this in MySQL 4.0:

First the query without ORDER BY:

mysql> SELECT DISTINCT category_id
    -> FROM mt_entry
    -> JOIN mt_placement ON placement_entry_id = entry_id
    -> JOIN mt_category ON placement_category_id = category_id ;
+-------------+
| category_id |
+-------------+
|           1 |
|           4 |
|           8 |
|           9 |
|          19 |
+-------------+

Which I think we can both agree is correct.

If we add ORDER BY entry_modified_on DESC we get:

+-------------+
| category_id |
+-------------+
|          19 |
|           4 |
|           8 |
|           9 |
|           1 |
+-------------+

I don't see any reason why adding an ORDER BY should change the result
(in your example you had two nines in the result set which looks wrong)

As Sergei pointed out, the value for MySQL used for 'entry_modifed_on'
is in your query not well defined, and in this which case MySQL will
pick up the values from one of the matched rows.

This is a very good extension in the case where you know that all
matched rows will have an equal value for the referred columns (in this case
'entry_modifed_on'), but this is not the case in your example.

To explain how MySQL works:

- Read a row from mt_category.
- Join it with mt_placement and mt_entry
  - As soon as there is a matching row, store 'category_id' and
    'entry_modifed_on' and read next row from mt_category

- Remove found duplicates for 'category'
  (Note that when removing duplicates entry_modifed_on is not considered)

- Sort results based on the hidden entry_modifed_on field

Note that in the manual section:

http://www.mysql.com/doc/en/DISTINCT_optimisation.html

we do mention the DISTINCT is normally optimized as a query with GROUP BY

I have now updated the above manual section to refer to the GROUP BY
extensions that covers how 'hidden fields' are used.

Hope the above helps to understand this issue better.

Regards,
Monty