MySQL Bugs: #70203: GROUP BY is consistent with ORDER BY in subquery

Bug #70203	GROUP BY is consistent with ORDER BY in subquery
Submitted:	30 Aug 2013 16:38	Modified:	31 Aug 2013 12:31
Reporter:	Anthony Master	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	5.5.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	group-by, order-by, subquery

Description:
I answered a this question: http://stackoverflow.com/a/18521684/1707323 with a query similiar to:

SELECT
  *
FROM
  (
    SELECT
      *
    FROM
      table_name
    WHERE
      table_name.some_field='1' OR
      table_name.some_field='2'
    ORDER BY
      table_name.some_field
  ) sub_query
GROUP BY
  sub_query.primary_key_column

And the comment was made

Clever idea, but this is not guaranteed to work in MySQL, and will cause a hard error in SQL Server and most other database systems. Quoting from Extensions to GROUP BY: "The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause."

It's entirely possible that for your specific version of MySQL, for your specific tables, it will always give you the results you want, because the query plan you get causes your query to be executed exactly the way you expect. But if the documentation explicitly states that it will indeterminately choose a value, and that your ORDER BY is not enough to make this query reliable, I would be very cautious about trying this on other databases, or other versions of MySQL.

The MySQl manual continues also to say: "Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses." Which would be true for queries such as

SELECT
  *
FROM
  table_name
WHERE
  table_name.some_field='1' OR
  table_name.some_field='2'
GROUP BY
  table_name.primary_key_column
ORDER BY
  table_name.some_field

But that is different than the query with the order by clause in the sub query. 
The MySQL manual should be updated whether an ORDER BY clause in the subquery will force grouping to always use the first instance of that group.

Shouldn't FIFO or LIFO have something to do with GROUP BY with the way it operates? I just can't wrap my mind around:

The server is free to choose any value from each group

There has to be some sort of logic to this choice. The computer is not allowed to make choices on its own without some sort of directive. If the computer can make choices on its own without any directive then we have truly reached the ultimate goal in computing, ARTIFICIAL INTELLIGENCE, and now the computers can write all my code for my applications. [LOL] What are the odds that this is undocumented logic? wouldn't RAND in the logic take more processing power than FIFO or FILO? There has to be some logic involved here that is undocumented but works 100% of the time.

How to repeat:
CREATE TABLE IF NOT EXISTS `user_preferences` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) DEFAULT NULL,
  `preference_name` varchar(20) DEFAULT NULL,
  `preference_value` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=8 ;

--
-- Dumping data for table `user_preferences`
--

INSERT INTO `user_preferences` (`id`, `user_id`, `preference_name`, `preference_value`) VALUES
(1, 1, 'homepage', 'www.yahoo.com'),
(2, NULL, 'homepage', 'www.google.com'),
(3, NULL, 'email_notify', 'yes'),
(4, 1, 'email_notify', 'no'),
(5, NULL, 'history', 'off'),
(6, 1, 'pictures', 'HD'),
(7, 2, 'some_setting', 'setting_val');

SELECT
  *
FROM 
  (
    SELECT
      *
    FROM
      user_preferences
    WHERE
      user_id ='1' OR
      user_id IS NULL
    ORDER BY
      CASE WHEN user_id IS NULL THEN 1 ELSE 0 END
  ) sub_query
GROUP BY
  preference_name;

The above will give the same results 100% of the time which according the the MySQl manual is not the way that GROUP BY operates.

Suggested fix:
If there is some logic to GROUP BY then add it to the documentation instead of "The server is free to choose any value from each group". I believe this is an undocumented logic to the GROUP BY. If this is truly the case that there is never any logic done on the GROUP BY and truly "The server is free to choose any value from each group" then explain how MySQL has developed Artificial Intelligence or why MySQL uses RANDOM logic in GROUP BY which will slow down any/all queries with GROUP BY instead of using FIFO or LIFO

Quoted from: http://stackoverflow.com/users/107744/spencer7593

"The idea behind SQL (Structured Query Language) is that it is declarative, in that it declares the resultset to be returned, and does not specify "how" the result set is to be returned.

[...] well-written code isn't 'random' in nature, it executes following a set of rules, and we observe consistent behavior. However, this is not a guarantee that this behavior won't ever change. (Compilers implement optimizations which result in machine code that gives the same result, but operates differently.)

For years, when we did a GROUP BY in Oracle, we observed that Oracle would use a sort operation as part of the "how" it went about satisfying the GROUP BY. Everytime we ran the same query, Oracle would do the same sort. But Oracle never made a guarantee that a sort operation would always be used.

When a new version of Oracle was rolled out, lo and behold, we observe that Oracle is now using a hash operation, rather than a sort operation, to satisfy a GROUP BY. Any query that included an ORDER BY clause continued to return results in the order that was specified (guaranteed behavior), but other queries which didn't have an ORDER BY had rows returned in a different order. Oracle just improved "how" it went about preparing a resultset that satisfied the requirements.

This one example demonstrates why we don't write queries that depend on a behavior that is not guaranteed.

In some later release of MySQL, we will likely observe changes in the behavior, where the resultset returned by a query is guaranteed to meet the specifications, but will be DIFFERENT from the resultset returned currently.

One of the primary design goals of SQL is that we declare the result set is to be returned, without specifying "how" the DBMS is to go about producing the resultset. We do use hints, and we structure queries in certain ways, and that does influence the execution plan generated by the optimizer. But this is not a guarantee.

If the resultset is to meet a specified requirement, then the SQL should include that specification; we should not rely on a particular observed behavior to imply that specification."

RECOMMENDATION:

I recommend changing the line found on https://dev.mysql.com/doc/refman/5.7/en/group-by-extensions.html 

From:

"The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate."

To:

"The server's choice of the value from each group is based on the current version's optimization logic which may change from version to version, so unless the values are the same, they are indeterminate and may not be the same results from one version to the next."

Because:

The server is not given the freedom to choose the values, but the values are chosen based upon optimization logic within the current version.

Thank you for the report.

What you are asking is just same phrase, but written in different words. Also not only server version matters, but tables content, because optimizer can choose different plan depending from the table statistics.

Regarding to Geometry you can only refute proves this way. So, if we'd guarantee order is same and you showed opposite this method would work. But not in this case.

So I don't think new re-phrasing will make the user manual clearer.