MySQL Bugs: #68254: subquery optimize to wrong semi-join with extensions syntax to group by

Bug #68254	subquery optimize to wrong semi-join with extensions syntax to group by
Submitted:	2 Feb 2013 10:31	Modified:	9 Jan 2015 16:42
Reporter:	lou shuai (OCA)	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.6.10	OS:	Any
Assigned to:	Guilhem Bichot	CPU Architecture:	Any
Tags:	GROUP BY, semi join, subquery

Description:
subquery with group by produce wrong result set.

How to repeat:
use test;
CREATE TABLE `t1` (
  `t1_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `type` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  PRIMARY KEY (`t1_id`),
  KEY `type` (`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
CREATE TABLE `t2` (
  `t2_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `t1_id` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  PRIMARY KEY (`t2_id`),
  KEY `t1_id` (`t1_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 
 
INSERT INTO t1 SET TYPE=0;
INSERT INTO t1 SET TYPE=1;
INSERT INTO t1 SET TYPE=1;
INSERT INTO t1 SET TYPE=1;
 
INSERT INTO t2 SET t1_id=1;
INSERT INTO t2 SET t1_id=1;
INSERT INTO t2 SET t1_id=1;
INSERT INTO t2 SET t1_id=1;
INSERT INTO t2 SET t1_id=1;
INSERT INTO t2 SET t1_id=1;
 
INSERT INTO t2 SET t1_id=2;
INSERT INTO t2 SET t1_id=2;
INSERT INTO t2 SET t1_id=2;
INSERT INTO t2 SET t1_id=2;
INSERT INTO t2 SET t1_id=2;
INSERT INTO t2 SET t1_id=2;
 
INSERT INTO t2 SET t1_id=3;
INSERT INTO t2 SET t1_id=3;
INSERT INTO t2 SET t1_id=3;
INSERT INTO t2 SET t1_id=3;
INSERT INTO t2 SET t1_id=3;
INSERT INTO t2 SET t1_id=3;
 
INSERT INTO t2 SET t1_id=4;
INSERT INTO t2 SET t1_id=4;
INSERT INTO t2 SET t1_id=4;
INSERT INTO t2 SET t1_id=4;
INSERT INTO t2 SET t1_id=4;
INSERT INTO t2 SET t1_id=4;
COMMIT;

mysql> SELECT t1_id from t1 group by type;
+-------+
| t1_id |
+-------+
|     1 |
|     2 |
+-------+
2 rows in set (0.00 sec)

mysql> SELECT  t2_id,t1_id FROM t2 WHERE t2.t1_id IN (1,2);
+-------+-------+
| t2_id | t1_id |
+-------+-------+
|     1 |     1 |
|     2 |     1 |
|     3 |     1 |
|     4 |     1 |
|     5 |     1 |
|     6 |     1 |
|     7 |     2 |
|     8 |     2 |
|     9 |     2 |
|    10 |     2 |
|    11 |     2 |
|    12 |     2 |
+-------+-------+
12 rows in set (0.00 sec)

mysql> SELECT  t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE);
+-------+-------+
| t2_id | t1_id |
+-------+-------+
|     1 |     1 |
|     2 |     1 |
|     3 |     1 |
|     4 |     1 |
|     5 |     1 |
|     6 |     1 |
|     7 |     2 |
|     8 |     2 |
|     9 |     2 |
|    10 |     2 |
|    11 |     2 |
|    12 |     2 |
|    13 |     3 |
|    14 |     3 |
|    15 |     3 |
|    16 |     3 |
|    17 |     3 |
|    18 |     3 |
|    19 |     4 |
|    20 |     4 |
|    21 |     4 |
|    22 |     4 |
|    23 |     4 |
|    24 |     4 |
+-------+-------+
24 rows in set (0.00 sec)

Suggested fix:
see the attach file: patch.diff which is based on 
revno: 4579
tags: mysql-5.6.9

patch for this bug

Attachment: patch.diff (text/x-patch), 910 bytes.

Thank you for the excellent bug report! I will copy this over to the development team now, so that they can take a closer look at it.

5.5.31 seems to return a correct result, so this looks like a regression.

@Erlend Dahl
Actually, it's not a regression cause this is a subquery optimize fault which is
implemented only in 5.6. So the bug is brought in with the new optimize method:
subquery to semi join.

Hello. As explained in
http://dev.mysql.com/doc/refman/5.6/en/group-by-extensions.html
the subquery "SELECT t1_id from t1 group by type" is bound to give random results:
"MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. [...] However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.". In table t1, for the group which has type=1, there are three different values of t1_id, so results are indeterminate.
If we want to say this in more pedantic words, selected columns are not functionally dependent on grouping columns, so the subquery is also not compliant with standard SQL2011.

Because it gives random results, the top query also gives random results.
If running with sql_mode=only_full_group_by, it is properly rejected:
SELECT t1_id from t1 group by type;
ERROR 1055 (42000): 'test.t1.t1_id' isn't in GROUP BY

@Guilhem Bichot
I have read the manual about the extention syntax with Group By, and I also know the syntax is not valid in standard SQL syntax. But for now, MySQL already supports this syntax, so we should not just use an ONLY_FULL_GROUP_BY sqlmode to avoid this problem, and we should deal with this problem.

And the indeterminate value means the server is free to choose any value from each group. What every the server choose, the number of the result will be determinate according to the group by list. So whatever the subquery returns, it should just return two rows, so the result of the sql is wrong.

MySQL 5.5.* and before works well, and just MySQL 5.6.* return the different result  set. You should have a deep look into the Optimizer, and you will find this is caused by the subquery Optimize which is introduced only in 5.6.*, the Optimizer translate the subquery to semi-join, so the problem happens. And supply a patch for this problem, it's kind of you to have a glance at the patch.

Generally speaking, the bug is introduced only in 5.6.*, and the older version works well, so please fix it to make sure all of mysql versions works the same.

There is a bug here, but it is not important, IMHO.

There is no way that one should expect to see nested query finishing in:

... IN (1,2);

returning the same result as the nested query presented in the test case.

A nested query is free to return, for this test case, any of the three rows :

(1,2) or (1,3) or (1,4)

but no, it can't return (1,2,3,4) as `type` has only two distinct value. This is the only bug here, although very small.

Even that bug could be explained, by someone from Optimizer team. 

It is quite possible that in the optimizing stage sub query returns (1,2), while in execution it returns (1,4), which would lead to the error that we see in the last result set.

If Optimizer team discovers that this is a possible scenario, then I would say that this is not a bug. Simply, the extended results from aggregated queries are unreliable and can't be used in optimization.

http://bugs.mysql.com/bug.php?id=69521 marked as duplicate of this one.

actually they are not duplicates. This one has IN(subquery).