Bug #117667 Both INNER JOIN and SEMI JOIN are contradictory
Submitted: 11 Mar 12:57 Modified: 11 Mar 16:05
Reporter: jinhui lai Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:9.2.0, 9.1.0, 8.0.41, 8.4.4 OS:Ubuntu (22.04)
Assigned to: CPU Architecture:Any

[11 Mar 12:57] jinhui lai
Description:
Both INNER JOIN and SEMI JOIN return wrong result.

How to repeat:
docker run -it --name mysql-9.2.0 -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root mysql:9.2.0

CREATE TABLE IF NOT EXISTS t0(c0 CHAR(1), c1 CHAR(1)) ;
CREATE TABLE IF NOT EXISTS t1 LIKE t0;
INSERT INTO t0(c1) VALUES(NULL);
UPDATE t0 SET c0='A';
INSERT INTO t0(c0) VALUES('a');
INSERT INTO t1(c0) VALUES('a'), ('b');
SELECT DISTINCT t0.c0 FROM t0;
+------+
| c0   |
+------+
| A    |
+------+
SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 != t1.c0;--error, expect: A
+------+
| c0   |
+------+
| a    |
+------+
SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 = t1.c0;
+------+
| c0   |
+------+
| a    |
+------+
SELECT DISTINCT t0.c0 FROM t0  WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 != t1.c0);--error, expect: a
+------+
| c0   |
+------+
| A    |
+------+
SELECT DISTINCT t0.c0 FROM t0  WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 = t1.c0);
+------+
| c0   |
+------+
| A    |
+------+
[11 Mar 13:25] MySQL Verification Team
Hello jinhui lai,

Thank you for the report and test case.
Verified as described.

regards,
Umesh
[11 Mar 15:07] Roy Lyseng
Posted by developer:
 
I am not sure what is claimed to be a bug here, but the default collation of MySQL is case insensitive.
This means that e.g 'A' and 'a' compare as equal, and when selecting a DISTINCT value from
a set of values including these, it is arbitrary what the result will be.
[11 Mar 16:05] jinhui lai
Dear Roy,  

Thank you for your response. While I understand the default collation is case-insensitive, the issue lies in the logical consistency of the query results. For example:  

In this case, 'SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 != t1.c0' returns 'a', but under case-insensitive rules, 'A' and 'a' are equal, so neither should satisfy '!= 'if t1 contains 'a'. In fact, according to what you said, this query should return an empty set. 
SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 != t1.c0;
+------+
| c0   |
+------+
| a    |
+------+

Similarly, 'SELECT DISTINCT t0.c0 FROM t0 WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 != t1.c0);' returns 'A' instead of 'a', which also contradicts the expected logic. 
SELECT DISTINCT t0.c0 FROM t0 WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 != t1.c0);
+------+
| c0   |
+------+
| a    |
+------+

Furthermore, when i do not insert ‘b’, the result changes, which i consider to be too arbitrary.
CREATE TABLE IF NOT EXISTS t0(c0 CHAR(1), c1 CHAR(1)) ;
CREATE TABLE IF NOT EXISTS t1 LIKE t0;
INSERT INTO t0(c1) VALUES(NULL);
UPDATE t0 SET c0='A';
INSERT INTO t0(c0) VALUES('a');
INSERT INTO t1(c0) VALUES('a'); --not insert value 'b'
SELECT DISTINCT t0.c0 FROM t0;
+------+
| c0   |
+------+
| A    |
+------+
SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 != t1.c0;
+------+
| c0   |
+------+
| A   |
+------+
SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 = t1.c0;
+------+
| c0   |
+------+
| A    |
+------+
SELECT DISTINCT t0.c0 FROM t0  WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 != t1.c0);
+------+
| c0   |
+------+
| A    |
+------+
SELECT DISTINCT t0.c0 FROM t0  WHERE EXISTS (SELECT 1 FROM t1 WHERE t0.c0 = t1.c0);
+------+
| c0   |
+------+
| A    |
+------+

Since the returned value is arbitrary, INNER JOIN and SEMI JOIN are contradictory. In fact, i executed the same case on MariaDB and PostgreSQL and no bug was generated. I think this is a valuable reference.  

All in all, the issue appears to be a logic inconsistency in handling `!=` and `DISTINCT`, not just case sensitivity. This can cause real trouble for MySQL users. Could you please further investigate it? 

Best regards,  
Jinhui Lai
[11 Mar 22:03] Roy Lyseng
In the case

  SELECT DISTINCT t0.c0 FROM t0 INNER JOIN t1 ON t0.c0 != t1.c0;

the inequality is satisfied for the 'b' value in t1.c0, thus it is reasonable to return a value. It is also arbitrary whether 'a' or 'A' is returned.

As for the tests without the 'b' value, I am unable to reproduce your results on 9.x.