Bug #117493 Anti-join incorrectly handles ASCII control characters
Submitted: 17 Feb 15:41 Modified: 17 Feb 15:54
Reporter: jinhui lai Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:9.1.0 OS:Ubuntu (22.04)
Assigned to: CPU Architecture:x86

[17 Feb 15:41] jinhui lai
Description:
Anti-join incorrectly handles ASCII control characters, such as "SOH" (start of heading), by failing to respect their correct ordering (e.g., "SOH" < a < b < d < e). This results in incorrect output.

How to repeat:
Create a file test.sql and fill test.sql with follow sql statements. Notably, there is a ASCII control character "SOH"(start of heading),before letter "c".

CREATE TABLE IF NOT EXISTS t0(c0 TINYTEXT);
CREATE TABLE IF NOT EXISTS t1 LIKE t0;
INSERT INTO t0(c0) VALUES("b"), ("c"), ("d");
INSERT INTO t1(c0) VALUES('a'), ('e');
CREATE INDEX i1 ON t0(c0(1));

Then, execute SQL from file. For example: source /test.sql

The result set of a left join should equal the union of an anti-join and a semi-join. However, the results do not fit this equation, so we find this bug.

SELECT t1.c0 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t0 WHERE t1.c0 < t0.c0);
+------+
| c0   |
+------+
| e    |
+------+
SELECT t1.c0 FROM t1 INNER JOIN t0 ON t1.c0 < t0.c0;
+------+
| c0   |
+------+
| a    |
| a    |
+------+
SELECT t1.c0 FROM t1 LEFT JOIN t0 ON t1.c0 < t0.c0; 
+------+
| c0   |
+------+
| a    |
| a    |
| a    |
| e    |
+------+
[17 Feb 15:54] MySQL Verification Team
Hi Mr. lai,

We tried but we could not repeat your test case.

These are the results that we get:

mysql> SELECT t1.c0 FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t0 WHERE t1.c0 < t0.c0);
+------+
| c0   |
+------+
| e    |
+------+
1 row in set (0.00 sec)

mysql>
mysql> SELECT t1.c0 FROM t1 INNER JOIN t0 ON t1.c0 < t0.c0;
+------+
| c0   |
+------+
| a    |
| a    |
| a    |
+------+
3 rows in set (0.00 sec)

mysql>
mysql> SELECT t1.c0 FROM t1 LEFT JOIN t0 ON t1.c0 < t0.c0;
+------+
| c0   |
+------+
| a    |
| a    |
| a    |
| e    |
+------+
4 rows in set (0.00 sec)

-----------------------------

The problem is that you are not specifying Unicode characters as per utfmb4 standard.

Can't repeat.