Bug #97605 Trailing space in joins is handled differently in recent versions of MySQL 8
Submitted: 13 Nov 6:07 Modified: 21 Nov 7:48
Reporter: Yoseph Phillips Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.18, 8.0.11 OS:Any
Assigned to: CPU Architecture:Any

[13 Nov 6:07] Yoseph Phillips
Description:
In older versions of MySQL JOINing VARCHAR columns with a collation using PAD SPACE (such as utf8mb4_unicode_ci) would have joined values when the only difference was in the number of trailing spaces.

In 8.0.18 however the number of trailing spaces needs to match when doing a JOIN.

When doing a SELECT with a literal on the WHERE clause the number of trailing spaces is ignored in all versions of MySQL so it appears that it is only the JOIN that has changed.

We cannot seem to find this documented anywhere.

How to repeat:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
       
CREATE TABLE test (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT, 
  testValue VARCHAR(10), 
  PRIMARY KEY (id)
);
       
INSERT INTO test (testValue) VALUES ('test'), ('test   ');
       
SELECT * FROM test t WHERE t.testValue = 'test';
       
SELECT *
FROM test t1
INNER JOIN test t2 ON t2.testValue = t1.testValue;

Suggested fix:
Either document this if this was intentional, or fix it if it wasn't.
[13 Nov 6:27] Umesh Shastry
Hello Yoseph Phillips,

Thank you for the report and feedback.

regards,
Umesh
[13 Nov 18:35] Steinar Gunderson
Hi,

This is documented here:

https://dev.mysql.com/doc/refman/8.0/en/charset-binary-collations.html#charset-binary-coll...
[14 Nov 0:01] Yoseph Phillips
That documentation does not talk about this.

The issue here is caused by the new hash join optimisation.

If one does:
SELECT *
FROM test t1
INNER JOIN test t2 ON t2.testValue = t1.testValue;

while using the hash join optimisation it gives different results than documented.

If one calls SET optimizer_switch = "hash_join=off"; first than it gives the results as documented.
[20 Nov 23:04] Jon Stephens
Documented fix as follows in the MySQL 8.0.19 changelog:

    When creating hash values used for lookups during a hash join,
    the server did not respect the PAD SPACE attribute, meaning that
    'foo' and 'foo ' would not match in a PAD SPACE collation. This
    is fixed this by padding all strings up to the same length as
    the longest possible string, where the longest possible string
    is deduced from the data type length specifier, such as N in 
    CHAR(N) or VARCHAR(N).

Closed.
[21 Nov 7:48] Yoseph Phillips
Maybe it would be better for performance and simpler to trim these values instead so that the lookup value of 'foo ' is 'foo'.