Bug #115132 group by with COLLATE return unexpected result
Submitted: 27 May 2:48 Modified: 31 May 7:13
Reporter: Yang Yu Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:8.0.32 OS:Any
Assigned to: CPU Architecture:Any
Tags: charset, collate

[27 May 2:48] Yang Yu
Description:
Some varcahr equal in some collate.
COLLATE is also supported by GROUP BY keyword.
However, equal varchar are not grouped.

How to repeat:
create table my_test(`c1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL);

insert into my_test values (_utf8mb4 0x0701), (_utf8mb4 0xE2808A);

select count(distinct c1) from my_test;

select * from my_test group by c1 collate utf8mb4_unicode_520_ci;

Suggested fix:
should not use temp table for grouping.
[27 May 10:02] MySQL Verification Team
Hi Mr. Yu,

Thank you for your bug report.

However, we are not able to repeat your results. You should use only full group by , since in any other cases you will not get exact  values of all columns.

Also, all aggregated queries, including GROUP BY have to use temporary tables for their resolution.

Can't repeat.
[27 May 11:34] Tianyu You
Use only full group by, I repeated this issue by the steps below:

create table my_test(`c1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL);

insert into my_test values (_utf8mb4 0x0701), (_utf8mb4 0xE2808A);

select _utf8mb4 0x0701 collate utf8mb4_unicode_520_ci = _utf8mb4 0xE2808A collate utf8mb4_unicode_520_ci; -- this query returns, which means that 0x0701 and 0xE2808A are identical under collate utf8mb4_unicode_520_ci

select count(distinct c1) from my_test; -- this query returns 1, which is correct

select count(*) from my_test group by c1; -- this query returns 2 rows, which is unexpected.

By migrating this test case for other collations, I found that utf8mb4_unicode_ci also has this issue.
[28 May 6:49] MySQL Verification Team
Isn't this inconsistent?  From 8.0.37:

mysql> set internal_tmp_mem_storage_engine=TempTable;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from my_test group by c1 collate utf8mb4_unicode_520_ci;
+----------+
| count(*) |
+----------+
|        1 |
|        1 |
+----------+
2 rows in set (0.00 sec)

mysql> set internal_tmp_mem_storage_engine=memory;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from my_test group by c1 collate utf8mb4_unicode_520_ci;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)
[31 May 7:13] Yang Yu
TempTable storage engine has bug

mysql> set internal_tmp_mem_storage_engine=TempTable;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from my_test group by c1 collate
utf8mb4_unicode_520_ci;
+----------+
| count(*) |
+----------+
|        1 |
|        1 |
+----------+
2 rows in set (0.00 sec)

mysql> set internal_tmp_mem_storage_engine=memory;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from my_test group by c1 collate
utf8mb4_unicode_520_ci;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)