Description:
Performing a SELECT ... GROUP BY query with an aggregate function on a large table whose collation is `utf8mb4_unicode_520_ci` fails with a "duplicate entry" error. A duplicate-entry error should never occur on a SELECT query.
How to repeat:
A query that causes this error is:
SELECT COUNT(*) FROM reports GROUP BY `query`
An example of a specific error is:
Error Code: 1062. Duplicate entry 'mens gloves' for key 'group_key'
This error occurs even when 'mens gloves' appears in only a single row of the `query` column, so there is no duplicate data in the source table.
The query succeeds if the collation is explicitly set to a different collation:
SELECT SUM(impressions) FROM reports GROUP BY `query` COLLATE utf8mb4_general_ci
It also succeeds if a WHERE clause is added to reduce the result size (note the WHERE clause must precede GROUP BY):
SELECT COUNT(*) FROM reports WHERE `query` > 'o' GROUP BY `query`
Depending on the size of the table, the query can also succeed if the in-memory temp table size is increased:
SET SESSION max_heap_table_size=536870912;
SET SESSION tmp_table_size=536870912;
Thus the issue appears specific to the `utf8mb4_unicode_520_ci` collation on a table with a large number of rows (the error starts appearing at around 70-80K rows in our live data).
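A minimal repro sketch follows. The table and column names are taken from the queries above; the column types, sizes, and row count are assumptions on my part, since the real schema is not shown:

```sql
-- Hypothetical schema matching the failing queries (types/sizes assumed).
CREATE TABLE reports (
  `query` VARCHAR(255),
  impressions INT
) ENGINE=InnoDB
  DEFAULT CHARSET=utf8mb4
  COLLATE=utf8mb4_unicode_520_ci;

-- Populate with roughly 70-80K distinct `query` values, then run:
SELECT COUNT(*) FROM reports GROUP BY `query`;
-- Expected (buggy) result: Error 1062, Duplicate entry '...' for key 'group_key'
```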
This bug appears related to bug #62755 and bug #58081, but is distinct from them because here the error is triggered by the collation set on the table and/or the query.
Suggested fix:
A workaround is to set the collation of the table or column to `utf8mb4_general_ci` or to explicitly specify the collation in the query.
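For example, the collation workaround could be applied like this (column type and size are assumed, since the real schema is not shown in this report):

```sql
-- Change only the affected column (type/size assumed):
ALTER TABLE reports
  MODIFY `query` VARCHAR(255)
  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- Or convert the whole table's default and existing columns:
ALTER TABLE reports
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```

Note that changing the collation can alter comparison and sort semantics for existing queries, so this is a workaround rather than a fix.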
A less reliable workaround is to raise the temp table size limits (max_heap_table_size and tmp_table_size), but I suspect this only raises the row-count threshold at which the error occurs rather than eliminating it.