MySQL Bugs: #115434: collation: utf8mb4_general_ci can result in serious data skew

Bug #115434	collation: utf8mb4_general_ci can result in serious data skew
Submitted:	26 Jun 2024 8:35	Modified:	1 Jul 2024 8:35
Reporter:	Chaofan Wang
Status:	Won't fix
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	utf8mb4 charset collation

Description:
When using a partitioned table with KEY partitioning, where the partitioning column is a varchar/char column with the utf8mb4_general_ci collation, data skew can occur quite easily.

How to repeat:
1. create a partitioned table

CREATE TABLE `users` (
  `username` varchar(255) COLLATE utf8mb4_general_ci NOT NULL,
  `id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
PARTITION BY KEY (username)
PARTITIONS 8;

2. use a stored procedure to generate random data

DELIMITER //
CREATE PROCEDURE `gen_data`(val int)
begin
  declare i int default 0;
  while i < val do
    insert into users values (SUBSTRING(CONV(FLOOR(RAND() * 99999999999999), 10, 36), 1, 8), i);
    set i = i + 1;
  end while;
end //
DELIMITER ;

3. call the procedure to generate data
call gen_data(1000);

4. check that the data is reasonably random
select * from users;

5. check that data skew happened
SELECT table_schema, table_name, partition_name, table_rows
FROM information_schema.partitions
WHERE table_name = 'users' AND table_schema = $schema_name;

You will see a distribution similar to the one below.
+--------------+------------+----------------+------------+
| TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | TABLE_ROWS |
+--------------+------------+----------------+------------+
| test         | users      | p0             |        169 |
| test         | users      | p1             |          0 |
| test         | users      | p2             |        407 |
| test         | users      | p3             |          1 |
| test         | users      | p4             |        150 |
| test         | users      | p5             |          2 |
| test         | users      | p6             |        271 |
| test         | users      | p7             |          0 |
+--------------+------------+----------------+------------+

Suggested fix:
The reason is that the hash code computed under utf8mb4_general_ci is not well distributed; see the my_hash_sort_utf8mb4 function in ctype-utf8.cc.

The core calculation logic is as follows:

while ((res = my_mb_wc_utf8mb4(&wc, s, e)) > 0) {
    my_tosort_unicode(uni_plane, &wc, cs->state);

    ch = (wc & 0xFF);
    tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8);
    tmp2 += 3;

    ch = (wc >> 8) & 0xFF;
    tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8);
    tmp2 += 3;

    if (wc > 0xFFFF) {
      /*
        Put the highest byte only if it is non-zero,
        to make hash functions for utf8mb3 and utf8mb4
        compatible for BMP characters.
        This is useful to keep order of records in
        test results, e.g. for "SHOW GRANTS".
       */
      ch = (wc >> 16) & 0xFF;
      tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8);
      tmp2 += 3;
    }
    s += res;
  }
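
To experiment with this loop outside the server, it can be ported roughly as follows. This is a sketch, not MySQL code: `hash_sort_general_ci` and `mix` are hypothetical names, `str.upper()` stands in for `my_tosort_unicode` (adequate only for ASCII letters and digits), the 64-bit mask models unsigned overflow, and the starting value `tmp1 = 1` is an assumption (this report only states that `tmp2` starts at 4).

```python
MASK64 = (1 << 64) - 1  # assumption: tmp1 and tmp2 are unsigned 64-bit values


def mix(tmp1, tmp2, ch):
    """Apply one `tmp1 ^= ...; tmp2 += 3` step from the excerpt above."""
    tmp1 ^= ((((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8)) & MASK64
    return tmp1, (tmp2 + 3) & MASK64


def hash_sort_general_ci(text, tmp1=1, tmp2=4):
    """Hypothetical port of the core loop; returns the final tmp1."""
    for char in text.upper():  # stand-in for my_tosort_unicode (ASCII only)
        wc = ord(char)
        tmp1, tmp2 = mix(tmp1, tmp2, wc & 0xFF)
        tmp1, tmp2 = mix(tmp1, tmp2, (wc >> 8) & 0xFF)
        if wc > 0xFFFF:  # the highest byte is mixed in only outside the BMP
            tmp1, tmp2 = mix(tmp1, tmp2, (wc >> 16) & 0xFF)
    return tmp1


if __name__ == "__main__":
    for name in ("A", "a", "user42"):
        print(name, hex(hash_sort_general_ci(name)))
```

Because the collation is case-insensitive, "A" and "a" hash to the same value in this sketch.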

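When the data contains only digits and letters (which is quite common), every character code fits in one byte, so the second step always sees ch = 0 and the third step is skipped. The loop in my_hash_sort_utf8mb4 then simplifies to: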

while ((res = my_mb_wc_utf8mb4(&wc, s, e)) > 0) {
    my_tosort_unicode(uni_plane, &wc, cs->state);

    ch = (wc & 0xFF);

    /* tmp2 starts at 4 and grows by 6 per character, so it is always even here.
       Therefore, if tmp1 is even before this step, it is still even after it. */

    tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8);
    tmp2 += 3;

    /* ch is 0 for the high byte, so this step also keeps an even tmp1 even. */
    tmp1 ^= (tmp1 << 8);
    tmp2 += 3;
    s += res;
  }

Therefore, once tmp1 becomes even, every later value of tmp1 is also even, so the final hash is even, which matches the nearly empty odd-numbered partitions above.
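
The parity argument can be checked with a small simulation. The sketch below is illustrative only: it models the simplified loop for ASCII letters and digits, assumes that `tmp1` starts at 1 (odd) and that the partition is chosen as `hash % 8`, and uses uniformly random base-36 strings rather than the exact output of `gen_data`. Under those assumptions, a key can land in an odd-numbered partition only if every one of its characters has an even code after uppercasing.

```python
import random
import string

ALPHABET = string.digits + string.ascii_uppercase  # base-36 digits, as CONV(..., 10, 36) emits
MASK64 = (1 << 64) - 1  # assumption: 64-bit unsigned arithmetic


def simplified_hash(key, tmp1=1, tmp2=4):
    """Simplified loop for ASCII input (the high byte of wc is always 0)."""
    for char in key.upper():
        ch = ord(char) & 0xFF
        tmp1 ^= ((((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8)) & MASK64
        tmp1 ^= (tmp1 << 8) & MASK64  # second step with ch == 0
        tmp2 += 6  # two increments of 3 per character, so tmp2 stays even
    return tmp1


def bucket_counts(keys, partitions=8):
    """Count keys per partition, assuming partition = hash % partitions."""
    counts = [0] * partitions
    for key in keys:
        counts[simplified_hash(key) % partitions] += 1
    return counts


rng = random.Random(115434)  # fixed seed so runs are repeatable
keys = ["".join(rng.choice(ALPHABET) for _ in range(8)) for _ in range(1000)]
counts = bucket_counts(keys)
print(counts)
print("rows in odd partitions:", sum(counts[1::2]))
```

Since 18 of the 36 base-36 characters have even codes, roughly (1/2)^8 = 1/256 of uniformly random 8-character keys should end up in odd partitions, which is in line with the handful of rows reported in p1, p3, p5 and p7.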

A possible fix is to add more mixing steps when computing the hash code, but this may make partition assignment incompatible between older and newer versions.

Hi Mr. Wang,

Thank you for your bug report.

We managed to repeat it with the latest 8.0 and 8.4:

TABLE_SCHEMA	TABLE_NAME	PARTITION_NAME	TABLE_ROWS
sc	users	p0	185
sc	users	p1	0
sc	users	p2	385
sc	users	p3	1
sc	users	p4	151
sc	users	p5	3
sc	users	p6	274
sc	users	p7	1

However, this is not a bug, but a feature request. A new feature would be a better distribution of the values among the partitions.

Verified as a feature request for the version 8.0 and higher.

Thank you for pointing us to the problem in the code.

I suggest you use utf8mb4_0900_ai_ci which works much better (in several ways). Using your repro but with utf8mb4_0900_ai_ci I get:

mysql> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `username` varchar(255) NOT NULL,
  `id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY KEY (username)
PARTITIONS 8 */
1 row in set (0,01 sec)

mysql> SELECT table_schema, table_name, partition_name, table_rows FROM information_schema.partitions WHERE table_name = 'users' AND table_schema = 'test';
+--------------+------------+----------------+------------+
| TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | TABLE_ROWS |
+--------------+------------+----------------+------------+
| test         | users      | p0             |        132 |
| test         | users      | p1             |        113 |
| test         | users      | p2             |        117 |
| test         | users      | p3             |        115 |
| test         | users      | p4             |        130 |
| test         | users      | p5             |        126 |
| test         | users      | p6             |        152 |
| test         | users      | p7             |        115 |
+--------------+------------+----------------+------------+
8 rows in set (0,00 sec)
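
To put a number on the difference between the two collations, the TABLE_ROWS figures from the two result sets for schema `test` in this report can be compared directly. The following is a small sketch; the max-to-mean ratio (`skew_ratio`, a hypothetical helper) is just one convenient skew measure chosen for illustration, not something MySQL reports.

```python
def skew_ratio(rows):
    """Largest partition size divided by the mean size (1.0 means perfectly even)."""
    return max(rows) / (sum(rows) / len(rows))


# TABLE_ROWS per partition (p0..p7), copied from the result sets in this report.
general_ci = [169, 0, 407, 1, 150, 2, 271, 0]
uca_0900 = [132, 113, 117, 115, 130, 126, 152, 115]

for name, rows in (("utf8mb4_general_ci", general_ci),
                   ("utf8mb4_0900_ai_ci", uca_0900)):
    print(f"{name}: total={sum(rows)} skew={skew_ratio(rows):.2f}")
```

With these figures the ratio is about 3.26 for utf8mb4_general_ci and about 1.22 for utf8mb4_0900_ai_ci; both tables hold 1000 rows.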

Thank you, Bernt.

Since we cannot change existing collations, and there is a reasonable workaround (upgrading to a more recent collation), we are closing this report as it is not feasible to fix.

Thank you, Roy.