| Bug #115434 | collation: utf8mb4_generail_ci can result in serious data skew | ||
|---|---|---|---|
| Submitted: | 26 Jun 2024 8:35 | Modified: | 1 Jul 2024 8:35 | 
| Reporter: | Chaofan Wang | Email Updates: | |
| Status: | Won't fix | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S4 (Feature request) | 
| Version: | 8.0 | OS: | Any | 
| Assigned to: | CPU Architecture: | Any | |
| Tags: | utf8mb4 charset collation | ||
   [26 Jun 2024 9:38]
   MySQL Verification Team        
  Hi Mr. Wang, Thank you for your bug report. We managed to repeat it with latest 8.0 and 8.4: TABLE_SCHEMA TABLE_NAME PARTITION_NAME TABLE_ROWS sc users p0 185 sc users p1 0 sc users p2 385 sc users p3 1 sc users p4 151 sc users p5 3 sc users p6 274 sc users p7 1 However, this is not a bug, but a feature request. A new feature would be a better distribution of the values among the partitions. Verified as a feature request for the version 8.0 and higher. Thank you for pointing us to the problem in the code.
   [26 Jun 2024 18:49]
   Bernt Marius Johnsen        
  I suggest you use utf8mb4_0900_ai_ci which works much better (in several ways). Using your repro but with utf8mb4_0900_ai_ci I get:
mysql> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `username` varchar(255) NOT NULL,
  `id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY KEY (username)
PARTITIONS 8 */
1 row in set (0,01 sec)
mysql> SELECT table_schema, table_name, partition_name, table_rows FROM information_schema.partitions WHERE table_name = 'users' AND table_schema = 'test';
+--------------+------------+----------------+------------+
| TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | TABLE_ROWS |
+--------------+------------+----------------+------------+
| test         | users      | p0             |        132 |
| test         | users      | p1             |        113 |
| test         | users      | p2             |        117 |
| test         | users      | p3             |        115 |
| test         | users      | p4             |        130 |
| test         | users      | p5             |        126 |
| test         | users      | p6             |        152 |
| test         | users      | p7             |        115 |
+--------------+------------+----------------+------------+
8 rows in set (0,00 sec)
 
   [27 Jun 2024 10:13]
   MySQL Verification Team        
  Thank you, Bernt.
   [1 Jul 2024 8:35]
   Roy Lyseng        
  Since, we cannot change existing collations, and there is a reasonable workaround to upgrade to a more recent collation, we are closing this report as it is not feasible to fix.
   [1 Jul 2024 11:20]
   MySQL Verification Team        
  Thank you, Roy.


Description: when we using a parition table, with it's strategy is key and partition column is varchar/char type with collation of utf8mb4_generail_ci, data skew can be happened quite easy. How to repeat: 1. create a partition table CREATE TABLE `users` ( `username` varchar(255) COLLATE utf8mb4_general_ci NOT NULL, `id` int DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci PARTITION BY KEY (username) PARTITIONS 8 2. use a stored procedure to generate random data CREATE PROCEDURE `gen_data`(val int) begin declare i int default 0; while i < val do insert into users values (SUBSTRING(CONV(FLOOR(RAND() * 99999999999999), 10, 36), 1, 8), i); set i = i+1; end while; end; 3. call procedure to generate data call gen_data(1000); 4. check data is quite random select * from users; 5. check data skew happend SELECT table_schema, table_name, partition_name, table_rows FROM information_schema.partitions WHERE table_name = 'users' AND table_schema = $schema_name; You will see a phenomenon similar to the one shown in the below. +--------------+------------+----------------+------------+ | TABLE_SCHEMA | TABLE_NAME | PARTITION_NAME | TABLE_ROWS | +--------------+------------+----------------+------------+ | test | users | p0 | 169 | | test | users | p1 | 0 | | test | users | p2 | 407 | | test | users | p3 | 1 | | test | users | p4 | 150 | | test | users | p5 | 2 | | test | users | p6 | 271 | | test | users | p7 | 0 | +--------------+------------+----------------+------------+ Suggested fix: The reason is the calculate of hashcode under utf8mb4_general_ci is not very random, you can see my_hash_sort_utf8mb4 function of ctype-utf8.cc. The core calculate logic is shown as follows: while ((res = my_mb_wc_utf8mb4(&wc, s, e)) > 0) { my_tosort_unicode(uni_plane, &wc, cs->state); ch = (wc & 0xFF); tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8); tmp2 += 3; ch = (wc >> 8) & 0xFF; tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8); tmp2 += 3; if (wc > 0xFFFF) { /* Put the highest byte only if it is non-zero, to make hash functions for utf8mb3 and utf8mb4 compatible for BMP characters. This is useful to keep order of records in test results, e.g. for "SHOW GRANTS". */ ch = (wc >> 16) & 0xFF; tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8); tmp2 += 3; } s += res; } When our data contains only numbers and letters (which is quite common), the above logic can be simplified as: while ((res = my_mb_wc_utf8mb4(&wc, s, e)) > 0) { my_tosort_unicode(uni_plane, &wc, cs->state); ch = (wc & 0xFF); / * The initial value of temp2 is 4, so the value of temp2 here must be an even number. Therefore, once temp1 is an even number, the calculated value of temp1 obtained here will definitely be an even number. */ tmp1 ^= (((tmp1 & 63) + tmp2) * ch) + (tmp1 << 8); tmp2 += 3; /* If temp1 is an even number, then it is clear that the temp1 calculated here will also be an even number. */ tmp1 ^= (tmp1 << 8); tmp2 += 3; s += res; } Therefore, once temp1 is an even number, the calculated value of temp1 obtained here will definitely be an even number. Suggested fix: add more random steps when calculating hashcode, but this may result in incompatibility between previous and subsequent versions.