Bug #109548 Poor performance when using HASH field to check unique
Submitted: 9 Jan 2023 7:55 Modified: 20 Dec 2023 12:33
Reporter: Hope Lee (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: DML Severity:S5 (Performance)
Version:8.0.13,8.0.31 OS:Any
Assigned to: CPU Architecture:Any

[9 Jan 2023 7:55] Hope Lee
Description:
When the total length of group parts exceeds the maximum key length of TempTable, the optimizer chooses to add an extra hidden hash field to check the uniqueness of group parts. The executor calculates the hash value by XOR the hash value of each varchar field in unique_hash_group(). But if the values of any two group parts keep the same, the XOR hash value will be 0. Thus, all rows will generate the same hash value, causing very bad performance.

How to repeat:
CREATE TABLE product (
  value int NOT NULL AUTO_INCREMENT,
  code varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  name varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  comment varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  platform varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  KEY idx_key (value)
);

delimiter //
CREATE PROCEDURE BatchInsert(IN row_count int)
BEGIN
  SET @n = 1;
  REPEAT
      SET @str = (CONCAT('test', CAST(@n AS CHAR)));
      INSERT INTO product(code, name) VALUES(@str, @str);
      SET @n = @n + 1;
  UNTIL @n > row_count
  END REPEAT;
END;
//
delimiter ;

CALL BatchInsert(20000);

mysql-8.0.13 > SELECT COUNT(*) FROM ( SELECT value FROM product GROUP BY code, name, comment, platform ) derived;
+----------+
| COUNT(*) |
+----------+
|    20000 |
+----------+
1 row in set (17.84 sec)

Suggested fix:
If we can improve the hash calculating algorithm in unique_hash() for data of varchar type instead of using XOR, we can avoid the bad performance in this scenario. The following change is a simple example.

diff --git a/sql/sql_executor.cc b/sql/sql_executor.cc
index e59dac8642b..ffc16342b99 100644
--- a/sql/sql_executor.cc
+++ b/sql/sql_executor.cc
@@ -3737,7 +3737,6 @@ static bool table_rec_cmp(TABLE *table) {

 ulonglong unique_hash(Field *field, ulonglong *hash_val) {
   uchar *pos, *end;
-  ulong seed1 = 0, seed2 = 4;
   ulonglong crc = *hash_val;

   if (field->is_null()) {
@@ -3757,12 +3756,6 @@ ulonglong unique_hash(Field *field, ulonglong *hash_val) {
     Field_json *json_field = down_cast<Field_json *>(field);

     crc = json_field->make_hash_key(hash_val);
-  } else if (field->key_type() == HA_KEYTYPE_TEXT ||
-             field->key_type() == HA_KEYTYPE_VARTEXT1 ||
-             field->key_type() == HA_KEYTYPE_VARTEXT2) {
-    field->charset()->coll->hash_sort(field->charset(), (const uchar *)pos,
-                                      field->data_length(), &seed1, &seed2);
-    crc ^= seed1;
   } else
     while (pos != end)
       crc = ((crc << 8) + (((uchar) * (uchar *)pos++))) +

After the above change, the same query can be returned in 0.02 sec:
mysql-8.0.13 > SELECT COUNT(*) FROM ( SELECT value FROM product GROUP BY code, name, comment, platform ) derived;
+----------+
| COUNT(*) |
+----------+
|    20000 |
+----------+
1 row in set (0.02 sec)
[9 Jan 2023 8:54] MySQL Verification Team
Hello Hope Lee,

Thank you for the report and feedback.
Please ensure to re-send the patch via "contribution" tab. Otherwise we would not be able to accept it. Thank you!

regards,
Umesh
[9 Jan 2023 8:55] MySQL Verification Team
- 8.0.31 with default config

rm -rf 109548/
bin/mysqld --no-defaults --initialize-insecure --basedir=$PWD --datadir=$PWD/109548 --log-error-verbosity=3
bin/mysqld_safe --no-defaults --mysqld-version='' --basedir=$PWD --datadir=$PWD/109548 --core-file --socket=/tmp/mysql.sock  --port=3306 --log-error=$PWD/109548/log.err --mysqlx-port=33330 --mysqlx-socket=/tmp/mysql_x_ushastry.sock --log-error-verbosity=3  --secure-file-priv="" --local-infile=1 2>&1 &

 bin/mysql -uroot -S /tmp/mysql.sock
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 8.0.31 MySQL Community Server - GPL

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database test;
Query OK, 1 row affected (0.01 sec)

mysql> use test
Database changed
mysql> CREATE TABLE product (
    ->   value int NOT NULL AUTO_INCREMENT,
    ->   code varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
    ->   name varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   comment varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   platform varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   KEY idx_key (value)
    -> );
Query OK, 0 rows affected (0.02 sec)

mysql> delimiter //
mysql> CREATE PROCEDURE BatchInsert(IN row_count int)
    -> BEGIN
    ->   SET @n = 1;
    ->   REPEAT
    ->       SET @str = (CONCAT('test', CAST(@n AS CHAR)));
    ->       INSERT INTO product(code, name) VALUES(@str, @str);
    ->       SET @n = @n + 1;
    ->   UNTIL @n > row_count
    ->   END REPEAT;
    -> END;
    -> //
Query OK, 0 rows affected (0.01 sec)

mysql> delimiter ;
mysql> CALL BatchInsert(20000);
Query OK, 0 rows affected (51.29 sec)

mysql> CALL BatchInsert(20000);
Query OK, 0 rows affected (50.49 sec)
[9 Jan 2023 8:56] MySQL Verification Team
- 5.7.40 performance much better compared to 8.0

 bin/mysql -uroot -S /tmp/mysql.sock
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.40 MySQL Community Server (GPL)

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database test;
Query OK, 1 row affected (0.00 sec)

mysql> use test
Database changed
mysql> CREATE TABLE product (
    ->   value int NOT NULL AUTO_INCREMENT,
    ->   code varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
    ->   name varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   comment varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   platform varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
    ->   KEY idx_key (value)
    -> );
Query OK, 0 rows affected (0.02 sec)

mysql> delimiter //
mysql> CREATE PROCEDURE BatchInsert(IN row_count int)
    -> BEGIN
    ->   SET @n = 1;
    ->   REPEAT
    ->       SET @str = (CONCAT('test', CAST(@n AS CHAR)));
    ->       INSERT INTO product(code, name) VALUES(@str, @str);
    ->       SET @n = @n + 1;
    ->   UNTIL @n > row_count
    ->   END REPEAT;
    -> END;
    -> //
Query OK, 0 rows affected (0.00 sec)

mysql> delimiter ;
mysql> CALL BatchInsert(20000);
Query OK, 0 rows affected (23.62 sec)
[20 Oct 2023 10:09] Dag Wanvik
Posted by developer:
 
The below documents the effects of the proposed patch.

Performance test run without the patch. As can be seen, the
performance with the repro data runs much slower than the normal case.

    #
    # Bug#34959356: Poor performance when using HASH field to check unique
    #
    CREATE TABLE product (
    value INT NOT NULL AUTO_INCREMENT,
    code VARCHAR(255),
    name VARCHAR(255),
    comment VARCHAR(255),
    platform VARCHAR(50),
    KEY idx_key (value)
    );
    CREATE PROCEDURE BatchInsertShowIssue(IN row_count int)
    BEGIN
    START TRANSACTION;
    SET @n = 1;
    REPEAT
    SET @str = (CONCAT('test', CAST(@n AS CHAR)));
    INSERT INTO product(code, name) VALUES(@str, @str);
    SET @n = @n + 1;
    UNTIL @n > row_count
    END REPEAT;
    COMMIT;
    END;
    //
    CALL BatchInsertShowIssue(20000);
    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM ( SELECT ANY_VALUE(value)
    FROM product
    GROUP BY code, name, comment, platform ) derived;
    EXPLAIN
    -> Aggregate: count(0)  (cost=8505..8505 rows=1) (actual time=29836..29836 rows=1 loops=1)
        -> Table scan on derived  (cost=6253..6505 rows=20000) (actual time=29833..29835 rows=20000 loops=1)
            -> Materialize  (cost=6253..6253 rows=20000) (actual time=29833..29833 rows=20000 loops=1)
                -> Table scan on <temporary>  (cost=4000..4253 rows=20000) (actual time=29827..29830 rows=20000 loops=1)
                    -> Temporary table with deduplication  (cost=4000..4000 rows=20000) (actual time=29827..29827 rows=20000 loops=1)
                        -> Table scan on product  (cost=2000 rows=20000) (actual time=0.0296..32.7 rows=20000 loops=1)

    TRUNCATE product;
    CREATE PROCEDURE BatchInsertShowNormal(IN row_count int)
    BEGIN
    START TRANSACTION;
    SET @n = 1;
    REPEAT
    SET @str1 = (CONCAT('test', CAST(@n AS CHAR)));
    SET @str2 = (CONCAT('tset', CAST(@n AS CHAR)));
    INSERT INTO product(code, name) VALUES(@str1, @str2);
    SET @n = @n + 1;
    UNTIL @n > row_count
    END REPEAT;
    COMMIT;
    END;
    //
    CALL BatchInsertShowNormal(20000);
    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM ( SELECT ANY_VALUE(value)
    FROM product
    GROUP BY code, name, comment, platform ) derived;
    EXPLAIN
    -> Aggregate: count(0)  (cost=8505..8505 rows=1) (actual time=42.1..42.1 rows=1 loops=1)
        -> Table scan on derived  (cost=6253..6505 rows=20000) (actual time=38.4..41.1 rows=20000 loops=1)
            -> Materialize  (cost=6253..6253 rows=20000) (actual time=38.3..38.3 rows=20000 loops=1)
                -> Table scan on <temporary>  (cost=4000..4253 rows=20000) (actual time=33..36.3 rows=20000 loops=1)
                    -> Temporary table with deduplication  (cost=4000..4000 rows=20000) (actual time=33..33 rows=20000 loops=1)
                        -> Table scan on product  (cost=2000 rows=20000) (actual time=0.0253..14.3 rows=20000 loops=1)

    DROP PROCEDURE BatchInsertShowIssue;
    DROP PROCEDURE BatchInsertShowNormal;
    DROP TABLE product;

Performance test run with the patch. As can be seen, the performance with the repro data runs
as fast as the normal case, which ran fast even without the patch.

    #
    # Bug#34959356: Poor performance when using HASH field to check unique
    #
    CREATE TABLE product (
    value INT NOT NULL AUTO_INCREMENT,
    code VARCHAR(255),
    name VARCHAR(255),
    comment VARCHAR(255),
    platform VARCHAR(50),
    KEY idx_key (value)
    );
    CREATE PROCEDURE BatchInsertShowIssue(IN row_count int)
    BEGIN
    START TRANSACTION;
    SET @n = 1;
    REPEAT
    SET @str = (CONCAT('test', CAST(@n AS CHAR)));
    INSERT INTO product(code, name) VALUES(@str, @str);
    SET @n = @n + 1;
    UNTIL @n > row_count
    END REPEAT;
    COMMIT;
    END;
    //
    CALL BatchInsertShowIssue(20000);
    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM ( SELECT ANY_VALUE(value)
    FROM product
    GROUP BY code, name, comment, platform ) derived;
    EXPLAIN
    -> Aggregate: count(0)  (cost=8505..8505 rows=1) (actual time=42.1..42.1 rows=1 loops=1)
        -> Table scan on derived  (cost=6253..6505 rows=20000) (actual time=38.4..41.1 rows=20000 loops=1)
            -> Materialize  (cost=6253..6253 rows=20000) (actual time=38.4..38.4 rows=20000 loops=1)
                -> Table scan on <temporary>  (cost=4000..4253 rows=20000) (actual time=33.1..36.4 rows=20000 loops=1)
                    -> Temporary table with deduplication  (cost=4000..4000 rows=20000) (actual time=33.1..33.1 rows=20000 loops=1)
                        -> Table scan on product  (cost=2000 rows=20000) (actual time=0.0232..14.3 rows=20000 loops=1)

    TRUNCATE product;
    CREATE PROCEDURE BatchInsertShowNormal(IN row_count int)
    BEGIN
    START TRANSACTION;
    SET @n = 1;
    REPEAT
    SET @str1 = (CONCAT('test', CAST(@n AS CHAR)));
    SET @str2 = (CONCAT('tset', CAST(@n AS CHAR)));
    INSERT INTO product(code, name) VALUES(@str1, @str2);
    SET @n = @n + 1;
    UNTIL @n > row_count
    END REPEAT;
    COMMIT;
    END;
    //
    CALL BatchInsertShowNormal(20000);
    EXPLAIN ANALYZE
    SELECT COUNT(*)
    FROM ( SELECT ANY_VALUE(value)
    FROM product
    GROUP BY code, name, comment, platform ) derived;
    EXPLAIN
    -> Aggregate: count(0)  (cost=8505..8505 rows=1) (actual time=41.7..41.7 rows=1 loops=1)
        -> Table scan on derived  (cost=6253..6505 rows=20000) (actual time=38..40.7 rows=20000 loops=1)
            -> Materialize  (cost=6253..6253 rows=20000) (actual time=38..38 rows=20000 loops=1)
                -> Table scan on <temporary>  (cost=4000..4253 rows=20000) (actual time=32.7..36 rows=20000 loops=1)
                    -> Temporary table with deduplication  (cost=4000..4000 rows=20000) (actual time=32.7..32.7 rows=20000 loops=1)
                        -> Table scan on product  (cost=2000 rows=20000) (actual time=0.0346..14.4 rows=20000 loops=1)

    DROP PROCEDURE BatchInsertShowIssue;
    DROP PROCEDURE BatchInsertShowNormal;
    DROP TABLE product;
[20 Dec 2023 12:33] Jon Stephens
Documented fix as follows in the MySQL 8.0.36 and 8.3.0 changelogs:

    The hashing algorithm employed yielded poor performance when
    using a HASH field to check for uniqueness.

Closed.