MySQL Bugs: #104897: Multi-valued index performance is too slow (Access too many rows than required)

Bug #104897	Multi-valued index performance is too slow (Access too many rows than required)
Submitted:	10 Sep 2021 6:29	Modified:	6 Aug 2024 13:18
Reporter:	Seunguck Lee	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	8.0.26, 8.0.17	OS:	Any
Assigned to:		CPU Architecture:	x86
Tags:	multi-valued index

Description:
The query which has "greater or less than (> or <)" condition on multi-valued index is too slow.
If there's only EQUAL condition, much faster on multi-valued index.

Following result shows the performance difference. If there's many rows in the table (regardless of output rows), performance difference is much bigger than following case. (For example, there's 30M rows on table, slower query takes about 5 minutes)

But according to the output of "EXPLAIN ANALYZE", there's almost no difference between slower and faster case. Output shows only difference on "time" and message about "p_at>=?" condition.

-- // Slower case
SELECT COUNT(*)
FROM test_mvi
WHERE type IN ('T1', 'T2')
  AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON))
  AND p_at>='2021-08-10 13:05:46.567187';
+----------+
| COUNT(*) |
+----------+
|       81 |
+----------+
1 row in set (2.01 sec)

| EXPLAIN
| -> Aggregate: count(0)  (cost=54.26 rows=95) (actual time=1950.447..1950.447 rows=1 loops=1)
    -> Filter: ((test_mvi.`type` in ('T1','T2')) and json_overlaps(cast(json_extract(hids,_utf8mb4'$') as unsigned array),json'[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]') and (test_mvi.p_at >= TIMESTAMP'2021-08-10 13:05:46.567187'))  (cost=44.76 rows=95) (actual time=0.102..1950.424 rows=81 loops=1)
        -> Index range scan on test_mvi using ix_type_hid_pat  (cost=44.76 rows=95) (actual time=0.087..1685.407 rows=99808 loops=1)
1 row in set (1.95 sec)

-- // Faster case
SELECT COUNT(*)
FROM test_mvi
WHERE type IN ('T1', 'T2')
  AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON));
+----------+
| COUNT(*) |
+----------+
|       81 |
+----------+
1 row in set (0.00 sec)

| EXPLAIN
| -> Aggregate: count(0)  (cost=54.26 rows=95) (actual time=1.482..1.482 rows=1 loops=1)
    -> Filter: ((test_mvi.`type` in ('T1','T2')) and json_overlaps(cast(json_extract(hids,_utf8mb4'$') as unsigned array),json'[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]'))  (cost=44.76 rows=95) (actual time=0.099..1.460 rows=81 loops=1)
        -> Index range scan on test_mvi using ix_type_hid_pat  (cost=44.76 rows=95) (actual time=0.085..0.966 rows=81 loops=1)

How to repeat:
-- // Create test table
CREATE TABLE test_mvi (
  id bigint NOT NULL AUTO_INCREMENT,
  type enum('T1','T2','T3') NOT NULL,
  p_at datetime(6) NOT NULL,
  hids json DEFAULT (json_array()),
  PRIMARY KEY (id),
  KEY ix_type_hid_pat (type,(cast(json_extract(hids,_utf8mb4'$') as unsigned array)),p_at)
);

-- // Load 10K sample data (I will attach later)

SELECT COUNT(*) FROM test_mvi;
+----------+
| COUNT(*) |
+----------+
|   100000 |
+----------+

-- // Compare query performance
SELECT COUNT(*)
FROM test_mvi
WHERE type IN ('T1', 'T2')
  AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON))
  AND p_at>='2021-08-10 13:05:46.567187';
+----------+
| COUNT(*) |
+----------+
|       81 |
+----------+
1 row in set (2.01 sec)

SELECT COUNT(*)
FROM test_mvi
WHERE type IN ('T1', 'T2')
  AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON));
+----------+
| COUNT(*) |
+----------+
|       81 |
+----------+
1 row in set (0.00 sec)

Suggested fix:
Looks server global status shows some clue why the "p_at>=?" condition query is so slow. 

--------------------------------------------------------
                       with (p_at>=?)  without (p_at>=?)
--------------------------------------------------------
Innodb_rows_read               422781                 92
Handler_commit                      1                  1
Handler_external_lock               2                  2
Handler_read_key                    8                  8
Handler_read_next              422773                 92
Handler_write                  422773                 92
--------------------------------------------------------

Total key entry of multi-valued index is 399997. Looks test_mvi rows are read for the number of multi-valued index key entry when "p_at>=?" condition is provided.

SELECT SUM(JSON_LENGTH(hids)) FROM test_mvi;
+------------------------+
| SUM(JSON_LENGTH(hids)) |
+------------------------+
|                 399997 |
+------------------------+

I think MySQL server can process this query by reading only required rows from table.

This is part of full test data. I have uploaded full test data via SFTP

Attachment: test_mvi-partial.sql.tar.gz (application/x-gzip, text), 264.30 KiB.

Hello Seunguck Lee,

Thank you for the report and test case.

regards,
Umesh

I could reproduce this issue in 8.0.28, and checked there is issue with index ordering. I added the index with changed ordering and just it makes count different.

I loaded in the SQL file and double it some times.
mysql> insert into test_mvi(type, p_at,hids) SELECT type, p_at,hids FROM test_mvi;
Query OK, 8380 rows affected (0.32 sec)
Records: 8380  Duplicates: 0  Warnings: 0

mysql> insert into test_mvi(type, p_at,hids) SELECT type, p_at,hids FROM test_mvi;
Query OK, 16760 rows affected (0.42 sec)
Records: 16760  Duplicates: 0  Warnings: 0

mysql> insert into test_mvi(type, p_at,hids) SELECT type, p_at,hids FROM test_mvi;
Query OK, 33520 rows affected (0.96 sec)
Records: 33520  Duplicates: 0  Warnings: 0

mysql> insert into test_mvi(type, p_at,hids) SELECT type, p_at,hids FROM test_mvi;
Query OK, 67040 rows affected (1.95 sec)
Records: 67040  Duplicates: 0  Warnings: 0

mysql> insert into test_mvi(type, p_at,hids) SELECT type, p_at,hids FROM test_mvi;
Query OK, 134080 rows affected (5.67 sec)
Records: 134080  Duplicates: 0  Warnings: 0

I added index with different ordering with exist index..
mysql> ALTER TABLE test_mvi DROP KEY ix_type_hid_pat_2, ADD KEY ix_type_hid_pat_2(`type`, `p_at`, (cast(json_extract(`hids`,_utf8mb4'$') as unsigned array)));
Query OK, 268160 rows affected (10.94 sec)
Records: 268160  Duplicates: 0  Warnings: 0

mysql> SHOW CREATE TABLE test_mvi\G
*************************** 1. row ***************************
       Table: test_mvi
Create Table: CREATE TABLE `test_mvi` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `type` enum('T1','T2','T3') NOT NULL,
  `p_at` datetime(6) NOT NULL,
  `hids` json DEFAULT (json_array()),
  PRIMARY KEY (`id`),
  KEY `ix_type_hid_pat` (`type`,(cast(json_extract(`hids`,_utf8mb4'$') as unsigned array)),`p_at`),
  KEY `ix_type_hid_pat_2` (`type`,`p_at`,(cast(json_extract(`hids`,_utf8mb4'$') as unsigned array)))
) ENGINE=InnoDB AUTO_INCREMENT=450741 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.08 sec)

And I ran same query but using other indexes, and checked there are differences. Although it's same query without hint, counts are different. 

-- with the new index order, it returns more counts.
mysql> SELECT COUNT(*) FROM test_mvi USE INDEX (ix_type_hid_pat_2) WHERE type IN ('T1')   AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON)) ;
+----------+
| COUNT(*) |
+----------+
|      896 |
+----------+
1 row in set (4.49 sec)

-- with other index - it returns row normally.
mysql> SELECT COUNT(*) FROM test_mvi USE INDEX (ix_type_hid_pat) WHERE type IN ('T1')   AND JSON_OVERLAPS(hids->'$', CAST('[613349414265683967, 617853013892792319, 613349377578106879, 617852977203904511]' AS JSON)) ;
+----------+
| COUNT(*) |
+----------+
|      224 |
+----------+
1 row in set (0.06 sec)

Hi Mr. Kim,

Thank you for your contribution. It will be very useful when it comes to improving the performance of queries that use many tuples in an index.

However, there is no RDBMS that will automatically create the best possible index for its future usage. That is why this job is left to DBA's, schema normalisers, query tuners etc. Simply a RDBMS can not predict what will the index be used for. It is , however, possible to make the analysis like yours.

Thank you for your contribution.

Documented fix as follows in the MySQL 8.0.40, 8.4.3, and 9.1.0 changelogs:

    A query using a greater-than (>) or less-than (<)
    comparison with a multi-valued index executed much more slowly
    than the same query using an equality (=) comparison with the
    same index.

Closed.

Thank you Jon.