Bug #118299 Multi-valued Index on Composite Key Returns Incomplete Results for Rows with Empty JSON Array('[]')
Submitted: 29 May 6:57 Modified: 21 Aug 2:18
Reporter: Yichang SONG (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:8.0.42, 8.4.5 OS:Any
Assigned to: CPU Architecture:Any

[29 May 6:57] Yichang SONG
Description:
A critical issue has been identified with MySQL's multi-valued index feature when it also includes non-virtual columns.

Specifically, when the multi-valued component (a JSON array cast to unsigned array) in a row is an empty JSON array ('[]'), the entire row unexpectedly fails to be indexed by this composite multi-valued index for the non-virtual part. This leads to queries utilizing this index returning incomplete or incorrect results, effectively "losing" rows that should otherwise be matched based on the non-virtual columns within the composite key.

How to repeat:
CREATE TABLE `t3` (
  `c1` int,
  `c2` json,
  KEY (`c1`,(cast(`c2` as unsigned array)))
);
INSERT INTO `t3`(c1,c2) values(1,'[]');
INSERT INTO `t3`(c1,c2) values(1,NULL);

mysql> SELECT * FROM t3;
+------+------+
| c1   | c2   |
+------+------+
|    1 | []   |
|    1 | NULL |
+------+------+
2 rows in set (0.00 sec)
(Result is correct)

mysql> SELECT * FROM t3 where `c1`=1;
+------+------+
| c1   | c2   |
+------+------+
|    1 | NULL |
+------+------+
1 row in set (0.00 sec)
(Only one row is returned.)

Observed Behavior:
When a WHERE clause references the non-virtual column (c1 in the provided example) that is the leading part of such a composite multi-valued index, rows where the JSON multi-valued part (c2) is an empty array ('[]') are silently omitted from the result set. However, a full table scan or primary key lookup correctly retrieves all rows. This discrepancy indicates an issue with how InnoDB handles rows with empty JSON arrays when inserting into such specific multi-valued composite indexes.

Root Cause (Suspected):
Based on analysis, it appears that during index insertion for a multi-valued index, if the virtual column derived from the JSON array (CAST(c2 AS UNSIGNED ARRAY)) evaluates to an empty array ('[]'), it is internally treated as UNIV_NO_INDEX_VALUE (as referenced in row0row.h line 92). While this might be an intentional behavior for a multi-valued index consisting solely of the JSON virtual column (as there are no values to index), when it's part of a composite index with other, indexable columns (like c1), this UNIV_NO_INDEX_VALUE for one part of the key causes the entire entry for that row to be skipped in the composite index. This prevents the c1 component of the index from being correctly recorded for rows with empty JSON arrays.

In contrast, NULL values for c2 do seem to allow the c1 component of the composite index to be correctly inserted and searchable, highlighting the specific problem with empty arrays.

Suggested fix:
Given that NULL values for the c2 column do result in an index entry for c1 (e.g., the row (1, NULL) is indexed for c1=1), the system should apply similar logic for empty JSON arrays ('[]'). When CAST(c2 AS UNSIGNED ARRAY) evaluates to an empty array, the index should still insert an entry based on the c1 component (e.g., (1, NULL) or (1, <some_special_empty_marker>)). This would ensure that rows with empty JSON arrays are not 'lost' when querying the c1 part of the composite index, thereby maintaining data consistency and query accuracy. The goal is to allow the c1 part of the composite index to be searchable for these rows, even if the multi-valued part has no actual values.

The multi-valued index logic can be revised such that UNIV_NO_INDEX_VALUE for the multi-valued part of a composite index does not cause the entire row's entry to be skipped from the index if other parts of the composite index (e.g., c1) are valid and indexable. An entry for that row should still be created in the index based on the indexable columns. This would allow queries on c1 to correctly locate all matching rows, even if the multi-valued c2 component is NULL or an empty array (which it currently doesn't for []).

If the direct solution is technically challenging to implement, consider restricting the creation of composite multi-valued indexes where a non-virtual column is combined with a multi-valued JSON virtual column that can potentially evaluate to an empty array. The server could either issue a warning or outright reject such CREATE INDEX statements to prevent users from creating an index that silently drops data. However, this limits the flexibility of multi-valued indexes and is less ideal than a direct fix to the indexing behavior.
[29 May 9:46] Yichang SONG
I examined WL#8955: Add support for multi-valued indexes, which sheds critical light on the handling of empty arrays and provides context for the observed behavior:
"
 If multi-value key part's generation function returns empty array, nothing
 is inserted into the index, and the data record will not be accessible via
 index scan. 

"

This statement explicitly confirms that the observed behavior—where no index entry is created for rows where CAST(c2 AS UNSIGNED ARRAY) evaluates to an empty array ('[]'), making them inaccessible via this index scan—is per design.

However, WL#8955 also includes crucial expectations and warnings:
"
F6) Regular access method (index scan, ref, range (plain or MRR)) over
 multiple-valued indexes should be supported.
......
F8) Usage of multi-valued index in any other way, except described in F6 (e.g
index scan, or using only prefix) should be prohibited, to avoid performance
regressions/wrong results.
"

F8 clearly anticipates scenarios where queries might lead to "wrong results" if they don't fully conform to the "regular access methods" described in F6. However, our current query, SELECT c1 FROM t3 WHERE c1=1;, while seemingly a "regular access method" as per F6, paradoxically results in "wrong results" due to the specific handling of empty arrays.

Given that, perhaps the fix for this bug could be implemented at the server layer by classifying SELECT c1 FROM t3 WHERE c1=1; as falling under F8 (prohibited usage), thereby preventing this query from automatically (or even with FORCE INDEX) selecting the multi-valued index.
[29 May 11:27] MySQL Verification Team
Hello Yichang Song,

Thank you for the report and feedback.

regards,
Umesh
[21 Aug 2:18] Yichang SONG
I'd like to add a further observation to this report. The issue described also appears to affect aggregate functions like COUNT(*).

For instance, using the same test case, the query:
SELECT COUNT(*) FROM t3 WHERE c1 = 1;
returns an incorrect result of 1, when the expected result is 2.

This behavior seems quite similar to what is described in the S1 (Critical) Bug #104898 (https://bugs.mysql.com/bug.php?id=104898), which also reports incorrect COUNT(*) results when a multi-valued index is utilized.

This suggests the problem isn't limited to failing to retrieve specific rows via SELECT *, but that the optimizer's use of this index can lead to incorrect outcomes for different types of queries, including aggregations. This could be particularly misleading for users who rely on such aggregate results for data analysis.