Bug #99687 Wrong result of SELECT DISTINCT JSON_ARRAYAGG
Submitted: 25 May 2020 14:30 Modified: 28 May 2020 0:03
Reporter: Qingping Zhu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: DML Severity:S2 (Serious)
Version:8.0.20 OS:CentOS
Assigned to: CPU Architecture:Any
Tags: distinct, JSON_ARRAYAGG

[25 May 2020 14:30] Qingping Zhu
Description:
See how to repeat.

How to repeat:
-- Remove ONLY_FULL_GROUP_BY
SET sql_mode='NO_AUTO_VALUE_ON_ZERO,STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ENGINE_SUBSTITUTION'; 
DROP TABLE IF EXISTS t1; 
CREATE TABLE t1 
  ( 
     id            INT PRIMARY KEY, 
     bigint_field  BIGINT, 
     char_field    CHAR(10), 
     varchar_field VARCHAR(10), 
     text_field    TEXT 
  ); 
INSERT INTO t1 VALUES (580801, 580801, "580801", "580801", "580801");
INSERT INTO t1 VALUES (580901, 580901, "580901", "580901", "580901");
INSERT INTO t1 VALUES (581001, 581001, "581001", "581001", "581001");
INSERT INTO t1 VALUES (581101, 581101, "581101", "581101", "581101");
INSERT INTO t1 VALUES (581201, 581201, "581201", "581201", "581201");
INSERT INTO t1 VALUES (581301, 581301, "581301", "581301", "581301");
INSERT INTO t1 VALUES (581401, 581401, "581401", "581401", "581401");
INSERT INTO t1 VALUES (5807701, 5807701, "5807701", "5807701", "5807701");

-- Wrong result
SELECT DISTINCT JSON_ARRAYAGG(char_field), JSON_ARRAYAGG(2) FROM t1 WHERE (id=580701 or id=580801 or id=580901 or id=581001 or id=581101 or id=581201 or id=581301 or id=581401 or id=5807701) GROUP BY id, text_field ORDER BY id, text_field LIMIT 10;
+---------------------------+------------------+
| JSON_ARRAYAGG(char_field) | JSON_ARRAYAGG(2) |
+---------------------------+------------------+
| ["581001"]                | [2]              |
| ["581101"]                | [2]              |
| ["581201"]                | [2]              |
| ["581401"]                | [2]              |
| ["581301"]                | [2]              |
| ["580801"]                | [2]              |
| ["580901"]                | [2]              |
| ["5807701"]               | [2]              |
+---------------------------+------------------+

-- Correct result
SELECT DISTINCT JSON_ARRAYAGG(char_field), JSON_ARRAYAGG(2), id FROM t1 WHERE (id=580701 or id=580801 or id=580901 or id=581001 or id=581101 or id=581201 or id=581301 or id=581401 or id=5807701) GROUP BY id, text_field ORDER BY id, text_field LIMIT 10;
+---------------------------+------------------+---------+
| JSON_ARRAYAGG(char_field) | JSON_ARRAYAGG(2) | id      |
+---------------------------+------------------+---------+
| ["580801"]                | [2]              |  580801 |
| ["580901"]                | [2]              |  580901 |
| ["581001"]                | [2]              |  581001 |
| ["581101"]                | [2]              |  581101 |
| ["581201"]                | [2]              |  581201 |
| ["581301"]                | [2]              |  581301 |
| ["581401"]                | [2]              |  581401 |
| ["5807701"]               | [2]              | 5807701 |
+---------------------------+------------------+---------+

Suggested fix:
mysql> EXPLAIN FORMAT=TREE SELECT DISTINCT JSON_ARRAYAGG(char_field), JSON_ARRAYAGG(2) FROM t1 WHERE (id=580701 or id=580801 or id=580901 or id=581001 or id=581101 or id=581201 or id=581301 or id=581401 or id=5807701) GROUP BY id, text_field ORDER BY id, text_field LIMIT 10;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Limit: 10 row(s)
    -> Sort with duplicate removal: `JSON_ARRAYAGG(char_field)`, `JSON_ARRAYAGG(2)`
        -> Stream results
            -> Group aggregate: json_arrayagg(t1.char_field), json_arrayagg(2)
                -> Sort: t1.id, t1.text_field  (cost=1.05 rows=8)
                    -> Filter: ((t1.id = 580701) or (t1.id = 580801) or (t1.id = 580901) or (t1.id = 581001) or (t1.id = 581101) or (t1.id = 581201) or (t1.id = 581301) or (t1.id = 581401) or (t1.id = 5807701))
                        -> Table scan on t1
 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In my opinion , optimizer need to add an extra SORT operator above "Sort with duplicate removal".
[25 May 2020 15:15] MySQL Verification Team
Thank you for the bug report.
[28 May 2020 0:03] Jon Stephens
Documented fix as follows in the MySQL 8.0.21 changelog:

    When performing a filesort for removing duplicates, such as when
    executing SELECT DISTINCT, it may be necessary to perform
    another sort afterwards to satisfy an ORDER BY. In cases where
    such an ORDER BY had been pushed down into the first table of a
    join, as opposed to the join as a whole, this final sort was not
    actually performed.

Closed.