Bug #100107 MySQL 8.0.20 Bug? Groups duplicated when using GROUP BY and COUNT(DISTINCT )
Submitted: 3 Jul 2020 16:57 Modified: 15 Jul 2020 14:17
Reporter: John Pauler Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: DML Severity:S3 (Non-critical)
Version:8.0.20 OS:MacOS
Assigned to: CPU Architecture:Any

[3 Jul 2020 16:57] John Pauler
Description:
The same code consistently produces the expected result when run using MySQL 8.0.16, but consistently produces the confusing result when run in MySQL 8.0.20. Multiple students have confirmed the same bug presenting when run on 8.0.20 on their machines, all running MySQL Community Server locally, using MySQL Workbench.

The code which produces the duplicated rows is immediately below. Pretty simple query to create a monthly trend of a count of website sessions for a given time period:

SELECT 
    YEAR(website_sessions.created_at) AS yr, 
    MONTH(website_sessions.created_at) AS mon,
    COUNT(DISTINCT website_session_id) AS session_volume 
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY 1,2;

The query shown immediately above consistently produces 2 groups for 2012-11, 3 groups for 2012-12, 3 groups for 2013-1, and 2 groups for 2013-2.

yr, mon, session_volume
2012, 4, 3734
2012, 5, 3736
2012, 6, 3963
2012, 7, 4249
2012, 8, 6097
2012, 9, 6546
2012, 10, 8183
2012, 11, 13979
2012, 12, 9827
2012, 11, 32
2012, 12, 208
2013, 1, 3545
2012, 12, 37
2013, 1, 2820
2013, 2, 4350
2013, 1, 36
2013, 2, 2818
2013, 3, 6264
2013, 4, 1209

If we remove DISTINCT from inside the COUNT( ) function, the duplication does not present itself. The code immediately below runs fine, without dupes:

SELECT 
    YEAR(website_sessions.created_at) AS yr, 
    MONTH(website_sessions.created_at) AS mon,
    COUNT(website_session_id) AS session_volume 
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY 1,2;

Alternatively, if we use DATE_FORMAT( ) instead of the YEAR() and MONTH( ) functions to create our trend groups, we can still use DISTINCT with no duplication. See query immediately below:

SELECT 
    DATE_FORMAT(website_sessions.created_at,'%Y') AS yr, 
    DATE_FORMAT(website_sessions.created_at,'%M') AS mon,
    COUNT(DISTINCT website_session_id) AS session_volume
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY 1,2;

The first of the 3 queries is the only one that produces duplicate groups. The second and third query never produces duplicates.

We have used the same COUNT(DISTINCT ) methodology for years, without seeing this duplication issue that has only occurred in version 8.0.20 to my knowledge.

Anyone have any ideas how we can be seeing the same values in our GROUP columns when using GROUP BY here? Or any ideas as to why eliminating DISTINCT or using DATE_FORMAT( ) does not produce the same problem?

How to repeat:
Run the following query on the database in question using MySQL 8.0.20 (I am aware you don't have this particular database. I could send you a large create script if that would be helpful): 

SELECT 
    YEAR(website_sessions.created_at) AS yr, 
    MONTH(website_sessions.created_at) AS mon,
    COUNT(DISTINCT website_session_id) AS session_volume 
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY 1,2;
[3 Jul 2020 17:04] MySQL Verification Team
Thank you for the bug report. Please attach the complete test case, create table, insert data, queries statements with actual result and expected result. Thanks.
[3 Jul 2020 17:36] John Pauler
I added file mysql-bug-data-100107 via SFTP site to /support/incoming.

Test case is to run the query below, using MySQL version 8.0.20: 

SELECT 
    YEAR(website_sessions.created_at) AS yr, 
    MONTH(website_sessions.created_at) AS mon,
    COUNT(DISTINCT website_session_id) AS session_volume 
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY 1,2;

Expected result is a normal GROUP BY output, where the columns specified in the GROUP BY do not get duplicated. 

But instead, you will likely see the following duped results, with the same values in the GROUP BY columns repeating unexpectedly: 

yr, mon, session_volume
2012, 4, 3734
2012, 5, 3736
2012, 6, 3963
2012, 7, 4249
2012, 8, 6097
2012, 9, 6546
2012, 10, 8183
2012, 11, 13979
2012, 12, 9827
2012, 11, 32
2012, 12, 208
2013, 1, 3545
2012, 12, 37
2013, 1, 2820
2013, 2, 4350
2013, 1, 36
2013, 2, 2818
2013, 3, 6264
2013, 4, 1209

Let me know if I can provide any more information. Thank you for your help!
[3 Jul 2020 17:38] John Pauler
This is the expected result from the query (although this is not what we see): 

yr, mon, session_volume
2012, 4, 3734
2012, 5, 3736
2012, 6, 3963
2012, 7, 4249
2012, 8, 6097
2012, 9, 6546
2012, 10, 8183
2012, 11, 14011
2012, 12, 10072
2013, 1, 6401
2013, 2, 7168
2013, 3, 6264
2013, 4, 1209
[7 Jul 2020 0:32] MySQL Verification Team
Hi,

File mysql-bug-data-100107 nor .zip nor .tar.gz nor .tgz does not exist on our sftp, are you sure you uploaded it? Having a script that creates the table and insert the data would be incredibly useful. If you compress the script it should be below 3MB so you can upload directly to the bug.

With regards to the reported problem, what is the SQL_MODE on the server where you see the results you like (8.0.16) and on the server where you do not see results you like (8.0.20)?

If you align the sql_mode values between the servers, do you get the same results?

Kind regards
Bogdan

p.s. I doubt this is a bug, but before I get your test case I can't be 100% sure. Feel free to look at:

https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html
https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html

> ONLY_FULL_GROUP_BY also affects handling of queries that use DISTINCT and ORDER BY

>  To order the result, duplicates must be eliminated first. But to do so,
> should we keep the first row or the third? This arbitrary choice influences
> the retained value of c3, which in turn influences ordering and makes it
> arbitrary as well. To prevent this problem, a query that has DISTINCT and
> ORDER BY is rejected as invalid if any ORDER BY expression does not satisfy
> at least one of these conditions:
>
>    The expression is equal to one in the select list
>
>    All columns referenced by the expression and belonging to the query's
> selected tables are elements of the select list
[7 Jul 2020 12:00] MySQL Verification Team
Hi,

Found the file :) (it was .sql so mysql-bug-data-100107.sql ).

For start, default 8.0.20

ERROR 1292 (22007) at line 35: Incorrect datetime value: '2012-03-25 02:23:06' for column 'created_at' at row 898

mysql [localhost:8020] {msandbox} ((none)) > select @@sql_mode;
+-----------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

clearing sql_mode to be "", imported your data.

Lemme now see what do we have here :)

all best
Bogdan
[7 Jul 2020 12:03] MySQL Verification Team
mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > SELECT
    ->     YEAR(website_sessions.created_at) AS yr,
    ->     MONTH(website_sessions.created_at) AS mon,
    ->     COUNT(DISTINCT website_session_id) AS session_volume
    -> FROM website_sessions
    -> WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05'
    -> GROUP BY 1,2;
+------+------+----------------+
| yr   | mon  | session_volume |
+------+------+----------------+
| 2012 |    4 |           3734 |
| 2012 |    5 |           3736 |
| 2012 |    6 |           3963 |
| 2012 |    7 |           4249 |
| 2012 |    8 |           6097 |
| 2012 |    9 |           6546 |
| 2012 |   10 |           8183 |
| 2012 |   11 |          13979 |
| 2012 |   12 |           9827 |
| 2012 |   11 |             32 |
| 2012 |   12 |            208 |
| 2013 |    1 |           3545 |
| 2012 |   12 |             37 |
| 2013 |    1 |           2820 |
| 2013 |    2 |           4350 |
| 2013 |    1 |             36 |
| 2013 |    2 |           2818 |
| 2013 |    3 |           6264 |
| 2013 |    4 |           1209 |
+------+------+----------------+
19 rows in set (0.24 sec)

mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > select @@sql_mode;
+------------+
| @@sql_mode |
+------------+
|            |
+------------+
1 row in set (0.00 sec)
[7 Jul 2020 12:04] MySQL Verification Team
mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > select @@sql_mode;
+-----------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION |
+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > SELECT
    ->     YEAR(website_sessions.created_at) AS yr,
    ->     MONTH(website_sessions.created_at) AS mon,
    ->     COUNT(DISTINCT website_session_id) AS session_volume
    -> FROM website_sessions
    -> WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05'
    -> GROUP BY 1,2;
+------+------+----------------+
| yr   | mon  | session_volume |
+------+------+----------------+
| 2012 |    4 |           3714 |
| 2012 |    5 |           3728 |
| 2012 |    6 |           3990 |
| 2012 |    7 |           4227 |
| 2012 |    8 |           6089 |
| 2012 |    9 |           6541 |
| 2012 |   10 |           8158 |
| 2012 |   11 |           7702 |
| 2012 |   12 |             16 |
| 2012 |   11 |           6342 |
| 2012 |   12 |           9835 |
| 2013 |    1 |             12 |
| 2012 |   12 |            220 |
| 2013 |    1 |           3545 |
| 2013 |    2 |             20 |
| 2013 |    1 |           2837 |
| 2013 |    2 |           4356 |
| 2013 |    3 |             14 |
| 2013 |    2 |           2777 |
| 2013 |    3 |           6273 |
| 2013 |    4 |           1173 |
+------+------+----------------+
21 rows in set (0.25 sec)
[7 Jul 2020 15:13] John Pauler
Thanks for your help. 

I see that you posted your results from running the query, which seems to confirm you are seeing the bug as well (the duplicated rows, rather than having the rows aggregated as they usually are when using GROUP BY). 

Any ideas why we are both experiencing this bug? And why it did not present until version 8.0.20?
[7 Jul 2020 16:22] MySQL Verification Team
Hi John,

I am not sure this is a bug, need to do some research first.

I'll update the bug report soon

Thanks
Bogdan
[7 Jul 2020 16:48] John Pauler
Thanks Bogdan. 

Just to over-communicate why I suspect there is a bug...

In your output, there are 3 identical groups created for yr = 2012 mon = 12. 

Because the query is only including yr and mon in GROUP BY, we would expect these 3 identical groups to be collapsed into 1 single row in the output, as they have identical values for all columns named in the GROUP BY. 

Only mentioning because I saw you mention this might not be a bug, and want to make sure you understand my concern with the output that is being created currently. 

Also note: if you were to run this in version 8.0.16, you would NOT see duplicate rows (in case this helps your investigation) 

Thanks again for the help!
[7 Jul 2020 16:50] John Pauler
Similarly, if you remove DISTINCT from the query, you will see the 3 rows collapse into 1 row in the output. 

Again, this is not expected that including DISTINCT inside a COUNT( ) would create additional rows in an output. 

(in case this helps)
[9 Jul 2020 11:25] MySQL Verification Team
Hi,

I'm verifying this as a bug. Took a while to go through the specification. Whatever way I look at it 

SELECT A,B,COUNT() GROUP BY A,B

cannot return different results depending on what's inside count() and also I cannot explain this

| 2012 |   11 |           7702 |
| 2012 |   11 |           6342 |

| 2012 |   12 |             16 |
| 2012 |   12 |           9835 |
| 2012 |   12 |            220 |

Thanks for the report
Kind regards
Bogdan
[9 Jul 2020 13:14] John Pauler
Thanks Bogdan. Appreciate your help looking into this!
[9 Jul 2020 13:53] MySQL Verification Team
Hi John,

Well, the changes related to aligning how "group by" in MySQL work with how SQL specification states it should work are extensive and while we all know how it worked in MySQL the standard is huge so takes a while. 99% of reported cases are actually not a bug but in this case, I think it is a genuine bug. We'll see when the dev team specialized for this start working on the case what they will say.

all best
Bogdan
[15 Jul 2020 14:01] Erlend Dahl
I have verified that the behaviour changes from 8.0.19 to 8.0.20. It stays the same on 8.0.21 (released this week).

However, on recent trunk (8.0.22), the server reverts back to the 8.0.19 behaviour.
[15 Jul 2020 14:17] Erlend Dahl
Posted by developer:
 
Fixed in 8.0.22 under the heading of

Bug#31168097 SIG 6 ASSERTION `0' IN TIME_ZONE_UTC::GET_NAME AT TZTIME.CC:1182
[16 Mar 2021 10:17] Patrick Rademaker
Although it does not have the same performance as referring to the YEAR(), MONTH(), etc. functions used in the columns, we can workaround this bug by not-using these date functions.

For example:

SELECT 
    YEAR(website_sessions.created_at) AS yr, 
    MONTH(website_sessions.created_at) AS mon,
    COUNT(DISTINCT website_session_id) AS session_volume 
FROM website_sessions
WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' 
GROUP BY DATE_FORMAT(website_sessions.created_at, '%Y-%m')

And for the use of timestamp fields we can use FROM_UNIXTIME() with it's second parameter as used in the DATE_FORMAT().
[16 Mar 2021 18:03] MySQL Verification Team
Hi Patrick,

please just upgrade to 8.0.23 and no workaround is needed

all best
Bogdan