Bug #100107 | MySQL 8.0.20 Bug? Groups duplicated when using GROUP BY and COUNT(DISTINCT ) | ||
---|---|---|---|
Submitted: | 3 Jul 2020 16:57 | Modified: | 15 Jul 2020 14:17 |
Reporter: | John Pauler | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: DML | Severity: | S3 (Non-critical) |
Version: | 8.0.20 | OS: | MacOS |
Assigned to: | CPU Architecture: | Any |
[3 Jul 2020 16:57]
John Pauler
[3 Jul 2020 17:04]
MySQL Verification Team
Thank you for the bug report. Please attach the complete test case, create table, insert data, queries statements with actual result and expected result. Thanks.
[3 Jul 2020 17:36]
John Pauler
I added file mysql-bug-data-100107 via SFTP site to /support/incoming. Test case is to run the query below, using MySQL version 8.0.20: SELECT YEAR(website_sessions.created_at) AS yr, MONTH(website_sessions.created_at) AS mon, COUNT(DISTINCT website_session_id) AS session_volume FROM website_sessions WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' GROUP BY 1,2; Expected result is a normal GROUP BY output, where the columns specified in the GROUP BY do not get duplicated. But instead, you will likely see the following duped results, with the same values in the GROUP BY columns repeating unexpectedly: yr, mon, session_volume 2012, 4, 3734 2012, 5, 3736 2012, 6, 3963 2012, 7, 4249 2012, 8, 6097 2012, 9, 6546 2012, 10, 8183 2012, 11, 13979 2012, 12, 9827 2012, 11, 32 2012, 12, 208 2013, 1, 3545 2012, 12, 37 2013, 1, 2820 2013, 2, 4350 2013, 1, 36 2013, 2, 2818 2013, 3, 6264 2013, 4, 1209 Let me know if I can provide any more information. Thank you for your help!
[3 Jul 2020 17:38]
John Pauler
This is the expected result from the query (although this is not what we see): yr, mon, session_volume 2012, 4, 3734 2012, 5, 3736 2012, 6, 3963 2012, 7, 4249 2012, 8, 6097 2012, 9, 6546 2012, 10, 8183 2012, 11, 14011 2012, 12, 10072 2013, 1, 6401 2013, 2, 7168 2013, 3, 6264 2013, 4, 1209
[7 Jul 2020 0:32]
MySQL Verification Team
Hi, File mysql-bug-data-100107 nor .zip nor .tar.gz nor .tgz does not exist on our sftp, are you sure you uploaded it? Having a script that creates the table and insert the data would be incredibly useful. If you compress the script it should be below 3MB so you can upload directly to the bug. With regards to the reported problem, what is the SQL_MODE on the server where you see the results you like (8.0.16) and on the server where you do not see results you like (8.0.20)? If you align the sql_mode values between the servers, do you get the same results? Kind regards Bogdan p.s. I doubt this is a bug, but before I get your test case I can't be 100% sure. Feel free to look at: https://dev.mysql.com/doc/refman/8.0/en/sql-mode.html https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html > ONLY_FULL_GROUP_BY also affects handling of queries that use DISTINCT and ORDER BY > To order the result, duplicates must be eliminated first. But to do so, > should we keep the first row or the third? This arbitrary choice influences > the retained value of c3, which in turn influences ordering and makes it > arbitrary as well. To prevent this problem, a query that has DISTINCT and > ORDER BY is rejected as invalid if any ORDER BY expression does not satisfy > at least one of these conditions: > > The expression is equal to one in the select list > > All columns referenced by the expression and belonging to the query's > selected tables are elements of the select list
[7 Jul 2020 12:00]
MySQL Verification Team
Hi, Found the file :) (it was .sql so mysql-bug-data-100107.sql ). For start, default 8.0.20 ERROR 1292 (22007) at line 35: Incorrect datetime value: '2012-03-25 02:23:06' for column 'created_at' at row 898 mysql [localhost:8020] {msandbox} ((none)) > select @@sql_mode; +-----------------------------------------------------------------------------------------------------------------------+ | @@sql_mode | +-----------------------------------------------------------------------------------------------------------------------+ | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION | +-----------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) clearing sql_mode to be "", imported your data. Lemme now see what do we have here :) all best Bogdan
[7 Jul 2020 12:03]
MySQL Verification Team
mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > SELECT -> YEAR(website_sessions.created_at) AS yr, -> MONTH(website_sessions.created_at) AS mon, -> COUNT(DISTINCT website_session_id) AS session_volume -> FROM website_sessions -> WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' -> GROUP BY 1,2; +------+------+----------------+ | yr | mon | session_volume | +------+------+----------------+ | 2012 | 4 | 3734 | | 2012 | 5 | 3736 | | 2012 | 6 | 3963 | | 2012 | 7 | 4249 | | 2012 | 8 | 6097 | | 2012 | 9 | 6546 | | 2012 | 10 | 8183 | | 2012 | 11 | 13979 | | 2012 | 12 | 9827 | | 2012 | 11 | 32 | | 2012 | 12 | 208 | | 2013 | 1 | 3545 | | 2012 | 12 | 37 | | 2013 | 1 | 2820 | | 2013 | 2 | 4350 | | 2013 | 1 | 36 | | 2013 | 2 | 2818 | | 2013 | 3 | 6264 | | 2013 | 4 | 1209 | +------+------+----------------+ 19 rows in set (0.24 sec) mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > select @@sql_mode; +------------+ | @@sql_mode | +------------+ | | +------------+ 1 row in set (0.00 sec)
[7 Jul 2020 12:04]
MySQL Verification Team
mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > select @@sql_mode; +-----------------------------------------------------------------------------------------------------------------------+ | @@sql_mode | +-----------------------------------------------------------------------------------------------------------------------+ | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION | +-----------------------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql [localhost:8020] {msandbox} (mavenfuzzyfactory) > SELECT -> YEAR(website_sessions.created_at) AS yr, -> MONTH(website_sessions.created_at) AS mon, -> COUNT(DISTINCT website_session_id) AS session_volume -> FROM website_sessions -> WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' -> GROUP BY 1,2; +------+------+----------------+ | yr | mon | session_volume | +------+------+----------------+ | 2012 | 4 | 3714 | | 2012 | 5 | 3728 | | 2012 | 6 | 3990 | | 2012 | 7 | 4227 | | 2012 | 8 | 6089 | | 2012 | 9 | 6541 | | 2012 | 10 | 8158 | | 2012 | 11 | 7702 | | 2012 | 12 | 16 | | 2012 | 11 | 6342 | | 2012 | 12 | 9835 | | 2013 | 1 | 12 | | 2012 | 12 | 220 | | 2013 | 1 | 3545 | | 2013 | 2 | 20 | | 2013 | 1 | 2837 | | 2013 | 2 | 4356 | | 2013 | 3 | 14 | | 2013 | 2 | 2777 | | 2013 | 3 | 6273 | | 2013 | 4 | 1173 | +------+------+----------------+ 21 rows in set (0.25 sec)
[7 Jul 2020 15:13]
John Pauler
Thanks for your help. I see that you posted your results from running the query, which seems to confirm you are seeing the bug as well (the duplicated rows, rather than having the rows aggregated as they usually are when using GROUP BY). Any ideas why we are both experiencing this bug? And why it did not present until version 8.0.20?
[7 Jul 2020 16:22]
MySQL Verification Team
Hi John, I am not sure this is a bug, need to do some research first. I'll update the bug report soon Thanks Bogdan
[7 Jul 2020 16:48]
John Pauler
Thanks Bogdan. Just to over-communicate why I suspect there is a bug... In your output, there are 3 identical groups created for yr = 2012 mon = 12. Because the query is only including yr and mon in GROUP BY, we would expect these 3 identical groups to be collapsed into 1 single row in the output, as they have identical values for all columns named in the GROUP BY. Only mentioning because I saw you mention this might not be a bug, and want to make sure you understand my concern with the output that is being created currently. Also note: if you were to run this in version 8.0.16, you would NOT see duplicate rows (in case this helps your investigation) Thanks again for the help!
[7 Jul 2020 16:50]
John Pauler
Similarly, if you remove DISTINCT from the query, you will see the 3 rows collapse into 1 row in the output. Again, this is not expected that including DISTINCT inside a COUNT( ) would create additional rows in an output. (in case this helps)
[9 Jul 2020 11:25]
MySQL Verification Team
Hi, I'm verifying this as a bug. Took a while to go through the specification. Whatever way I look at it SELECT A,B,COUNT() GROUP BY A,B cannot return different results depending on what's inside count() and also I cannot explain this | 2012 | 11 | 7702 | | 2012 | 11 | 6342 | | 2012 | 12 | 16 | | 2012 | 12 | 9835 | | 2012 | 12 | 220 | Thanks for the report Kind regards Bogdan
[9 Jul 2020 13:14]
John Pauler
Thanks Bogdan. Appreciate your help looking into this!
[9 Jul 2020 13:53]
MySQL Verification Team
Hi John, Well, the changes related to aligning how "group by" in MySQL work with how SQL specification states it should work are extensive and while we all know how it worked in MySQL the standard is huge so takes a while. 99% of reported cases are actually not a bug but in this case, I think it is a genuine bug. We'll see when the dev team specialized for this start working on the case what they will say. all best Bogdan
[15 Jul 2020 14:01]
Erlend Dahl
I have verified that the behaviour changes from 8.0.19 to 8.0.20. It stays the same on 8.0.21 (released this week). However, on recent trunk (8.0.22), the server reverts back to the 8.0.19 behaviour.
[15 Jul 2020 14:17]
Erlend Dahl
Posted by developer: Fixed in 8.0.22 under the heading of Bug#31168097 SIG 6 ASSERTION `0' IN TIME_ZONE_UTC::GET_NAME AT TZTIME.CC:1182
[16 Mar 2021 10:17]
Patrick Rademaker
Although it does not have the same performance as referring to the YEAR(), MONTH(), etc. functions used in the columns, we can workaround this bug by not-using these date functions. For example: SELECT YEAR(website_sessions.created_at) AS yr, MONTH(website_sessions.created_at) AS mon, COUNT(DISTINCT website_session_id) AS session_volume FROM website_sessions WHERE website_sessions.created_at BETWEEN '2012-04-01' AND '2013-04-05' GROUP BY DATE_FORMAT(website_sessions.created_at, '%Y-%m') And for the use of timestamp fields we can use FROM_UNIXTIME() with it's second parameter as used in the DATE_FORMAT().
[16 Mar 2021 18:03]
MySQL Verification Team
Hi Patrick, please just upgrade to 8.0.23 and no workaround is needed all best Bogdan