Bug #82547 COUNT DISTINCT failure in 5.7
Submitted: 11 Aug 2016 16:58 Modified: 11 Aug 2016 21:57
Reporter: Matthew Boehm Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: General Severity:S2 (Serious)
Version:5.7.10 OS:Any
Assigned to: CPU Architecture:Any
Tags: aggregate, DML

[11 Aug 2016 16:58] Matthew Boehm
Description:
This is a regression failure in 5.7 vs 5.6. In 5.6, these queries work fine and return correct counts. In 5.7, counts are not correct.

Logically speaking, you cannot have MORE distinct values when comparing less information. 

Consider the values "Bear", "Bearly", "Bearlying"

There are 3 distinct values above when comparing the entire length. When only considering the LEFT(6), there are 2 distinct values: "Bear" and "Bearly".  When only considering the LEFT(4), there is only 1 distinct value: "Bear"

Example DML:
  SELECT COUNT(DISTINCT(title)), COUNT(DISTINCT LEFT(title, 100)) FROM title;

How to repeat:
Import the following table and data. There should be 1,543,719 rows after import.
 https://s3-us-west-1.amazonaws.com/matthewb-stuff/title.sql.gz

Run the following query

SELECT COUNT(DISTINCT(title)), COUNT(DISTINCT LEFT(title, 100)) FROM title;

Under MySQL 5.6, this is the result:
+------------------------+----------------------------------+
| COUNT(DISTINCT(title)) | COUNT(DISTINCT LEFT(title, 100)) |
+------------------------+----------------------------------+
|                 998335 |                           998320 |
+------------------------+----------------------------------+

Under MySQL 5.7, this is the result:
+------------------------+----------------------------------+
| COUNT(DISTINCT(title)) | COUNT(DISTINCT LEFT(title, 100)) |
+------------------------+----------------------------------+
|                 998335 |                          1214068 |
+------------------------+----------------------------------+

Further evidence that the issue is contained in the COUNT(DISTINCT):

mysql> pager wc -l
PAGER set to 'wc -l'
mysql> SELECT DISTINCT LEFT(title, 100) FROM title;
..
998320 rows in set (48.83 sec)

The above value of 998,320 is correct, logically speaking, as this value is less than the total number of distinct entries as explained above in the description. This is the same value produced in 5.6 and is the expected result of the DML in question.

The breakdown happens somewhere in between lengths of 7 & 8.

LEFT(7) and below return correct counts. LEFT(8) and above return incorrect counts.

Suggested fix:
Unknown.
[11 Aug 2016 16:59] Matthew Boehm
Possibly related bug: http://bugs.mysql.com/bug.php?id=80310
[11 Aug 2016 21:02] MySQL Verification Team
Thank you for the bug report. I couldn't repeat with most recent source:

C:\dbs>c:\dbs\5.6\bin\mysql -uroot --port=3560 -p --prompt="mysql 5.6 > "
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.33 Source distribution 2016-JUL-09

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.6 > use title
Database changed
mysql 5.6 > SELECT COUNT(DISTINCT(title)), COUNT(DISTINCT LEFT(title, 100)) FROM title;
+------------------------+----------------------------------+
| COUNT(DISTINCT(title)) | COUNT(DISTINCT LEFT(title, 100)) |
+------------------------+----------------------------------+
|                 998335 |                           998320 |
+------------------------+----------------------------------+
1 row in set (37.22 sec)
***************************************************************************************
C:\dbs>c:\dbs\5.7\bin\mysql -uroot -p --port=3570 --prompt="mysql 5.7 > "
Enter password: ******
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.15 Source distribution 2016-JUL-09

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.7 > use title
Database changed
mysql 5.7 > SELECT COUNT(DISTINCT(title)), COUNT(DISTINCT LEFT(title, 100)) FROM title;
+------------------------+----------------------------------+
| COUNT(DISTINCT(title)) | COUNT(DISTINCT LEFT(title, 100)) |
+------------------------+----------------------------------+
|                 998335 |                           998320 |
+------------------------+----------------------------------+
1 row in set (33.74 sec)
[11 Aug 2016 21:57] Matthew Boehm
I upgraded to 5.7.13 and do not see this issue any longer. Didn't realize I wasn't on the latest version of 5.7. Thanks.