MySQL Bugs: #79648: SUM(distinct ...) gives wrong result

Bug #79648	SUM(distinct ...) gives wrong result
Submitted:	15 Dec 2015 13:53	Modified:	16 Jan 2016 15:45
Reporter:	Øystein Grøvlen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S2 (Serious)
Version:	5.7.9	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
If number of distinct values are very large or variable max_heap_table_size is set low, SUM(distinct ...) may give wrong result.

Problem can be reproduced by running the disabled test main.sum_distinct-big.  Test was disabled since it used to fail when max_heap_table_size is 16k. However, in MySQL 5.7, a significant larger buffer is needed in order for the test to succeed.  While max_heap_table_size=22k was sufficient in 5.6, 512k is needed in 5.7. (It fails with 21k and 511k, respectively.)

I have run git bisect, and the offending commit is:

commit 341696fa105d8010457915fc1eab4d14a2ceeaef
Author: Tor Didriksen <tor.didriksen@oracle.com>
Date:   Wed Nov 5 11:01:23 2014 +0100

    Bug#19917028 TEMPLATE-BASED QUEUE/HEAP IN MERGE_WALK AND ROR_UNION

    Our QUEUE implementation should be replaced by a templatized C++ version.
    Templatized heaps are
     - type safe
     - easier to read/maintain
     - faster

See How-to-repeat for steps to reproduced.  Table has 2M rows with 16k unique values:

SELECT SUM(DISTINCT id) sm FROM t1;
sm
134225920
SET max_heap_table_size=511*1024;
SELECT SUM(DISTINCT id) sm FROM t1;
sm
134242273

I suspect this means that our Unique algorithm in 5.7 no longer works when the buffer overflows.  If so, for
larger tables, the default setting may also give wrong results.  Testing this on DBT3 scale factor 1 shows that something is fishy:

mysql> select sum(distinct l_orderkey) from lineitem;
+--------------------------+
| sum(distinct l_orderkey) |
+--------------------------+
|            4499987250000 |
+--------------------------+
1 row in set (5,27 sec)

mysql> select sum(distinct l_orderkey) from lineitem use index();
+--------------------------+
| sum(distinct l_orderkey) |
+--------------------------+
|            4499993541458 |
+--------------------------+
1 row in set (5,43 sec)

In MySQL 5.6 both queries give the same result.

How to repeat:
Inspired by sum_distinct-big.test:

CREATE TABLE t1 (id INTEGER);
CREATE TABLE t2 (id INTEGER);

INSERT INTO t1 (id) VALUES (1), (1), (1),(1);
INSERT INTO t1 (id) SELECT id FROM t1; /* 8 */
INSERT INTO t1 (id) SELECT id FROM t1; /* 16 */
INSERT INTO t1 (id) SELECT id FROM t1; /* 32 */
INSERT INTO t1 (id) SELECT id FROM t1; /* 64 */
INSERT INTO t1 (id) SELECT id FROM t1; /* 128 */
INSERT INTO t1 SELECT id+1 FROM t1;
INSERT INTO t1 SELECT id+2 FROM t1;
INSERT INTO t1 SELECT id+4 FROM t1;
INSERT INTO t1 SELECT id+8 FROM t1;
INSERT INTO t1 SELECT id+16 FROM t1;
INSERT INTO t1 SELECT id+32 FROM t1;
INSERT INTO t1 SELECT id+64 FROM t1;
INSERT INTO t1 SELECT id+128 FROM t1;
INSERT INTO t1 SELECT id+256 FROM t1;
INSERT INTO t1 SELECT id+512 FROM t1;
INSERT INTO t1 SELECT id+1024 FROM t1;
INSERT INTO t1 SELECT id+2048 FROM t1;
INSERT INTO t1 SELECT id+4096 FROM t1;
INSERT INTO t1 SELECT id+8192 FROM t1;

SELECT SUM(DISTINCT id) sm FROM t1;

SET max_heap_table_size=511*1024;

SELECT SUM(DISTINCT id) sm FROM t1;

Noted in 5.7.11, 5.8.0 changelogs.

Queries using SUM(DISTINCT) could produce incorrect results when
there were many distinct values.