MySQL Bugs: #22555: STDDEV yields positive result for groups with only one row

Bug #22555	STDDEV yields positive result for groups with only one row
Submitted:	21 Sep 2006 17:09	Modified:	4 May 2007 11:36
Reporter:	Luis Mochan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	5.0.26-BK, 5.0.24a-Debian_3-log	OS:	Linux (linux)
Assigned to:	Alexey Kopytov	CPU Architecture:	Any
Tags:	GROUP, STDDEV, VARIANCE

Description:
When I calculate the standard deviation of some expressions over groups of rows, some of which turn out to have only one member the result is sometimes positive instead of zero. 

I have found the error when calculating STD(exp) and exp involves quotients of smallints. I haven't found the error while calculating STD(column_name). Nevertheless, I don't believe this is a rounding error, as STD(exp) should be zero for only one datum, even if exp has roundoff errors. 

It seems that this is related to Bug #3693, which was declared a non-bug, but I do consider it a serious bug as it yields wrong results which are unrelated to roundoff.

How to repeat:
Consider the following example:

mysql> create temporary table ja (t1 smallint,t2 smallint,t3 smallint);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into ja values (1,53,78),(2,17,78),(3,18,76);
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select t1,count(*),std(t2/t3) from ja group by t1;
+------+----------+------------+
| t1   | count(*) | std(t2/t3) |
+------+----------+------------+
|    1 |        1 | 0.00000000 | 
|    2 |        1 | 0.00460760 | 
|    3 |        1 | 0.00446542 | 
+------+----------+------------+
3 rows in set (0.00 sec)

mysql> 

As can be seen, there is only one row for each value of t1. Nevertheless, the calculated standard deviation is zero only for the first group.

Thank you for the report.

What the result you expect to get?

The result for the population standard deviation with only one datum should have been exactly zero as all the data (the one data point) are exactly equal to the average, there can be no deviation at all.

I a bit changed your example to:

create temporary table ja (t1 double,t2 double,t3 double);
insert into ja values (1,53,78),(2,17,78),(3,18,76);

because std converts integer values to double and get different results on different machines:

(Windows)
mysql> select t1,count(*),std(t2/t3) from ja group by t1;
+------+----------+------------+
| t1   | count(*) | std(t2/t3) |
+------+----------+------------+
|    1 |        1 |          0 |
|    2 |        1 |          0 |
|    3 |        1 |          0 |
+------+----------+------------+
3 rows in set (0.01 sec)

(Linux)
mysql> select t1,count(*),std(t2/t3) from ja group by t1;
+------+----------+---------------------+
| t1   | count(*) | std(t2/t3)          |
+------+----------+---------------------+
|    1 |        1 | 4.9773506648913e-09 |
|    2 |        1 | 8.9609554959116e-10 |
|    3 |        1 |                   0 |
+------+----------+---------------------+
3 rows in set (0.04 sec)

(Red Hat Linux)
mysql> select t1,count(*),std(t2/t3) from ja group by t1;
+------+----------+---------------------+
| t1   | count(*) | std(t2/t3)          |
+------+----------+---------------------+
|    1 |        1 |                   0 |
|    2 |        1 | 8.9609554959116e-10 |
|    3 |        1 | 2.2225877144417e-09 |
+------+----------+---------------------+
3 rows in set (0.08 sec)

So it seems like rounding error too.

The bug might somehow be related to roundoff errors, but it is still a bug. Even if t2/t3 contains truncation errors, the sum:

sum(power(t2/t3,2))/count(*)-power(sum(t2/t3)/count(*),2)
should yield
power(t2/t3,2)-power(t2/t3,2)= 0
when there is only one data point, independently of the precision in the calculation of t2/t3.

On the other hand, the size of the errors in the example I submitted seem to be much larger than a simple roundoff error and makes me doubt on the quality of calculations with more than one data point. 

Doesn't your example with double floats confirm my opinion? How many decimal digits to a double? The error is too large. Doesn't double conform to standards in both windows and linux?

The change between windows and linux is due to a change in operating system (??) or to a change in hardware?

Sveta is right, it's a rounding error.
It's big in your example, because you use integer data types. Starting from 5.0, INT/INT is DECIMAL, not DOUBLE. And the precision of decimal division is defined by div_precision_increment server variable: http://dev.mysql.com/doc/refman/5.0/en/arithmetic-functions.html

mysql> select 5/7, @@div_precision_increment;
+--------+---------------------------+
| 5/7    | @@div_precision_increment |
+--------+---------------------------+
| 0.7143 |                         4 |
+--------+---------------------------+
mysql> set div_precision_increment=20;
Query OK, 0 rows affected (0.00 sec)
mysql> select 5/7, @@div_precision_increment;
+------------------------+---------------------------+
| 5/7                    | @@div_precision_increment |
+------------------------+---------------------------+
| 0.71428571428571428571 |                        20 |
+------------------------+---------------------------+
1 row in set (0.00 sec)
...
mysql> select t1,count(*),std(t2/t3) from ja group by t1;
+------+----------+----------------------------------+
| t1   | count(*) | std(t2/t3)                       |
+------+----------+----------------------------------+
|    1 |        1 | 0.000000000000000000000000000000 |
|    2 |        1 | 0.000000000000000000000000000000 |
|    3 |        1 | 0.000000000000000000000000000000 |
+------+----------+----------------------------------+
3 rows in set (0.00 sec)

Sergei, 

Thanks for the clarification about the size of the rounding errors and about the DECIMAL type (I was not aware). Nevertheless, I insist once more that it is a true bug. 

First, the double type should have given around 14-15 significant digits, so an error of 1.e-9 is unacceptably large for the simple operations involved in calculating the standard deviation of one data point.

Second, going back to my example with smallints, in the original example I inserted the values:

     INSERT INTO ja VALUES (1,53,78),(2,17,78),(3,18,76);

obtaining

t1	STD(t2/t3)
1	0.00000000
2	0.00460760
3	0.00446542

If this were only a roundoff error and not a bug, then 0 would be the correct result for the row (1,53,78) and 0.00460760 would be the correct approximate result for the second row (2,17,78). Nevertheless, if I eliminate the first and third rows by initializing my table through

     INSERT INTO ja VALUES (2,17,78);

I obtain

t1	STD(t2/t3)
2	0.00000000

the expected result (0.00000000) instead of the approximate result. I also get the exact result if I remove the first and second rows and leave only the third. Why would a roundoff error depend on the existence of other rows? 

I performed an additional test that may throw some light on this problem. Consider the following query:

SELECT t1, STD(t2/t3),AVG(POWER(t2/t3,2)),POWER(AVG(t2/t3),2) FROM ja GROUP BY t1;

If I run it over the database with three rows I get

t1	STD(t2/t3)	AVG(POWER(t2/t3,2))	POWER(AVG(t2/t3),2)
1	0.00000000	0.46170282642538	0.46172025
2	0.00460760	0.047501643241946	0.04748041
3	0.00446542	0.056094182700831	0.05607424

So it seems that two different precissions are being employed in different parts of the calculation.

Nevertheless, if I remove the first and third rows I get

t1	STD(t2/t3)	AVG(POWER(t2/t3,2))	POWER(AVG(t2/t3),2)
2	0.00000000	0.047501643241946	0.047501643241946

which shows that the precission of the fourth column is different if I have a different number of rows in the table! Isn't this suspicious?

If I get back to my three-rowed table and calculate STD from its definition as

SELECT t1, STD(t2/t3),SQRT(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) FROM ja GROUP BY t1;

I get

t1	STD(t2/t3)	SQRT(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2))
1	0.00000000	NULL
2	0.00460760	0.0046079542039908
3	0.00446542	0.0044657251181662

which shows similar (but not identical) discrepancies from the expected result (exact 0). I don't understand why is there a NULL in the first row, but anyway, it seems that problems arise from using different precissions in different parts of the calculation.

To finish, I looked at the example with double floats. If I

   CREATE TEMPORARY TABLE ja (t1 DOUBLE, t2 DOUBLE, t3 DOUBLE);
   INSERT INTO ja VALUES (1,53,78),(2,17,78),(3,18,76);
   SELECT t1, STD(t2/t3),(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) FROM ja GROUP BY t1;

I obtain

t1	STD(t2/t3)	(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2))
1	0	0
2	8.9609554959116e-10	0
3	2.2225877144417e-09	0

which shows that use of the formula yields the exact result, while the library function yields a unexpectedly large error.

I've got the similar results as in your last example on 5.0.26-BK on Linux. So, these errors of 1.e-9 in STDDEV really looks like a bug. Using DOUBLE(53,20) as type had not changes anything.

What /I/ find interesting about them is that exactly one row is always correct.  That makes me hypothesize that there's un-re-initialized memory being used between rows.

I'm tackling the problem of catastrophic cancellation first, and then looking for other problems when I'm finished.

We store one number in a double (64 bits), the other is cached in the FPU (80 bits), and when the two are subtracted, we get some tiny number left over.  When that tiny number is square-rooted (to make variance in to stddev), the number of digits between the decimal and the c-c error is (roughly) halved, thus making a small error a much bigger one.

I plan to look for some expression of the VARIANCE function that is not susceptible to catastrophic cancellation.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16540

ChangeSet@1.2299, 2006-12-06 13:47:13-05:00, cmiller@zippy.cornsilk.net +4 -0
  Bug#22555: STDDEV yields positive result for groups with only one row
  
  When only one row was present, the subtraction of nearly the same number 
  resulted in catastropic cancellation, introducing an error in the 
  VARIANCE calculation near 1e-15.  That was sqrt()ed to get STDDEV, the 
  error was escallated to near 1e-8.  
  
  The simple fix of testing for a row count of 1 and forcing that to yield 
  0.0 is insufficient, as two rows of the same value should also have a
  variance of 0.0, yet the error would be about the same.
  
  So, this patch changes the formula that computes the VARIANCE to be one
  that is not subject to catastrophic cancellation.
  
  In addition, it now uses only (faster-than-decimal) floating point numbers
  to calculate, and renders that to other types on demand.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/17338

ChangeSet@1.2299, 2006-12-22 15:37:37-05:00, cmiller@zippy.cornsilk.net +5 -0
  Bug#22555: STDDEV yields positive result for groups with only one row
  
  When only one row was present, the subtraction of nearly the same number 
  resulted in catastropic cancellation, introducing an error in the 
  VARIANCE calculation near 1e-15.  That was sqrt()ed to get STDDEV, the 
  error was escallated to near 1e-8.  
  
  The simple fix of testing for a row count of 1 and forcing that to yield 
  0.0 is insufficient, as two rows of the same value should also have a
  variance of 0.0, yet the error would be about the same.
  
  So, this patch changes the formula that computes the VARIANCE to be one
  that is not subject to catastrophic cancellation.
  
  In addition, it now uses only (faster-than-decimal) floating point numbers
  to calculate, and renders that to other types on demand.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/17394

ChangeSet@1.2360, 2006-12-26 12:42:54-07:00, tsmith@siva.hindu.god +2 -0
  In func_group.test, round the results of std() for some calls, because Windows' sqrt() function appears to return fewer "significant" digits than the Unix implementations.
  This is for bug #22555.

Available in 5.0.36, 5.1.15-beta.

Noted in 5.0.36, 5.1.15 changelogs.

It's still wrong ...

Errors are (from /opt/mysql-5.0.37/mysql-test/var/log/mysqltest-time) :
mysqltest: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/func_group.result 2007-03-14 17:38:13.000000000 +0300
--- r/func_group.reject 2007-03-14 19:03:06.834875440 +0300
***************
*** 1182,1190 ****
  3     4       0.00000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       0
! 2     4       0
! 3     4       0
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.00000000
--- 1182,1190 ----
  3     4       0.00000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       1.9708095987452e-17
! 2     4       1.9712988154832e-18
! 3     4       9.1019119466255e-18
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.00000000
***************
*** 1208,1216 ****
  3     4       0.000000000000000000000000000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       0
! 2     4       0
! 3     4       0
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.000000000000000000000000000000
--- 1208,1216 ----
  3     4       0.000000000000000000000000000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       1.9708095987452e-17
! 2     4       1.9712988154832e-18
! 3     4       9.1019119466255e-18
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.000000000000000000000000000000
***************
*** 1233,1241 ****
  3     4       0.000000000000000000000000000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       0
! 2     4       0
! 3     4       0
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.000000000000000000000000000000
--- 1233,1241 ----
  3     4       0.000000000000000000000000000000
  select i, count(*), std(o1/o2) from bug22555 group by i order by i;
  i     count(*)        std(o1/o2)
! 1     4       1.9708095987452e-17
! 2     4       1.9712988154832e-18
! 3     4       9.1019119466255e-18
  select i, count(*), std(e1/e2) from bug22555 group by i order by i;
  i     count(*)        std(e1/e2)
  1     4       0.000000000000000000000000000000
-------------------------------------------------------

Closing this bug since I cannot reproduce it on mysql-enterprise-gpl-5.0.38-linux-i686-glibc23 and up.

I reproduce the same error as Filip with MySQL 5.0.37 after switching from gcc 4.2.2 to gcc 4.3.1. The error reproduces for 32-bit builds but not for 64-bit buidls. And by 'same', I mean that my test diff has the same values:
1	4	1.9708095987452e-17
2	4	1.9712988154832e-18
3	4	9.1019119466255e-18

I compile with -O2, not -O3

But I cannot reproduce this in 5.0.75. Nor can I reproduce in 5.0.41. I will look for the enterprise 5.0.38 release next.