Bug #22555 | STDDEV yields positive result for groups with only one row | ||
---|---|---|---|
Submitted: | 21 Sep 2006 17:09 | Modified: | 4 May 2007 11:36 |
Reporter: | Luis Mochan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server | Severity: | S2 (Serious) |
Version: | 5.0.26-BK, 5.0.24a-Debian_3-log | OS: | Linux (linux) |
Assigned to: | Alexey Kopytov | CPU Architecture: | Any |
Tags: | GROUP, STDDEV, VARIANCE |
[21 Sep 2006 17:09]
Luis Mochan
[21 Sep 2006 20:56]
Sveta Smirnova
Thank you for the report. What the result you expect to get?
[22 Sep 2006 12:43]
Luis Mochan
The result for the population standard deviation with only one datum should have been exactly zero as all the data (the one data point) are exactly equal to the average, there can be no deviation at all.
[23 Sep 2006 13:34]
Sveta Smirnova
I a bit changed your example to: create temporary table ja (t1 double,t2 double,t3 double); insert into ja values (1,53,78),(2,17,78),(3,18,76); because std converts integer values to double and get different results on different machines: (Windows) mysql> select t1,count(*),std(t2/t3) from ja group by t1; +------+----------+------------+ | t1 | count(*) | std(t2/t3) | +------+----------+------------+ | 1 | 1 | 0 | | 2 | 1 | 0 | | 3 | 1 | 0 | +------+----------+------------+ 3 rows in set (0.01 sec) (Linux) mysql> select t1,count(*),std(t2/t3) from ja group by t1; +------+----------+---------------------+ | t1 | count(*) | std(t2/t3) | +------+----------+---------------------+ | 1 | 1 | 4.9773506648913e-09 | | 2 | 1 | 8.9609554959116e-10 | | 3 | 1 | 0 | +------+----------+---------------------+ 3 rows in set (0.04 sec) (Red Hat Linux) mysql> select t1,count(*),std(t2/t3) from ja group by t1; +------+----------+---------------------+ | t1 | count(*) | std(t2/t3) | +------+----------+---------------------+ | 1 | 1 | 0 | | 2 | 1 | 8.9609554959116e-10 | | 3 | 1 | 2.2225877144417e-09 | +------+----------+---------------------+ 3 rows in set (0.08 sec) So it seems like rounding error too.
[23 Sep 2006 17:18]
Luis Mochan
The bug might somehow be related to roundoff errors, but it is still a bug. Even if t2/t3 contains truncation errors, the sum: sum(power(t2/t3,2))/count(*)-power(sum(t2/t3)/count(*),2) should yield power(t2/t3,2)-power(t2/t3,2)= 0 when there is only one data point, independently of the precision in the calculation of t2/t3. On the other hand, the size of the errors in the example I submitted seem to be much larger than a simple roundoff error and makes me doubt on the quality of calculations with more than one data point. Doesn't your example with double floats confirm my opinion? How many decimal digits to a double? The error is too large. Doesn't double conform to standards in both windows and linux? The change between windows and linux is due to a change in operating system (??) or to a change in hardware?
[23 Sep 2006 17:34]
Sergei Golubchik
Sveta is right, it's a rounding error. It's big in your example, because you use integer data types. Starting from 5.0, INT/INT is DECIMAL, not DOUBLE. And the precision of decimal division is defined by div_precision_increment server variable: http://dev.mysql.com/doc/refman/5.0/en/arithmetic-functions.html mysql> select 5/7, @@div_precision_increment; +--------+---------------------------+ | 5/7 | @@div_precision_increment | +--------+---------------------------+ | 0.7143 | 4 | +--------+---------------------------+ mysql> set div_precision_increment=20; Query OK, 0 rows affected (0.00 sec) mysql> select 5/7, @@div_precision_increment; +------------------------+---------------------------+ | 5/7 | @@div_precision_increment | +------------------------+---------------------------+ | 0.71428571428571428571 | 20 | +------------------------+---------------------------+ 1 row in set (0.00 sec) ... mysql> select t1,count(*),std(t2/t3) from ja group by t1; +------+----------+----------------------------------+ | t1 | count(*) | std(t2/t3) | +------+----------+----------------------------------+ | 1 | 1 | 0.000000000000000000000000000000 | | 2 | 1 | 0.000000000000000000000000000000 | | 3 | 1 | 0.000000000000000000000000000000 | +------+----------+----------------------------------+ 3 rows in set (0.00 sec)
[26 Sep 2006 3:14]
Luis Mochan
Sergei, Thanks for the clarification about the size of the rounding errors and about the DECIMAL type (I was not aware). Nevertheless, I insist once more that it is a true bug. First, the double type should have given around 14-15 significant digits, so an error of 1.e-9 is unacceptably large for the simple operations involved in calculating the standard deviation of one data point. Second, going back to my example with smallints, in the original example I inserted the values: INSERT INTO ja VALUES (1,53,78),(2,17,78),(3,18,76); obtaining t1 STD(t2/t3) 1 0.00000000 2 0.00460760 3 0.00446542 If this were only a roundoff error and not a bug, then 0 would be the correct result for the row (1,53,78) and 0.00460760 would be the correct approximate result for the second row (2,17,78). Nevertheless, if I eliminate the first and third rows by initializing my table through INSERT INTO ja VALUES (2,17,78); I obtain t1 STD(t2/t3) 2 0.00000000 the expected result (0.00000000) instead of the approximate result. I also get the exact result if I remove the first and second rows and leave only the third. Why would a roundoff error depend on the existence of other rows? I performed an additional test that may throw some light on this problem. Consider the following query: SELECT t1, STD(t2/t3),AVG(POWER(t2/t3,2)),POWER(AVG(t2/t3),2) FROM ja GROUP BY t1; If I run it over the database with three rows I get t1 STD(t2/t3) AVG(POWER(t2/t3,2)) POWER(AVG(t2/t3),2) 1 0.00000000 0.46170282642538 0.46172025 2 0.00460760 0.047501643241946 0.04748041 3 0.00446542 0.056094182700831 0.05607424 So it seems that two different precissions are being employed in different parts of the calculation. Nevertheless, if I remove the first and third rows I get t1 STD(t2/t3) AVG(POWER(t2/t3,2)) POWER(AVG(t2/t3),2) 2 0.00000000 0.047501643241946 0.047501643241946 which shows that the precission of the fourth column is different if I have a different number of rows in the table! Isn't this suspicious? If I get back to my three-rowed table and calculate STD from its definition as SELECT t1, STD(t2/t3),SQRT(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) FROM ja GROUP BY t1; I get t1 STD(t2/t3) SQRT(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) 1 0.00000000 NULL 2 0.00460760 0.0046079542039908 3 0.00446542 0.0044657251181662 which shows similar (but not identical) discrepancies from the expected result (exact 0). I don't understand why is there a NULL in the first row, but anyway, it seems that problems arise from using different precissions in different parts of the calculation. To finish, I looked at the example with double floats. If I CREATE TEMPORARY TABLE ja (t1 DOUBLE, t2 DOUBLE, t3 DOUBLE); INSERT INTO ja VALUES (1,53,78),(2,17,78),(3,18,76); SELECT t1, STD(t2/t3),(AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) FROM ja GROUP BY t1; I obtain t1 STD(t2/t3) (AVG(POWER(t2/t3,2))-POWER(AVG(t2/t3),2)) 1 0 0 2 8.9609554959116e-10 0 3 2.2225877144417e-09 0 which shows that use of the formula yields the exact result, while the library function yields a unexpectedly large error.
[28 Sep 2006 11:11]
Valeriy Kravchuk
I've got the similar results as in your last example on 5.0.26-BK on Linux. So, these errors of 1.e-9 in STDDEV really looks like a bug. Using DOUBLE(53,20) as type had not changes anything.
[12 Nov 2006 3:22]
Chad MILLER
What /I/ find interesting about them is that exactly one row is always correct. That makes me hypothesize that there's un-re-initialized memory being used between rows.
[27 Nov 2006 22:48]
Chad MILLER
I'm tackling the problem of catastrophic cancellation first, and then looking for other problems when I'm finished. We store one number in a double (64 bits), the other is cached in the FPU (80 bits), and when the two are subtracted, we get some tiny number left over. When that tiny number is square-rooted (to make variance in to stddev), the number of digits between the decimal and the c-c error is (roughly) halved, thus making a small error a much bigger one. I plan to look for some expression of the VARIANCE function that is not susceptible to catastrophic cancellation.
[6 Dec 2006 18:47]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/16540 ChangeSet@1.2299, 2006-12-06 13:47:13-05:00, cmiller@zippy.cornsilk.net +4 -0 Bug#22555: STDDEV yields positive result for groups with only one row When only one row was present, the subtraction of nearly the same number resulted in catastropic cancellation, introducing an error in the VARIANCE calculation near 1e-15. That was sqrt()ed to get STDDEV, the error was escallated to near 1e-8. The simple fix of testing for a row count of 1 and forcing that to yield 0.0 is insufficient, as two rows of the same value should also have a variance of 0.0, yet the error would be about the same. So, this patch changes the formula that computes the VARIANCE to be one that is not subject to catastrophic cancellation. In addition, it now uses only (faster-than-decimal) floating point numbers to calculate, and renders that to other types on demand.
[22 Dec 2006 20:37]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/17338 ChangeSet@1.2299, 2006-12-22 15:37:37-05:00, cmiller@zippy.cornsilk.net +5 -0 Bug#22555: STDDEV yields positive result for groups with only one row When only one row was present, the subtraction of nearly the same number resulted in catastropic cancellation, introducing an error in the VARIANCE calculation near 1e-15. That was sqrt()ed to get STDDEV, the error was escallated to near 1e-8. The simple fix of testing for a row count of 1 and forcing that to yield 0.0 is insufficient, as two rows of the same value should also have a variance of 0.0, yet the error would be about the same. So, this patch changes the formula that computes the VARIANCE to be one that is not subject to catastrophic cancellation. In addition, it now uses only (faster-than-decimal) floating point numbers to calculate, and renders that to other types on demand.
[26 Dec 2006 19:43]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/17394 ChangeSet@1.2360, 2006-12-26 12:42:54-07:00, tsmith@siva.hindu.god +2 -0 In func_group.test, round the results of std() for some calls, because Windows' sqrt() function appears to return fewer "significant" digits than the Unix implementations. This is for bug #22555.
[31 Jan 2007 19:15]
Chad MILLER
Available in 5.0.36, 5.1.15-beta.
[31 Jan 2007 19:28]
Paul DuBois
Noted in 5.0.36, 5.1.15 changelogs.
[14 Mar 2007 16:46]
Filip Krejčí
It's still wrong ... Errors are (from /opt/mysql-5.0.37/mysql-test/var/log/mysqltest-time) : mysqltest: Result length mismatch (the last lines may be the most important ones) Below are the diffs between actual and expected results: ------------------------------------------------------- *** r/func_group.result 2007-03-14 17:38:13.000000000 +0300 --- r/func_group.reject 2007-03-14 19:03:06.834875440 +0300 *************** *** 1182,1190 **** 3 4 0.00000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 0 ! 2 4 0 ! 3 4 0 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.00000000 --- 1182,1190 ---- 3 4 0.00000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 1.9708095987452e-17 ! 2 4 1.9712988154832e-18 ! 3 4 9.1019119466255e-18 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.00000000 *************** *** 1208,1216 **** 3 4 0.000000000000000000000000000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 0 ! 2 4 0 ! 3 4 0 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.000000000000000000000000000000 --- 1208,1216 ---- 3 4 0.000000000000000000000000000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 1.9708095987452e-17 ! 2 4 1.9712988154832e-18 ! 3 4 9.1019119466255e-18 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.000000000000000000000000000000 *************** *** 1233,1241 **** 3 4 0.000000000000000000000000000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 0 ! 2 4 0 ! 3 4 0 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.000000000000000000000000000000 --- 1233,1241 ---- 3 4 0.000000000000000000000000000000 select i, count(*), std(o1/o2) from bug22555 group by i order by i; i count(*) std(o1/o2) ! 1 4 1.9708095987452e-17 ! 2 4 1.9712988154832e-18 ! 3 4 9.1019119466255e-18 select i, count(*), std(e1/e2) from bug22555 group by i order by i; i count(*) std(e1/e2) 1 4 0.000000000000000000000000000000 -------------------------------------------------------
[4 May 2007 11:36]
Alexey Kopytov
Closing this bug since I cannot reproduce it on mysql-enterprise-gpl-5.0.38-linux-i686-glibc23 and up.
[11 Mar 2009 20:22]
Mark Callaghan
I reproduce the same error as Filip with MySQL 5.0.37 after switching from gcc 4.2.2 to gcc 4.3.1. The error reproduces for 32-bit builds but not for 64-bit buidls. And by 'same', I mean that my test diff has the same values: 1 4 1.9708095987452e-17 2 4 1.9712988154832e-18 3 4 9.1019119466255e-18
[11 Mar 2009 20:25]
Mark Callaghan
I compile with -O2, not -O3
[11 Mar 2009 21:09]
Mark Callaghan
But I cannot reproduce this in 5.0.75. Nor can I reproduce in 5.0.41. I will look for the enterprise 5.0.38 release next.