Bug #57932 | query with AVG(DISTINCT) returns NULL if last aggregated value was NULL | ||
---|---|---|---|
Submitted: | 2 Nov 2010 15:11 | Modified: | 10 Jan 2011 3:26 |
Reporter: | SaiKumar V | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Optimizer | Severity: | S2 (Serious) |
Version: | 5.5, mysql-next-mr-opt-backporting | OS: | Linux (Ubuntu x86, RedHat x64) |
Assigned to: | Guilhem Bichot | CPU Architecture: | Any |
Tags: | regression |
[2 Nov 2010 15:11]
SaiKumar V
[2 Nov 2010 15:17]
SaiKumar V
Bug found in mysql-next-mr-opt-backporting revision id :- roy.lyseng@oracle.com-20101026104350-eawnur5qjrsp0342
[22 Nov 2010 21:21]
Guilhem Bichot
This must be fixed before jcl=4 becomes default.
[23 Nov 2010 14:44]
Guilhem Bichot
Happens in 5.5-bugteam, is not related to join buffering. This testcase below shows the same bug in 5.5-bugteam revision-id:guilhem@mysql.com-20101122091346-ihu8rkbil4n9s2um : the SELECT returns NULL. CREATE TABLE C (col_int_nokey int(11)); INSERT INTO C VALUES (7); INSERT INTO C VALUES (NULL); SELECT AVG(DISTINCT OUTR.col_int_nokey) FROM C AS OUTR LEFT JOIN C AS OUTR2 ON OUTR.col_int_nokey = OUTR2.col_int_nokey; # returns NULL! DROP TABLE C; The order of INSERTs is important: if inserting NULL then 7, bug doesn't happen. Note1: this is why in the testcases of the previous postings, the bug happened only with join buffering: join buffering changes order of rows produced by the left join, which are then fed to the AVG(DISTINCT) function: - without join buffering, NULL was sent first to the function; bug wasn't seen - with join buffering, non-NULL was sent first to the function, bug was seen. But the bug exists without join buffering too, just make sure to feed rows in the "nasty" order. Note2: why AVG should return 7 and not NULL: - because AVG(DISTINCT x) is expected to ignore NULL values of x: http://dev.mysql.com/doc/refman/5.5/en/group-by-functions.html says: "Unless otherwise stated, group functions ignore NULL values. " - I verified that Oracle returns 7. Note3: the revision which introduced this bug is: 2891 Georgi Kodinov 2009-09-28 revision-id:joro@sun.com-20090928072125-cbx5j0v2oe5d7av6 Ported WL#3220 to mysql-next-mr. (I verified by testing this revision and its parent). It's logical as WL#3220 is about aggregate functions.
[1 Dec 2010 8:34]
Guilhem Bichot
Simplest testcase: CREATE TABLE C (col_int_nokey int(11)); INSERT INTO C VALUES (7); INSERT INTO C VALUES (NULL); SELECT avg(distinct col_int_nokey) FROM C; returns NULL which is wrong.
[1 Dec 2010 10:35]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/125634 3125 Guilhem Bichot 2010-12-01 Fix for Bug#57932 "query with avg returns incorrect results": when there was one NULL value, AVG(DISTINCT) could forget about other values. First possible patch. @ mysql-test/r/func_group.result before the code fix, both SELECTs would return NULL @ sql/item_sum.cc Assume we are executing "SELECT AVG([DISTINCT] some_field) FROM some_table". and some_field is the single field of some_table for simplicity. Each time a row is processed (evaluate_join_record()-> end_send_group()->update_sum_func()) an aggregator is notified, which itself notifies an Item_sum_avg. Without DISTINCT, this Item_sum_avg immediately increments its internal "sum of values" and "count of values" (the latter being Item_sum_avg::count). The count is incremented only if the row's value is not NULL (in Item_sum_avg::add()), per AVG() semantices. This row's value is available in args[0] of Item_sum_avg ("args[0]" stands for "the first argument of the item": it's an Item_field which automatically receives the row's value when a row is read from the table). bool Item_sum_avg::add() { if (Item_sum_sum::add()) << calculates the sum (ignores NULL) return TRUE; if (!args[0]->null_value)<<if added value is not NULL count++; <<increment "count" return FALSE; } and everything works. With DISTINCT, when a row is processed by evaluate_join_record(), Item_sum_avg does no immediate computation, rather stores the row's value in a tree (to throw the value away if it is a duplicate of previous value, otherwise to remember all distinct values). It's only when it's time to send the average to the user (at end of the query: sub_select(end_of_records=true)->end_send_group()-> select_send->send_data()->Protocol::send_result_set_row()-> Item::send()->Item_sum_avg->val_str()), that we iterate over the tree, compute the sum and count: for this, for each element of the tree, Item_sum_avg::add() is called and has the same two steps as before: * Item_sum_sum::add() updates the sum (finding the tree element's value correctly, and determining correctly its NULLness - look for "arg_is_null" in that function) * the "if (!args[0]->null_value)" test right after, breaks: it uses args[0], which isn't the tree's element but rather the value for the last row processed by evaluate_join_record(). So if that last row was NULL, "count" stays 0 for each row, and AVG() then returns NULL (count==0 => NULL, per AVG() semantics). The fix is to use the correct NULLness in the if(), which is as determined by Item_sum_sum::add(). We do this by moving Item_sum_avg::count to Item_sum_sum, and letting Item_sum_sum::add() use its local "arg_is_null" to correctly increment "count".
[1 Dec 2010 10:38]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/125635 3125 Guilhem Bichot 2010-12-01 Fix for Bug#57932 "query with avg returns incorrect results": when there was one NULL value, AVG(DISTINCT) could forget about other values. Second possible patch. @ mysql-test/r/func_group.result before the code fix, both SELECTs would return NULL @ sql/item_sum.cc This time the fix is to add an "out parameter" to Item_sum_sum::add(), where it can mention whether it found a NULL value; if so, Item_sum_avg() knows it shouldn't increment "count". In detail, Item_sum_sum::add() being virtual, we don't add a new parameter to it (would change too many signatures), we rather define Item_sum_sum::add2, which is an extension of add() (has the "out parameter") and is used solely by Item_sum_avg::add().
[1 Dec 2010 10:46]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/125636 3125 Guilhem Bichot 2010-12-01 Fix for Bug#57932 "query with avg returns incorrect results": when there was one NULL value, AVG(DISTINCT) could forget about other values. Third possible patch. @ mysql-test/r/func_group.result before the code fix, both SELECTs would return NULL @ sql/item_sum.cc This time the fix is to let Item_sum_sum::add() store in a member variable whether the value it just saw was NULL. Item_sum_avg() can test it and know it shouldn't increment "count".
[1 Dec 2010 12:38]
Georgi Kodinov
Guilhem, Why not add this as members of the aggregator ? e.g. Aggregator_distinct::arg_val_decimal(my_decimal * value) { if (use_distinct_values) return table->field[0]->val_decimal (value); else return item_sum->args[0]->val_decimal(value); } Aggregator_distinct::arg_val_real() { if (use_distinct_values) return table->field[0]->val_real (); else return item_sum->args[0]->val_real(); } Aggregator_distinct::arg_is_null() { if (use_distinct_values) return table->field[0]->is_null() else return item_sum->args[0]->null_value; } And add the corresponding members in Aggregator and Aggregator_simple(). Maybe we can even make use_distinct_values private ? In this way we can then remove some if()s from Item_sum_sum
[2 Dec 2010 13:38]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/125807 3125 Guilhem Bichot 2010-12-02 Fix for Bug#57932 "query with avg returns incorrect results": when there was one NULL value, AVG(DISTINCT) could forget about other values. Fourth possible patch suggested by Joro. @ mysql-test/r/func_group.result before the code fix, both SELECTs would return NULL @ sql/item_sum.cc This time the fix is to let the aggregator tell whether the value it just saw was NULL. The aggregator knows where to get the info thanks to virtual functions. Item_sum_sum::add() now asks the aggregator. Item_sum_avg() also asks the aggregator and then knows it shouldn't increment "count".
[7 Dec 2010 15:59]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/126232 3174 Guilhem Bichot 2010-12-07 Fix for Bug#57932 "query with avg returns incorrect results": when there was one NULL value, AVG(DISTINCT) could forget about other values. See commit comment of item_sum.cc. @ mysql-test/r/func_group.result before the code fix, both SELECTs would return NULL @ sql/item_sum.cc Assume we are executing "SELECT AVG([DISTINCT] some_field) FROM some_table". and some_field is the single field of some_table for simplicity. Each time a row is processed (evaluate_join_record()-> end_send_group()->update_sum_func()) an aggregator is notified, which itself notifies an Item_sum_avg. Without DISTINCT, this Item_sum_avg immediately increments its internal "sum of values" and "count of values" (the latter being Item_sum_avg::count). The count is incremented only if the row's value is not NULL (in Item_sum_avg::add()), per AVG() semantices. This row's value is available in args[0] of Item_sum_avg ("args[0]" stands for "the first argument of the item": it's an Item_field which automatically receives the row's value when a row is read from the table). bool Item_sum_avg::add() { if (Item_sum_sum::add()) << calculates the sum (ignores NULL) return TRUE; if (!args[0]->null_value)<<if added value is not NULL count++; <<increment "count" return FALSE; } and everything works. With DISTINCT, when a row is processed by evaluate_join_record(), Item_sum_avg does no immediate computation, rather stores the row's value in a tree (to throw the value away if it is a duplicate of previous value, otherwise to remember all distinct values). It's only when it's time to send the average to the user (at end of the query: sub_select(end_of_records=true)->end_send_group()-> select_send->send_data()->Protocol::send_result_set_row()-> Item::send()->Item_sum_avg->val_str()), that we iterate over the tree, compute the sum and count: for this, for each element of the tree, Item_sum_avg::add() is called and has the same two steps as before: * Item_sum_sum::add() updates the sum (finding the tree element's value correctly, and determining correctly its NULLness - look for "arg_is_null" in that function) * the "if (!args[0]->null_value)" test right after, breaks: it uses args[0], which isn't the tree's element but rather the value for the last row processed by evaluate_join_record(). So if that last row was NULL, "count" stays 0 for each row, and AVG() then returns NULL (count==0 => NULL, per AVG() semantics). The fix is to let the aggregator tell whether the value it just saw was NULL. The aggregator knows where to get the info thanks to virtual functions. Item_sum_sum::add() now asks the aggregator. Item_sum_avg() also asks the aggregator and then knows it shouldn't increment "count". @ sql/item_sum.h Aggregator can now tell about value/NULLness of just-aggregated value
[7 Dec 2010 16:21]
Guilhem Bichot
queued to 5.5-bugteam and trunk-bugfixing
[17 Dec 2010 12:53]
Bugs System
Pushed into mysql-5.5 5.5.9 (revid:georgi.kodinov@oracle.com-20101217124733-p1ivu6higouawv8l) (version source revid:guilhem@mysql.com-20101207155932-jq0lsppj7urfcupq) (merge vers: 5.5.8) (pib:24)
[17 Dec 2010 12:57]
Bugs System
Pushed into mysql-trunk 5.6.1 (revid:georgi.kodinov@oracle.com-20101217125013-y8pb3az32rtbplc9) (version source revid:anitha.gopi@sun.com-20101210041312-50t9adyhwwybsm6x) (merge vers: 5.6.1) (pib:24)
[10 Jan 2011 3:26]
Paul DuBois
Noted in 5.5.9 changelog. If the set of values aggregated with AVG(DISTINCT) contained a NULL value, the function result could be incorrect.