Bug #101108 incorrect derived_merge optimization
Submitted: 9 Oct 2020 18:11 Modified: 19 Oct 2021 12:01
Reporter: Александр Ммммммм Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DML Severity:S2 (Serious)
Version:8.0.27 OS:Any
Assigned to: CPU Architecture:Any

[9 Oct 2020 18:11] Александр Ммммммм
Description:
derived_merge optimization applying for not deterministic functions and it gives an unpredictable, erroneous result

How to repeat:
set session optimizer_switch = 'derived_merge=on';
SELECT *
FROM (
    SELECT n
    , IF(RAND() < 0.5, RAND() * 10000, NULL) res
    FROM (
        SELECT 1 n
        UNION ALL SELECT 2
        UNION ALL SELECT 3
        UNION ALL SELECT 4
        UNION ALL SELECT 5
    ) t
) t
WHERE res IS NOT NULL

+---+-------------------+
| n | res               |
+---+-------------------+
| 1 |              NULL |
| 2 |              NULL |
| 4 |              NULL |
| 5 | 2772.243267644423 |
+---+-------------------+
4 rows in set (0.00 sec)

all SQL standarts say, if I make condition WHERE res IS NOT NULL, I must not receive NULL values

this comes from derived_merge optimization

-> Filter: (if((rand() < 0.5),(rand() * 10000),NULL) is not null)  (actual time=0.026..0.027 rows=3 loops=1)
    -> Table scan on t  (actual time=0.001..0.001 rows=5 loops=1)
        -> Union materialize  (actual time=0.019..0.020 rows=5 loops=1)
            -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
            -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
            -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
            -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
            -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)

filtered other result than selected!

if derived_merge optimization disabled, result will be expected

set session optimizer_switch = 'derived_merge=off'
;
SELECT *
FROM (
    SELECT n
    , IF(RAND() < 0.5, RAND() * 10000, NULL) res
    FROM (
        SELECT 1 n
        UNION ALL SELECT 2
        UNION ALL SELECT 3
        UNION ALL SELECT 4
        UNION ALL SELECT 5
    ) t
) t
WHERE res IS NOT NULL

+---+--------------------+
| n | res                |
+---+--------------------+
| 3 |  7635.638404298237 |
| 4 |  7536.478478029816 |
| 5 | 4472.3482099104185 |
+---+--------------------+
3 rows in set (0.00 sec)

-> Filter: (t.res is not null)  (actual time=0.024..0.025 rows=3 loops=1)
    -> Table scan on t  (actual time=0.000..0.000 rows=5 loops=1)
        -> Materialize  (actual time=0.023..0.023 rows=5 loops=1)
            -> Table scan on t  (actual time=0.001..0.001 rows=5 loops=1)
                -> Union materialize  (actual time=0.006..0.007 rows=5 loops=1)
                    -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
                    -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
                    -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
                    -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)
                    -> Rows fetched before execution  (actual time=0.000..0.000 rows=1 loops=1)

Suggested fix:
Don't use derived_merge optimization for non-deterministic expressions
[9 Oct 2020 20:15] MySQL Verification Team
Thank you for the bug report.
[19 Oct 2021 12:01] Александр Ммммммм
https://github.com/mysql/mysql-server/blob/da7a1870abe131923ed003e0e9747d2de75a41e5/sql/it...

bool Item_field::check_column_from_derived_table(uchar *arg) {
  TABLE_LIST *tl = pointer_cast<TABLE_LIST *>(arg);
  if (field->table == tl->table) {
    // If the expression in the derived table for this column has a subquery
    // or contains parameters or has non-deterministic result, condition is
    // not pushed down.
    // Expressions having subqueries need a more complicated replacement
    // strategy than the one that currently exists when the condition is
    // moved to derived table.
    // Expression having parameters when cloned as part of replacement have
    // problems to locate the original "?" and therefore will not be able to
    // get the value.  TODO: Lift these two limitations.
    // Any condition with expressions having non-deterministic result in the
    // underlying derived table should not be pushed.
    // For ex:
    // select * from (select rand() as a from t1) where a >0.5;
    // Here a > 0.5 if pushed down would result in rand() getting evaluated
    // twice because the query would then be
    // select * from (select rand() as a from t1 where rand() > 0.5) which
    // is not correct. See also Item_func::check_column_from_derived_table
    Item *item = tl->get_derived_expr(field->field_index());
    return (item->has_subquery() ||
            (item->used_tables() & (INNER_TABLE_BIT | RAND_TABLE_BIT)));
  }
  return true;
}