MySQL Bugs: #105836: KeyFields are optimzied away with no collation in Item::eq_by

Bug #105836	KeyFields are optimzied away with no collation in Item::eq_by_collation
Submitted:	8 Dec 2021 13:40	Modified:	16 Dec 2021 6:36
Reporter:	Shanshan Ying	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Item::eq_by_collation throws away collation on comparing items and reports equal items incorrelty, whichleading to incorret plan.

How to repeat:
CREATE TABLE `t1` (
`id` int unsigned NOT NULL AUTO_INCREMENT ,
`name` varchar(100) CHARACTER SET utf8 ,
PRIMARY KEY (`id`),
KEY `idx_name` (`name`) USING BTREE
) ENGINE=InnoDB;

insert into t1 values ('()');

select name from t1 
where 
name = replace(replace('()', '（', '('), '）', ')')  -- cond a
or 
name = replace(replace('()', '(', '（'), ')', '）') -- cond b
; /* returns one row */

select name from t1 
where 
name = replace(replace('()', '(', '（'), ')', '）') -- cond b
or
name = replace(replace('()', '（', '('), '）', ')')  -- cond a
; /* empty set */

Suggested fix:
Check function `Item::eq_by_collation` in file sql_optimizer.cc

function `Item::eq_by_collation` only updates the top item's collation.
it should either update collation for all items beheath this item or compare with collation argument.

Hi Mr. Ying,

Thank you very much for your bug report.

However, your report is insufficient and quite unclear as to what are you reporting at all. Simply, you do not explain what is it that is wrong with a query and why is it wrong.

Beside that, we do not see what collation are you using !!!! 

Next, what you are trying to say with "cond a" or "cond b" ...... that is totally unclear. We need clear and obvious  bug reports, not the ones that are highly cryptic.

Next, you do not seem to the coercibility  levels, which are fully explained in our Reference Manual.

We are waiting on your feedback.

Small correction ......

That sentence should have been like this:

"
Next, you do not seem to have considered at all the  impact of the coercibility  levels, which are fully
explained in our Reference Manual.

"

Thanks for your feedback, and let me describe more details about the case.

During the optimization stage, some key fields are optimized way wrongly. Because those items are compared WITHOUT charset.

for instance in mysql 8.0.26, 

step 1. we create a table `t1` using charset UTF8 
----------------------------------------------------------------------

mysql [localhost:8026] {msandbox} (test) > show variables like 'character\_set\_%';
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8mb3 |
+--------------------------+---------+
7 rows in set (0.00 sec)

CREATE TABLE `t1` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_name` (`name`) USING BTREE
) ENGINE=InnoDB
;

insert into t1(name)  values ('()');

----------------------------------------------------------------------

step 2. given two expressions as following
a) replace(replace('()', "（", "("), "）", ")")  -- does nothing
b) replace(replace('()', "(", "（"), ")", "）")  -- replaces brackets with 
Chinese Punctuation, i.e. replace %28 and %29 with %uFF08 and %uFF09

----------------------------------------------------------------------
step 3. execute a query with disjunct expressions, got one row

mysql [localhost:8026] {msandbox} (test) > select * from t1 where
    -> (name= replace(replace('()', "（", "("), "）", ")"))  -- expr a
    -> or
    -> (name= replace( replace('()', "(", "（"), ")", "）")) -- expr b
    -> ;
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

----------------------------------------------------------------------
step 4. swap the order of expr a) and expr b) in the query, and we got an empty set

mysql [localhost:8026] {msandbox} (test) > select * from t1 where
    -> (name= replace( replace('()', "(", "（"), ")", "）")) -- expr b
    -> or
    -> (name= replace(replace('()', "（", "("), "）", ")"))  -- expr a
    -> ;
Empty set (0.00 sec)

----------------------------------------------------------------------
step 5. veriy the plan
turnning on the optimizer trace and we found one key difference between these two queries is in `ref_optimizer_key_uses` field.

During optimization stage, expr a) and expr b) are considered as equal and the one occurs later in the query will be optimzied way. 

However, expr a) and expr b) are equal under UTF8mb4 (default charset), but not under UTF8 (charset of `name`).

Check the code, and we find while merging key fields, the function `Item::eq_by_collation` does not set collation for all items beheath current one. It should pass the collation info all the way down to the leaf node.

how to reproduce the case

Attachment: 105836_reproduce.sql (application/octet-stream, text), 622 bytes.

Hi Mr. Ying,

Thank you for your bug report.

We tried to repeat your report, but we failed:

name
()
name
()

We just replaced utf8 with utf8mb4, which is the UTF character set that we now support.

Not a bug.

Hi,

It was confirmed that your case requires utfmb4 character set.

Hence, not a bug .....

The root cause is when the default set is different from the column definition, mysql should choose only ONE collation for comparison, either using the default charset or the one defined in column, but it should be consistent throughtout the whole procedure.

While merging key fields, mysql will compre const items to remove duplicated ones. Given function `f1(f2(f3...))`, only `f1` compares using `item::eq_by_collation` with column collation, `f2` and `f3` compares using `item::eq` with system default one. 

Is such inconsistency acceptable? 

Recalling the test case given eariler, if we use string constants instead of `replace` functions, it works exactly as expected.
----------------------------------------------------
mysql [localhost:8027] {msandbox} (test) > select * from t1 where name = '()' or name = '（）';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

mysql [localhost:8027] {msandbox} (test) > select * from t1 where  name = '（）' or name = '()';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.01 sec)

----------------------------------------------------

These two queries give the same anwser.

The root cause is when the default set is different from the column definition, mysql should choose only ONE collation for comparison, either using the default charset or the one defined in column, but it should be consistent throughtout the whole procedure.

While merging key fields, mysql will compre const items to remove duplicated ones. Given function `f1(f2(f3...))`, only `f1` compares using `item::eq_by_collation` with column collation, `f2` and `f3` compares using `item::eq` with system default one. 

Is such inconsistency acceptable? 

Recalling the test case given eariler, if we use string constants instead of `replace` functions, it works exactly as expected.
----------------------------------------------------
mysql [localhost:8027] {msandbox} (test) > select * from t1 where name = '()' or name = '（）';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

mysql [localhost:8027] {msandbox} (test) > select * from t1 where  name = '（）' or name = '()';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.01 sec)

----------------------------------------------------

These two queries give the same anwser.

The root cause is when the default set is different from the column definition, mysql should choose only ONE collation for comparison, either using the default charset or the one defined in column, but it should be consistent throughtout the whole procedure.

While merging key fields, mysql will compre const items to remove duplicated ones. Given function `f1(f2(f3...))`, only `f1` compares using `item::eq_by_collation` with column collation, `f2` and `f3` compares using `item::eq` with system default one. 

Is such inconsistency acceptable? 

Recalling the test case given eariler, if we use string constants instead of `replace` functions, it works exactly as expected.
----------------------------------------------------
mysql [localhost:8027] {msandbox} (test) > select * from t1 where name = '()' or name = '（）';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

mysql [localhost:8027] {msandbox} (test) > select * from t1 where  name = '（）' or name = '()';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.01 sec)

----------------------------------------------------

These two queries give the same anwser.

The root cause is when the default set is different from the column definition, mysql should choose only ONE collation for comparison, either using the default charset or the one defined in column, but it should be consistent throughtout the whole procedure.

While merging key fields, mysql will compre const items to remove duplicated ones. Given function `f1(f2(f3...))`, only `f1` compares using `item::eq_by_collation` with column collation, `f2` and `f3` compares using `item::eq` with system default one. 

Is such inconsistency acceptable? 

Recalling the test case given eariler, if we use string constants instead of `replace` functions, it works exactly as expected.
----------------------------------------------------
mysql [localhost:8027] {msandbox} (test) > select * from t1 where name = '()' or name = '（）';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

mysql [localhost:8027] {msandbox} (test) > select * from t1 where  name = '（）' or name = '()';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.01 sec)

----------------------------------------------------

These two queries give the same anwser.

The root cause is when the default set is different from the column definition, mysql should choose only ONE collation for comparison, either using the default charset or the one defined in column, but it should be consistent throughtout the whole procedure.

While merging key fields, mysql will compre const items to remove duplicated ones. Given function `f1(f2(f3...))`, only `f1` compares using `item::eq_by_collation` with column collation, `f2` and `f3` compares using `item::eq` with system default one. 

Is such inconsistency acceptable? 

Recalling the test case given eariler, if we use string constants instead of `replace` functions, it works exactly as expected.
----------------------------------------------------
mysql [localhost:8027] {msandbox} (test) > select * from t1 where name = '()' or name = '（）';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.00 sec)

mysql [localhost:8027] {msandbox} (test) > select * from t1 where  name = '（）' or name = '()';
+----+------+
| id | name |
+----+------+
|  1 | ()   |
+----+------+
1 row in set (0.01 sec)

----------------------------------------------------

These two queries give the same anwser.

Hi,

MySQL can not use only one collation for character domain comparisons.

That is why we have coercibility rules for those purposes. Those rules take into the account the precedence that we wrote about .....

Let's simplify the where conditions as
1) `name = f1(const1)`, compare using collation of `name`
2) `name = f2(const2)`, compare using collation of `name`
where f_i() is a function, and const_i represents some constant.

Then, the question is, what is the collation used to compare `f1(const1)` and  `f2(const2)`? 

Whichever it is, there should be only one collation used throughout the whole comparision, isn't it?

From function `Item::eq_by_collation()`, we know that item `f1` will be compared with item `f2` using collation of `name`, BUT arguments `const1` and `const2` are compared using default charset. Is it designed so? Shouldn't const1 and const2 inherit collation from f1 and f2 respectively?

If such behavior is acceptable, we will encounter following case:
1. optimizer uses UTF8MB4 and belives two expressions are equal and optimizes away one of them
2. executor uses UTF8MB3 and fails to find a result for the expression left.

Hi,

If our default collations do not suit your needs or collations that are used for constants, then you can always use introducers.

In any case, there is a possibility that some expressions are not resolved in the optimiser, but in the executor's part ......

Thanks for your reply. But I still believe it is a bug and should be fixed.

Let me rephrase myself.

I guess we both agree that the intention of `Item::eq_by_collation` is to align two items to the same collation and do some comparison.

Then Why align halfway? 

Only the top-most item, i.e., root node of the expression tree, will be aligned. What's the physical logic here?

Hi,

This is not about alignment, but about the conversion.

We have also provided you with the way on how to write your DML's so that conversion is forced. Otherwise, MySQL will simply use coercibility and conversion rules.