Bug #41350 String expression collation determined only by the first equality operand
Submitted: 10 Dec 2008 8:00 Modified: 10 Dec 2008 8:30
Reporter: Ilan Hazan Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.0.46-log OS:Linux
Assigned to:
Tags: casting, type conversion
Triage: Triaged: D3 (Medium)

[10 Dec 2008 8:00] Ilan Hazan
Description:
String expression collation determined only by the first equality expression operand.
As a result the way MySQL evaluates the expression is depends by the order of the operands. 

How to repeat:
I have two tables:

CREATE TABLE `TABLEY` (
  `MAIN` char(1) default NULL,
  `P_ID` char(10) NOT NULL,
  `NAME` char(89) default NULL,
...
  PRIMARY KEY  (`P_ID`),
  KEY `NAME` (`NAME`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

CREATE TABLE `TABLEX` (
  `ENTRY_ID` int(11) NOT NULL auto_increment,
  `OUTSOURCE_KEY` varchar(255) character set utf8 collate utf8_bin NOT NULL,
  ......
  PRIMARY KEY  (`ENTRY_ID`),
  UNIQUE KEY `OUTSOURCE_KEY` (`OUTSOURCE_KEY`)
) ENGINE=MyISAM AUTO_INCREMENT=70489 DEFAULT CHARSET=utf8

I have the query:
SELECT x.ENTRY_ID, x.OUTSOURCE_KEY, y.*  FROM  TABLEX x JOIN TABLEY y WHERE (x.OUTSOURCE_KEY = y.P_ID ) AND x.ENTRY_ID =10006

mysql> EXPLAIN SELECT x.ENTRY_ID, x.OUTSOURCE_KEY, y.*  FROM  TABLEX x JOIN TABLEY y WHERE (x.OUTSOURCE_KEY = y.P_ID ) AND x.ENTRY_ID =10006\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: x
         type: const
possible_keys: PRIMARY,OUTSOURCE_KEY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: y
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1048582
        Extra: Using where
2 rows in set (0.00 sec)

In this case MySQL determine that the collation of the expression  (x.OUTSOURCE_KEY = y.P_ID )  is utf8_bin hence it needs to convert the  TABLEY.P_ID column to utf8_bin. This means full table scan.

mysql>
mysql>
mysql>
mysql> EXPLAIN SELECT x.ENTRY_ID, x.OUTSOURCE_KEY, y.*  FROM  TABLEX x JOIN TABLEY y WHERE (y.P_ID = x.OUTSOURCE_KEY ) AND x.ENTRY_ID =10006\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ds
         type: const
possible_keys: PRIMARY,OUTSOURCE_KEY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra:
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: da
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 30
          ref: NULL
         rows: 1
        Extra: Using where
2 rows in set (0.01 sec)

In this case MySQL determine that the collation of the expression (y.P_ID = x.OUTSOURCE_KEY ) is utf8 (not bin) hence it needs to convert only  the given TABLEX.OUTSOURCE_KEY  value to utf8.

Suggested fix:
Need to determine a rule for String comparison involving different collation and force it.
[10 Dec 2008 8:30] Valerii Kravchuk
Thank you for a problem report. I do not see anything in the manual (including http://dev.mysql.com/doc/refman/5.0/en/charset-collate-tricky.html) that explains how collation is determined in cases like this (utf8_*_ci vs. utf8_bin), so this is at least a valid documentation request.
[19 Dec 2008 7:07] Alexander Barkov
A full script repeating the same problem:

DROP TABLE IF EXISTS t1, t2;

CREATE TABLE t1 (
  entry_id int(11) NOT NULL auto_increment PRIMARY KEY,
  outsource_key varchar(255) character set utf8 collate utf8_bin NOT NULL,
   UNIQUE KEY outsource_key (outsource_key)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO t1 VALUES (1,'key1'),(2,'key2'),(10006,'key10006');

CREATE TABLE t2 (
  p_id char(10) NOT NULL PRIMARY KEY,
  NAME char(89) default NULL,
  KEY NAME (NAME)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO t2 VALUES ('key1','name1'),('key2','name2'),('key10006','name10006');

EXPLAIN
SELECT
  t1.entry_id,
  t1.outsource_key,
  t2.*
FROM
  t1 JOIN t2
WHERE  
 (t1.outsource_key = t2.p_id ) AND t1.entry_id =10006;

EXPLAIN
SELECT
  t1.entry_id,
  t1.outsource_key,
  t2.*
FROM
  t1 JOIN t2
WHERE
  (t2.p_id = t1.outsource_key ) AND t1.entry_id =10006;

And the output is:

+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
| id | select_type | table | type  | possible_keys         | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | t1    | const | PRIMARY,outsource_key | PRIMARY | 4       | const |    1 |             | 
|  1 | SIMPLE      | t2    | ALL   | PRIMARY               | NULL    | NULL    | NULL  |    3 | Using where | 
+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
| id | select_type | table | type  | possible_keys         | key     | key_len | ref   | rows | Extra       |
+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | t1    | const | PRIMARY,outsource_key | PRIMARY | 4       | const |    1 |             | 
|  1 | SIMPLE      | t2    | range | PRIMARY               | PRIMARY | 30      | NULL  |    1 | Using where | 
+----+-------------+-------+-------+-----------------------+---------+---------+-------+------+-------------+
[19 Dec 2008 7:38] Alexander Barkov
The order of arguments does not affect the way how the collation
for the operation is determined. It is always utf8_bin in this case.

I think there are two problems here:

1. This part seems to be missing in the manual:

When mixing a _ci collation and a _bin collation,
and coercibility of the operands are the same,
the _bin collation wins. 

2. Possibly it is a bug in the optimizer. The plan
depends on the order of the arguments. The plan for the
first query is not optimal. I think that in both cases
the optimizer should choose the second plan.
[5 Jan 2009 19:44] Paul Dubois
I have added Bar's information about mixing _bin/_ci collations to the tricky-collation section.
[13 Jan 2009 16:15] Paul Dubois
I am unassigning myself from this report because it has been recategorized as an optimizer report.
[26 Feb 2009 22:09] Omer Barnir
triage: setting tag to CHECKED (3144) - triaging as an optimization bug