Bug #23689 Falcon: searches fail if exotic collation and index exists
Submitted: 26 Oct 2006 20:27 Modified: 20 May 2007 5:36
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:5.1.13-falcon-alpha-pb-debug OS:Linux (SUSE 10.0 / 64-bit)
Assigned to: Kevin Lewis CPU Architecture:Any

[26 Oct 2006 20:27] Peter Gulutzan
Description:
I create a Falcon table with a column with collation latin1_german2_ci.
I create an index on the column.
I insert 'Ü' and 'UE'.
I search for 'UE'.
I find only one row.

ChangeSet@1.2346, 2006-10-26 10:08:37-04:00

How to repeat:
mysql> create table tf (s1 char(5) character set latin1 collate latin1_german2_ci) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into tf values ('UE'),('Ü');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tf where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> create index itf on tf (s1);
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tf where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
+------+
1 row in set (0.01 sec)
[27 Oct 2006 19:31] Hakan Küçükyılmaz
Verified on Linux 32-bit, change set 1.2347, 2006-10-26
with slight modification:

SET NAMES utf8;
CREATE TABLE t1 (a char(5) character set latin1 collate latin1_german2_ci);

INSERT INTO t1 VALUES ('UE'), ('Ü');
SELECT count(*) FROM t1 WHERE a = 'UE';

CREATE INDEX i1 on t1 (a);
SELECT count(*) FROM t1 WHERE a = 'UE';

Added test case falcon_bug_23689.test to 5.1-falcon tree.

Regards, Hakan
[13 Mar 2007 3:00] Kevin Lewis
Peter,  I am assigning this back to you for some feedback.  After the two records are inserted, Falcon only finds one record  from this query.
   mysql> SELECT count(*) FROM t1 WHERE a = 'UE';
This is apparently the bug, since I assume that with latin1_german2_ci 'UE' and 'Ü' should be seen as the same.  

My question is that when I look for the other one;
   mysql> SELECT count(*) FROM t1 WHERE a = 'Ü';
I get this error;
   ERROR 1267 (HY000): Illegal mix of collations (latin1_german2_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

Is Falcon supposed to apply the latin1_german2_ci collation to all references to field a?  Should a = 'Ü' be COERCIBLE?  Does the result above give you a clue as to what is going on here?  Can you explain to me what COERCIBILITY means?
mysql> SELECT COERCIBILITY('Ü');
+-------------------+
| COERCIBILITY('Ü') |
+-------------------+
|                 4 |
+-------------------+
1 row in set (0.00 sec)

mysql> SELECT COERCIBILITY('UE');
+--------------------+
| COERCIBILITY('UE') |
+--------------------+
|                  4 |
+--------------------+
1 row in set (0.00 sec)

mysql> SELECT COERCIBILITY(a) from t1;
+-----------------+
| COERCIBILITY(a) |
+-----------------+
|               2 |
|               2 |
+-----------------+
2 rows in set (0.00 sec)
[13 Mar 2007 16:56] Peter Gulutzan
Looking at select comments from Kevin Lewis ...

> Falcon only finds one record  from this query.
>   mysql> SELECT count(*) FROM t1 WHERE a = 'UE';
> This is apparently the bug, since I assume that
> with latin1_german2_ci 'UE' and 'Ü' should be seen as the same.  
Yes. You can see what result is correct by applying the same
search with MyISAM or InnoDB engines.

> My question is that when I look for the other one;
>   mysql> SELECT count(*) FROM t1 WHERE a = 'Ü';
> I get this error;
>    ERROR 1267 (HY000): Illegal mix of collations (latin1_german2_ci,IMPLICIT)
>    and (utf8_general_ci,COERCIBLE) for operation '='

I do not get that error. I don't expect it to occur in situations like this.
By the way, please check that "SELECT HEX(s1) FROM tf" returns 5545 and DC,
and that the default engine is Falcon.

> Is Falcon supposed to apply the latin1_german2_ci collation to all
> references to field a?
Yes.

> Should a = 'Ü' be COERCIBLE?
Yes. And, as I mentioned, I get no error when I try the same thing.

> Does the result above give you a clue as to what is going on here?
No.

> Can you explain to me what COERCIBILITY means?
See MySQL Reference Manual, 10.5.4.
"Some Special Cases Where the Collation Determination Is Tricky"
http://dev.mysql.com/doc/refman/5.1/en/charset-collate-tricky.html
It says the column (in this case 'a') should have coercibility=2,
the literal (in this case 'Ü') should have coercibility=4.
That is the case, so no problem so far.
Therefore the collation with the lowest coercibility
(that is, the column's collation) should be used.
[19 Mar 2007 19:10] Kevin Lewis
I cannot get either innodb or myisam (or Falcon)  to recognize 'UE' and 'Ü' as the same thing using latin1_german2_ci collation.  I am testing on a Windows box, if that make any difference, which it should not.

The test case falcon_bug_23689.test uses  'UE' and 'Ãœ' instead of 'UE' and 'Ü'.  But it makes no difference.  I cannot get the any engine to find more than one occurrance of these different characters even without an index.

mysql> CREATE TABLE ti (a char(5) character set latin1 collate latin1_german2_ci) engine = innodb;
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO ti VALUES ('UE'), ('Ao'), ('Ü');
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT count(*) FROM ti WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM ti WHERE a = 'Ao';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM ti WHERE a = 'Ü';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> CREATE TABLE tm (a char(5) character set latin1 collate latin1_german2_ci) engine = myisam;
Query OK, 0 rows affected (0.44 sec)

mysql> INSERT INTO tm VALUES ('UE'), ('Ao'), ('Ü');
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT count(*) FROM tm WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM tm WHERE a = 'Ao';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM tm WHERE a = 'Ü';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)
[19 Mar 2007 20:30] Hakan Küçükyılmaz
Please try with "SET NAMES utf8".

[21:28] root@test>SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

[21:28] root@test>CREATE TABLE t1 (a char(5) character set latin1 collate latin1_german2_ci) engine myisam;
Query OK, 0 rows affected (0.65 sec)

[21:29] root@test>INSERT INTO t1 VALUES ('UE'), ('Ü');
Query OK, 2 rows affected (0.22 sec)
Records: 2  Duplicates: 0  Warnings: 0

[21:29] root@test>SELECT count(*) FROM t1 WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        2 |
+----------+

Regards, Hakan
[22 Mar 2007 14:38] Kevin Lewis
I finally tracked this down to space padding in the index key entry.  it takes two bytes to represent the collatable character, and one value uses two bytes, the other uses four.  I need the changes for WL#3664 in order to fix this.  These changes will be merged into falcon soon.
[9 May 2007 17:00] Kevin Lewis
Using strnncollsp() allows the MySQL character set code to compare character set weights one by one with implied space padding.  This reduces the amount of prep work that Falcon has to do.
[14 May 2007 15:37] Hakan Küçükyılmaz
falcon_bug_23689 passes now:

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_23689               [ pass ]            265
-------------------------------------------------------
Stopping All Servers
[20 May 2007 5:36] MC Brown
A note has been added to the 6.0.1 changelog.