MySQL Bugs: #23689: Falcon: searches fail if exotic collation and index exists

Bug #23689	Falcon: searches fail if exotic collation and index exists
Submitted:	26 Oct 2006 20:27	Modified:	20 May 2007 5:36
Reporter:	Peter Gulutzan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	5.1.13-falcon-alpha-pb-debug	OS:	Linux (SUSE 10.0 / 64-bit)
Assigned to:	Kevin Lewis	CPU Architecture:	Any

Description:
I create a Falcon table with a column with collation latin1_german2_ci.
I create an index on the column.
I insert 'Ü' and 'UE'.
I search for 'UE'.
I find only one row.

ChangeSet@1.2346, 2006-10-26 10:08:37-04:00

How to repeat:
mysql> create table tf (s1 char(5) character set latin1 collate latin1_german2_ci) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into tf values ('UE'),('Ü');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tf where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
| Ü   |
+------+
2 rows in set (0.00 sec)

mysql> create index itf on tf (s1);
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from tf where s1 = 'UE';
+------+
| s1   |
+------+
| UE   |
+------+
1 row in set (0.01 sec)

Verified on Linux 32-bit, change set 1.2347, 2006-10-26
with slight modification:

SET NAMES utf8;
CREATE TABLE t1 (a char(5) character set latin1 collate latin1_german2_ci);

INSERT INTO t1 VALUES ('UE'), ('Ü');
SELECT count(*) FROM t1 WHERE a = 'UE';

CREATE INDEX i1 on t1 (a);
SELECT count(*) FROM t1 WHERE a = 'UE';

Added test case falcon_bug_23689.test to 5.1-falcon tree.

Regards, Hakan

Peter,  I am assigning this back to you for some feedback.  After the two records are inserted, Falcon only finds one record  from this query.
   mysql> SELECT count(*) FROM t1 WHERE a = 'UE';
This is apparently the bug, since I assume that with latin1_german2_ci 'UE' and 'Ü' should be seen as the same.  

My question is that when I look for the other one;
   mysql> SELECT count(*) FROM t1 WHERE a = 'Ü';
I get this error;
   ERROR 1267 (HY000): Illegal mix of collations (latin1_german2_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

Is Falcon supposed to apply the latin1_german2_ci collation to all references to field a?  Should a = 'Ü' be COERCIBLE?  Does the result above give you a clue as to what is going on here?  Can you explain to me what COERCIBILITY means?
mysql> SELECT COERCIBILITY('Ü');
+-------------------+
| COERCIBILITY('Ü') |
+-------------------+
|                 4 |
+-------------------+
1 row in set (0.00 sec)

mysql> SELECT COERCIBILITY('UE');
+--------------------+
| COERCIBILITY('UE') |
+--------------------+
|                  4 |
+--------------------+
1 row in set (0.00 sec)

mysql> SELECT COERCIBILITY(a) from t1;
+-----------------+
| COERCIBILITY(a) |
+-----------------+
|               2 |
|               2 |
+-----------------+
2 rows in set (0.00 sec)

Looking at select comments from Kevin Lewis ...

> Falcon only finds one record  from this query.
>   mysql> SELECT count(*) FROM t1 WHERE a = 'UE';
> This is apparently the bug, since I assume that
> with latin1_german2_ci 'UE' and 'Ü' should be seen as the same.  
Yes. You can see what result is correct by applying the same
search with MyISAM or InnoDB engines.

> My question is that when I look for the other one;
>   mysql> SELECT count(*) FROM t1 WHERE a = 'Ü';
> I get this error;
>    ERROR 1267 (HY000): Illegal mix of collations (latin1_german2_ci,IMPLICIT)
>    and (utf8_general_ci,COERCIBLE) for operation '='

I do not get that error. I don't expect it to occur in situations like this.
By the way, please check that "SELECT HEX(s1) FROM tf" returns 5545 and DC,
and that the default engine is Falcon.

> Is Falcon supposed to apply the latin1_german2_ci collation to all
> references to field a?
Yes.

> Should a = 'Ü' be COERCIBLE?
Yes. And, as I mentioned, I get no error when I try the same thing.

> Does the result above give you a clue as to what is going on here?
No.

> Can you explain to me what COERCIBILITY means?
See MySQL Reference Manual, 10.5.4.
"Some Special Cases Where the Collation Determination Is Tricky"
http://dev.mysql.com/doc/refman/5.1/en/charset-collate-tricky.html
It says the column (in this case 'a') should have coercibility=2,
the literal (in this case 'Ü') should have coercibility=4.
That is the case, so no problem so far.
Therefore the collation with the lowest coercibility
(that is, the column's collation) should be used.

I cannot get either innodb or myisam (or Falcon)  to recognize 'UE' and 'Ü' as the same thing using latin1_german2_ci collation.  I am testing on a Windows box, if that make any difference, which it should not.

The test case falcon_bug_23689.test uses  'UE' and 'Ãœ' instead of 'UE' and 'Ü'.  But it makes no difference.  I cannot get the any engine to find more than one occurrance of these different characters even without an index.

mysql> CREATE TABLE ti (a char(5) character set latin1 collate latin1_german2_ci) engine = innodb;
Query OK, 0 rows affected (0.09 sec)

mysql> INSERT INTO ti VALUES ('UE'), ('Ao'), ('Ü');
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT count(*) FROM ti WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM ti WHERE a = 'Ao';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM ti WHERE a = 'Ü';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> CREATE TABLE tm (a char(5) character set latin1 collate latin1_german2_ci) engine = myisam;
Query OK, 0 rows affected (0.44 sec)

mysql> INSERT INTO tm VALUES ('UE'), ('Ao'), ('Ü');
Query OK, 3 rows affected (0.02 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT count(*) FROM tm WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM tm WHERE a = 'Ao';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM tm WHERE a = 'Ü';
+----------+
| count(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

Please try with "SET NAMES utf8".

[21:28] root@test>SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

[21:28] root@test>CREATE TABLE t1 (a char(5) character set latin1 collate latin1_german2_ci) engine myisam;
Query OK, 0 rows affected (0.65 sec)

[21:29] root@test>INSERT INTO t1 VALUES ('UE'), ('Ü');
Query OK, 2 rows affected (0.22 sec)
Records: 2  Duplicates: 0  Warnings: 0

[21:29] root@test>SELECT count(*) FROM t1 WHERE a = 'UE';
+----------+
| count(*) |
+----------+
|        2 |
+----------+

Regards, Hakan

I finally tracked this down to space padding in the index key entry.  it takes two bytes to represent the collatable character, and one value uses two bytes, the other uses four.  I need the changes for WL#3664 in order to fix this.  These changes will be merged into falcon soon.

Using strnncollsp() allows the MySQL character set code to compare character set weights one by one with implied space padding.  This reduces the amount of prep work that Falcon has to do.

falcon_bug_23689 passes now:

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_23689               [ pass ]            265
-------------------------------------------------------
Stopping All Servers

A note has been added to the 6.0.1 changelog.