Bug #34479 Falcon: search failure with indexed ucs2 varchar
Submitted: 12 Feb 2008 2:39 Modified: 26 May 2010 17:49
Reporter: Peter Gulutzan Email Updates:
Status: Unsupported Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:6.0.5-alpha-debug OS:Linux (SUSE 10 | 32-bit)
Assigned to: Lars-Erik Bjørk
Tags: F_ENCODING
Triage: Triaged: D2 (Serious) / R3 (Medium) / E3 (Medium)

[12 Feb 2008 2:39] Peter Gulutzan
Description:
I'm using mysql-6.0-falcon.

I create an indexed Falcon table with a UCS2 column.
I insert some variants of the letter 'A'.
I search for values less than 'Ġ'
(LATIN CAPITAL LETTER G WITH DOT ABOVE, 0x0120).
I do not find all the variants of the letter 'A'.

Possibly, since the last byte of Ġ is 0x20,
this is related to
Bug#23692 Falcon: searches fail if data is 0x00

How to repeat:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t20 (s1 varchar(5) character set ucs2) engine=falcon;
Query OK, 0 rows affected (0.07 sec)

mysql> /* The following values are: 0x0100, 0x0101, 0x0102, 0x0103, 0x0104 */
mysql> insert into t20 values ('Ā'),('ā'),('Ă'),('ă'),('Ą');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from t20 where s1 < 'Ġ';
+------+
| s1   |
+------+
| Ā   |
| ā   |
| Ă   |
| ă   |
| Ą   |
+------+
5 rows in set (0.00 sec)

mysql> create index i on t20 (s1);
Query OK, 5 rows affected (0.15 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from t20 where s1 < 'Ġ';
+------+
| s1   |
+------+
| Ā   |
+------+
1 row in set (0.00 sec)
[12 Feb 2008 9:49] Miguel Solorzano
Thank you for the bug report.
[26 Jun 2008 16:26] Peter Gulutzan
As suspected, the bug appears with other characters
whose final byte is 0x20. There are many such
characters. One of them is common:
0420;CYRILLIC CAPITAL LETTER ER
[26 Jun 2008 21:50] Kevin Lewis
Lars-Erik, I had been meaning to assign this bug to you along with a few other character-set related bugs because I think it is an isolated, yet important part of the Falcon design.  I used to focus a lot of time on these, but never 'finished' them all.  I can spend time with you on this area of code as you debug it.
[22 Oct 2008 10:29] Lars-Erik Bjørk
I can also reproduce this scenario without creating the index, but simply by creating the table, inserting the values and doing the select.
[2 Dec 2008 13:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/60379

2922 lars-erik.bjork@sun.com	2008-12-02
      This is a patch for bug#34479
      Falcon: search failure with indexed ucs2 varchar
      
      When computing the key length, trailing spaces are removed. This was
      done by looking at a single byte at a time. When using f. ex the ucs2
      character set (where every character is represented using two bytes), 
      this would result in a character ending in 0x20, (f .ex Ä  (0x0120) )
      having its final byte 'trimmed'. I have now implemented a charset-wise
      version, taking into account the varying lengths of multi-byte
      sequences of different character sets.
      
      added file 'mysql-test/suite/falcon/t/falcon_bug_34479.test'
      ------------------------------------------------------------
      This is a test file testing the patch. It is based on the
      bug report
      
      added file 'mysql-test/suite/falcon/r/falcon_bug_34479.result'
      --------------------------------------------------------------
      This is the result file for the test. It states the expected
      output
      
      
      modified file 'storage/falcon/MySQLCollation.cpp'
      -------------------------------------------------
      Modified the function computeKeyLength( ... ) 
      
      Earlier, when removing trailing pad characters (' '), the function
      only looked at a single byte at a time. I have implemented a
      charset-wise implementation that translates the pad character into the
      relevant character set, and that compares relative to the (possible)
      multi-byte sequences of the different character sets.
      
      
      modified file 'storage/falcon/MySQLCollation.h'
      -----------------------------------------------
      Changed the function computeKeyLength( ...) not to be inline, because the
      implementation grew big enough to clutter the header file.
      
      
      modified file 'storage/falcon/ha_falcon.cpp'
      --------------------------------------------
      Added some functions giving access to the character set functions
      
      * int falcon_conv_uni_cs ( ... )
        - Converts a character to the given character set
      * unsigned int falcon_get_mbminlen ( ... )
        - Returns the minimum multi-byte sequence for the given charset
      
      * uint falcon_get_mbcharlen( ... )
        - Returns the length of the current multi-byte sequence if the
          pointer given points to a valid header, 0 otherwise
[2 Dec 2008 13:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/60380

2922 lars-erik.bjork@sun.com	2008-12-02
      This is a patch for bug#34479
      Falcon: search failure with indexed ucs2 varchar
      
      When computing the key length, trailing spaces are removed. This was
      done by looking at a single byte at a time. When using f. ex the ucs2
      character set (where every character is represented using two bytes), 
      this would result in a character ending in 0x20, (f .ex Ä  (0x0120) )
      having its final byte 'trimmed'. I have now implemented a charset-wise
      version, taking into account the varying lengths of multi-byte
      sequences of different character sets.
      
      added file 'mysql-test/suite/falcon/t/falcon_bug_34479.test'
      ------------------------------------------------------------
      This is a test file testing the patch. It is based on the
      bug report
      
      added file 'mysql-test/suite/falcon/r/falcon_bug_34479.result'
      --------------------------------------------------------------
      This is the result file for the test. It states the expected
      output
      
      
      modified file 'storage/falcon/MySQLCollation.cpp'
      -------------------------------------------------
      Modified the function computeKeyLength( ... ) 
      
      Earlier, when removing trailing pad characters (' '), the function
      only looked at a single byte at a time. I have implemented a
      charset-wise implementation that translates the pad character into the
      relevant character set, and that compares relative to the (possible)
      multi-byte sequences of the different character sets.
      
      
      modified file 'storage/falcon/MySQLCollation.h'
      -----------------------------------------------
      Changed the function computeKeyLength( ...) not to be inline, because the
      implementation grew big enough to clutter the header file.
      
      
      modified file 'storage/falcon/ha_falcon.cpp'
      --------------------------------------------
      Added some functions giving access to the character set functions
      
      * int falcon_conv_uni_cs ( ... )
        - Converts a character to the given character set
      * unsigned int falcon_get_mbminlen ( ... )
        - Returns the minimum multi-byte sequence for the given charset
      
      * uint falcon_get_mbcharlen( ... )
        - Returns the length of the current multi-byte sequence if the
          pointer given points to a valid header, 0 otherwise
[3 Dec 2008 19:16] Alexander Barkov
Setting back to "In progress" as it's not correct to remove
trailing spaces and minSortChar before strnncoll call.

More comments sent by email.
[19 Dec 2008 11:34] Lars-Erik Bjørk
I closed bug#30282 to be a duplicate of this one, since this is the bug with the highest priority an a beta tag. Bug#30282 could be reproduced in the following ways, which should be added to the test for this bug as well.

Alternative 1:
-----------
mysql> create table t6 (s1 char(1)) engine=falcon;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t6 values (0x20),(0x00);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t6 where s1 = 0x20;
+------+
| s1   |
+------+
|      |
+------+
1 row in set (0.00 sec)

mysql> select * from t6 where s1 = 0x00;
+------+
| s1   |
+------+
|      |
+------+
1 row in set (0.00 sec)

mysql> create unique index i6 on t6 (s1);
ERROR 1582 (23000): Duplicate entry '' for key 'i6'

Alternative 2:
-----------
root@test>CREATE TABLE t2 (a varchar(1) CHARACTER SET UTF8 COLLATE UTF8_BIN
PRIMARY KEY) Engine Falcon;
Query OK, 0 rows affected (0.01 sec)

root@test>INSERT INTO t2 (a) VALUES (UNHEX('00'));
Query OK, 1 row affected (0.00 sec)

root@test>INSERT INTO t2 (a) VALUES (UNHEX('20'));
ERROR 1062 (23000): Duplicate entry ' ' for key 'PRIMARY'
6.0.9-alpha