Bug #24858 Falcon: searches fail if partial index on utf8 column
Submitted: 6 Dec 2006 17:29 Modified: 20 May 2007 5:36
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:5.1.13-falcon-alpha-debug OS:Linux (SUSE 10.0 / 64-bit)
Assigned to: Kevin Lewis CPU Architecture:Any

[6 Dec 2006 17:29] Peter Gulutzan
Description:
I create a table with a column with a partial index.
I search it.
With engine=falcon, I get 0 rows. With engine=myisam, I get 1 row.

(So far the description is the same as the one for bug#23691.)

This time I'm using a utf8 column with a special character
which requires more than one byte.

ChangeSet@1.2391, 2006-12-06

How to repeat:
mysql> create table t (s1 char(5) character set utf8) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> create index i on t (s1(3));
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into t values ('abÜde');
Query OK, 1 row affected (0.04 sec)

mysql> select * from t where s1 = 'abÜde';
Empty set (0.00 sec)
[7 Dec 2006 10:39] Hakan Küçükyılmaz
Added test case falcon_bug_24858.test to 5.1-falcon tree.

Regards, Hakan
[7 Dec 2006 15:18] Kevin Lewis
I did a short evaluation of this because it was similar to code I recently worked on.  This problem is unique to the UTF8 character set.  Since each character is variable length, the current algorithm inside MySQLCollation::makeKey does not work.  Putting this aside until after alpha.  

But we can document that UTF8 character sets do not work with Falcon indexes yet.  After documenting, please put this back into the open state.
[7 Dec 2006 16:25] Peter Gulutzan
I have changed to "Need Doc Info" -- sorry, I'm just hoping that
Kevin can clarify a few points from his last comment, before
giving it back to MC.

1. The problem is not "unique to the UTF8 character set",
as this example shows:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t (s1 char(5) character set sjis) engine=falcon;
Query OK, 0 rows affected (0.56 sec)

mysql> create index i on t (s1(3));
Query OK, 0 rows affected (0.12 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into t values ('abペde');
Query OK, 1 row affected (0.01 sec)

mysql> select * from t where s1 = 'abペde';
Empty set (0.00 sec)

So should the statement not really be that
"all" multi-byte character sets can have
the failure? Or, perhaps, all multi-byte
variable-length character sets, which
excludes ucs2?

2. The bug report is about "partial" indexes.
But the suggested phrase is "do not work with
Falcon indexes" -- that is, not just partial
indexes. So should the statement not really be
that they "do not work with partial Falcon
indexes"?
[7 Dec 2006 17:04] Kevin Lewis
Thanks for jumping on this.  I think the problem I saw in the code is associated with variable length character sets.  But I only traced it through the handling of the UTF8 testcase.  It may also be limited to partial indexes as well, but I suspect not.  I did not spend too much time narowing it down since I know Calvin wants me to stay focused on P1 & P2 bugs.

Could you do a little testing to see if this is a problem when partial keys are not used?
[8 Dec 2006 17:06] Peter Gulutzan
Here is another test case, as discussed in dev-jstar thread
"RE: FW: #24858 [Com,Doc->NDI]: Falcon: searches fail if partialindex on utf8
column"

There is a problem with indexed UTF8 columns when I search
for a character A, and there is a character A-WITH-AN-ACCENT-MARK
that should be equal to A in the default collation.

mysql> create table t9 (s1 varchar(4) character set utf8) engine=falcon;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t9 values ('ä');
Query OK, 1 row affected (0.00 sec)

mysql> create index it9 on t9 (s1);
Query OK, 1 row affected (0.03 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> select * from t9 where s1 = 'a';
Empty set (0.00 sec)

mysql> select * from t9 where s1 > 'a';
Empty set (0.00 sec)

mysql> select * from t9 where s1 < 'a';
Empty set (0.00 sec)

That is, ä isn't greater than, or less than, or equal to, a.

By the way, it's still not quite right with latin1 either,
but I regard that as just another case of bug#23689. At least
one of the SELECTs works when I use latin1, even though it's
the wrong answer.

Kevin Lewis recommended:
"This may also be associated with not knowing the actual length of the utf8
string to compare against.  So I would recommend keeping it associated with
bug#24858 for now."
[22 Dec 2006 16:23] Kevin Lewis
The problem here is that the index string is different than the comparison string because they are trimmed differently.  Prefix handling is inconsistent when the encoding is variable byte.

I have a hacked up fix for this which counts up the characters in a string and truncates it accordingly for prefix (partial) key searches.  But I will hold off pushing it until after alpha.  For one thing, a better fix is on the way.  

Bar is making a change to the ctype implementation of all calls to strnxfrm so that it will not pad strings with spaces unless the flag is set. See Worklog  Also, it will implement a limit on the number of characters (more accurately, weights) that is put into the output.  

With this function, Falcon can be sure it gets correct and consistent results from MySQLCollation::makeKey() which is called for all keys before they are compared.
[22 Dec 2006 16:56] Kevin Lewis
The associated worklog is #3664.
[25 Jan 2007 15:38] Hakan Küçükyılmaz
falcon_bug_24858 passes now:

=======================================================
Starting Tests in the 'main' suite

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_24858               [ pass ]           1407
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 1.407 seconds actually executing testcases

Regards, Hakan
[28 Jan 2007 17:39] Peter Gulutzan
Here are three more test cases. They don't involve prefix indexes,
but (see the comment on 2006-12-08) I believe the desire is to
see all bugs in the same broad category as bug#24858 instances.

Test case #1. U+2020 is "Dagger".

mysql> create table tb (s1 char(5) character set ucs2) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> create index itb on tb (s1);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into tb values (0x2020);
Query OK, 1 row affected (0.00 sec)

mysql> select * from tb where s1 < '';
Empty set (0.00 sec)

mysql> select * from tb where s1 = '';
Empty set (0.00 sec)

mysql> select * from tb where s1 > '';
Empty set (0.00 sec)

That is, dagger isn't greater than, or less than, or equal to, ''.

Test case #2.

mysql> create table td (s1 char(3) character set utf8) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> create index itd on td (s1);
Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into td values ('a'),('ä');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from td where s1 = 'a';
+------+
| s1   |
+------+
| a    |
+------+
1 row in set (0.00 sec)

Test case #3.

mysql> create table tg (s1 varchar(7) character set utf8 collate utf8_unicode_ci, s2 varchar(7) character set utf8 collate utf8_unicode_ci) engine=falcon;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into tg values (0x41414141000020,0x41414141000020);
Query OK, 1 row affected (0.00 sec)

mysql> insert into tg values (0x41414141000000,0x41414141000000);
Query OK, 1 row affected (0.00 sec)

mysql> insert into tg values (0x41414141041000,0x41414141041000);
Query OK, 1 row affected (0.00 sec)

mysql> select hex(s1),s1 from tg where s1 = 0x41414141;
+----------------+---------+
| hex(s1)        | s1      |
+----------------+---------+
| 41414141000020 | AAAA    |
| 41414141000000 | AAAA    |
| 41414141041000 | AAAA  |
+----------------+---------+
3 rows in set (0.00 sec)

mysql> create index itg on tg (s1);
Query OK, 3 rows affected (0.04 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select hex(s1),s1 from tg where s1 = 0x41414141;
+----------------+---------+
| hex(s1)        | s1      |
+----------------+---------+
| 41414141000020 | AAAA    |
| 41414141000000 | AAAA    |
+----------------+---------+
2 rows in set (0.00 sec)
[9 May 2007 19:01] Kevin Lewis
Fixed with recent code changes for handling MySQL character sets.
[14 May 2007 15:45] Hakan Küçükyılmaz
falcon_bug_24858 passes now:

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_24858               [ pass ]             86
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 0.086 seconds actually executing testcases
[16 May 2007 16:20] Kevin Lewis
Added the three other testcases specified by Peter in this bug to falcon_bug_24858.test script.  Also made a change to Falcon to avoid trimming strings that come from the record and are compared to the record since strnncollsp() handles all the space padding issues for us.  This fixes the testcase involving the dagger character, which was not working previously.
[20 May 2007 5:36] MC Brown
A note has been added to the 6.0.1 changelog.