Bug #20353 Prefix key search return incorrect results for ignorable characters
Submitted: 9 Jun 2006 11:12 Modified: 24 Nov 2008 15:30
Reporter: Alexander Barkov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:4.1, 5.0, 5.1 OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[9 Jun 2006 11:12] Alexander Barkov
Description:
The bug appears with UCA-based collations.
If a string begins with an ignorable character (i.e. which does not have weight,
then searches involving a prefix key do not return all results.

A list of some ignorable characters:
0000  ; [.0000.0000.0000.0000] # [0000] NULL (in 6429)
0001  ; [.0000.0000.0000.0000] # [0001] START OF HEADING (in 6429)
0002  ; [.0000.0000.0000.0000] # [0002] START OF TEXT (in 6429)
0003  ; [.0000.0000.0000.0000] # [0003] END OF TEXT (in 6429)
0004  ; [.0000.0000.0000.0000] # [0004] END OF TRANSMISSION (in 6429)
0005  ; [.0000.0000.0000.0000] # [0005] ENQUIRY (in 6429)
0006  ; [.0000.0000.0000.0000] # [0006] ACKNOWLEDGE (in 6429)
0007  ; [.0000.0000.0000.0000] # [0007] BELL (in 6429)
0008  ; [.0000.0000.0000.0000] # [0008] BACKSPACE (in 6429)
000E  ; [.0000.0000.0000.0000] # [000E] SHIFT OUT (in 6429)
000F  ; [.0000.0000.0000.0000] # [000F] SHIFT IN (in 6429)
0010  ; [.0000.0000.0000.0000] # [0010] DATA LINK ESCAPE (in 6429)
0011  ; [.0000.0000.0000.0000] # [0011] DEVICE CONTROL ONE (in 6429)
0012  ; [.0000.0000.0000.0000] # [0012] DEVICE CONTROL TWO (in 6429)
0013  ; [.0000.0000.0000.0000] # [0013] DEVICE CONTROL THREE (in 6429)
0014  ; [.0000.0000.0000.0000] # [0014] DEVICE CONTROL FOUR (in 6429)
0015  ; [.0000.0000.0000.0000] # [0015] NEGATIVE ACKNOWLEDGE (in 6429)
0016  ; [.0000.0000.0000.0000] # [0016] SYNCHRONOUS IDLE (in 6429)
0017  ; [.0000.0000.0000.0000] # [0017] END OF TRANSMISSION BLOCK (in 6429)
0018  ; [.0000.0000.0000.0000] # [0018] CANCEL (in 6429)
0019  ; [.0000.0000.0000.0000] # [0019] END OF MEDIUM (in 6429)
001A  ; [.0000.0000.0000.0000] # [001A] SUBSTITUTE (in 6429)
001B  ; [.0000.0000.0000.0000] # [001B] ESCAPE (in 6429)
001C  ; [.0000.0000.0000.0000] # [001C] FILE SEPARATOR (in 6429)
001D  ; [.0000.0000.0000.0000] # [001D] GROUP SEPARATOR (in 6429)
001E  ; [.0000.0000.0000.0000] # [001E] RECORD SEPARATOR (in 6429)
001F  ; [.0000.0000.0000.0000] # [001F] UNIT SEPARATOR (in 6429)
007F  ; [.0000.0000.0000.0000] # [007F] DELETE (in 6429)

The full list can be found here:
ftp://ftp.unicode.org/Public/UCA/latest/allkeys.txt

Ignorable characters are those having weights equal to 0000.0000.0000.0000.

How to repeat:
DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (a varchar(128)) character set utf8 collate utf8_unicode_ci;
INSERT INTO t1 VALUES ('aaaa'),('bbbb'),('cccc');
INSERT INTO t1 VALUES (concat(0x01,'aaaa'));
INSERT INTO t1 VALUES (concat(0x01,'bbbb'));
INSERT INTO t1 VALUES (concat(0x01,'cccc'));
SELECT a, hex(a) FROM t1 WHERE a='aaaa';
ALTER TABLE t1 ADD KEY (a(1));
SELECT a, hex(a) FROM t1 WHERE a='aaaa';

The first SELECT correctly returns two rows:

mysql> SELECT a, hex(a) FROM t1 WHERE a='aaaa'
+-------+-------------+
| a      | hex(a)       |
+-------+-------------+
| aaaa | 61616161   |
| aaaa | 0161616161|
+-------+-------------+

The same query after adding a prefix key 
returns only one row:

mysql> SELECT a, hex(a) FROM t1 WHERE a='aaaa'
+------+-----------+
| a      | hex(a)    |
+------+-----------+
| aaaa | 61616161|
+------+-----------+

The problem disappears if we create a full key, instead of a prefix key:
ALTER TABLE t1 ADD KEY (a);

Suggested fix:
When generating a prefix-key value, we should skip ingorable characters,
so that only non-ignorable characters are written into the keys.
[9 Jun 2006 11:16] Alexander Barkov
Changing priority to P2, as SELECT returns wrong result set.
[8 Dec 2006 16:39] Jonathan Miller
Hi,

I created a test case for this bug and will be commiting it soon, but the results of the test shows that there maybe some other issues here as well.

+ Note  1051    Unknown table 't1'
+ CREATE TABLE t1 (a varchar(128)) character set utf8 collate utf8_unicode_ci;
+ "Full Row Count"
+ SELECT COUNT(*) FROM t1;
+ COUNT(*)
+ 444

There are 444 rows that have ignorable character. When doing a SELECT w/o and keys, only 109 are returned.

+
+ "Count w/o key where a='aaaa';"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 109
+

When adding the prefix, only one row is returned until we have a prefix key of (a(5)

+ "Count w/ Perfix Key (a(1))"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 1
+
+ "Count w/ Perfix Key (a(2))"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 1
+
+ "Count w/ Perfix Key (a(3))"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 1
+
+ "Count w/ Perfix Key (a(4))"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 1
+
+ "Count w/ Perfix Key (a(5))"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 77

Even with a full key, we are only getting back 109.

+
+ "Count w/ Full Key"
+ SELECT COUNT(*) FROM t1 WHERE a='aaaa';
+ COUNT(*)
+ 109
+

/jeb
[8 Dec 2006 16:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16673

ChangeSet@1.2376, 2006-12-08 17:41:49+01:00, jmiller@mysql.com +2 -0
  prefix_key_search.result, prefix_key_search.test:
    Test case for Bug #20353  Prefix key search return incorrect results for ignorable characters
[11 Dec 2006 11:43] Alexander Barkov
The list of primary ignorable characters

Attachment: bad-keys.txt (text/plain), 31.80 KiB.

[11 Dec 2006 15:24] Jonathan Miller
Hi Bar,

I am glad that you like the test case. You can make the changes if you wish, or I can correct it per your description and resubmit for you. Either way. Please let me know.

Best wishes,
/Jeb
[9 Dec 2008 16:33] Omer Barnir
triage: changing tag from SR60BETA to SR61BTA as bug should be pushed to 6.x (see above comment)