Bug #32510 LIKE search fails with indexed 'eucjpms' and 'ujis' char column
Submitted: 20 Nov 2007 7:55 Modified: 31 Mar 2008 16:46
Reporter: Mikiya Okuno Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.50 OS:Any
Assigned to: Alexander Barkov CPU Architecture:Any
Tags: bfsm_2007_12_06, charset eucjpms like
Triage: D2 (Serious)

[20 Nov 2007 7:55] Mikiya Okuno
Description:
SELECT statement returns a incorrect result if the following conditions are met.

- CHARSET is 'eucjpms' or 'ujis'
- LIKE clause is specified for patten matching.
- Pattern string includes more than 20 bytes characters other than wild cards.
- The column used for pattern matching is indexed.

This is similar to bug# 20471, which dealt with 'utf8' character set. The same query with 'utf8' character set succeeds, so the bug# 20471 is certainly fixed. Then another problem is happening around 'eucjpms' and 'ujis' character sets.

I tried both MyISAM and InnoDB tables, and the SELECT query didn't return a correct result on both tables.

How to repeat:
Please beware of attached three files.
- DDL.sql
- select.sql
- examples.txt

'DDL.sql' includes statements to create the table and populate the data into it. 'select.sql' includes the SELECT statement with the problematic pattern.

1) connect to the server with 'eucjpms' charset.

shell> mysql --default-character-set=eucjpms

2) pls use some default database

mysql> use dbname

3) create the table and populate the data to it

mysql> SOURCE DDL.sql;

4) execute the problematic query

mysql> SOURCE select.sql
[20 Nov 2007 7:56] Mikiya Okuno
Statements to create the table and populate the data into it.

Attachment: DDL.sql (, text), 618 bytes.

[20 Nov 2007 7:58] Mikiya Okuno
The SELECT statement with a problematic pattern.

Attachment: select.sql (, text), 58 bytes.

[20 Nov 2007 7:59] Mikiya Okuno
Example result.

Attachment: query_and_result.txt (text/plain), 1.89 KiB.

[27 Nov 2007 5:52] Yasufumi Kinoshita
I think this is "critical" bug!

This kind of problem occurs when we use the following collations.

ujis_japanese_ci
euckr_korean_ci
gb2312_chinese_ci
ucs2_bin (Anyway, it doesn't work even if we fix as the follows... another BUG?)
eucjpms_japanese_ci

These collations use my_like_range_simple() method.
This my_like_range_simple() seems to intend to count characters, but counts bytes...
So conditions of LIKE clause are mis-interpreted by unintended termination.

I think thet we should fix as the either of following.

(a) makes my_like_range_simple() intend to count bytes.

--(ex.)----------------------------------------------------
strings/ctype-simple.c
@@ -1043,7 +1043,7 @@
   const char *end= ptr + ptr_length;
   char *min_org=min_str;
   char *min_end=min_str+res_length;
-  uint charlen= res_length / cs->mbmaxlen;
+  uint charlen= res_length;
 
   for (; ptr != end && min_str != min_end && charlen > 0 ; ptr++, charlen--)
   {
-----------------------------------------------------------

(b) use my_like_range_mb() in the such collations.

Regards,
[31 Jan 2008 11:53] Alexander Barkov
A simplified test demonstrating the same problem:

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  a int(6) PRIMARY KEY,
  b varchar(20),
  c varchar(8),
  d varchar(8),
  KEY (b)
) ENGINE = innodb COLLATE ujis_bin;

INSERT INTO t1 (a, b) VALUES (1, repeat(0xA4A2,5));
INSERT INTO t1 (a, b) VALUES (2, repeat(0xA4A2,10));
INSERT INTO t1 (a, b) VALUES (3, repeat(0xA4A2,11));
INSERT INTO t1 (a, b) VALUES (4, repeat(0xA4A2,12));
SELECT a, hex(b),c FROM t1 WHERE b LIKE concat(repeat(0xA4A2,10), '%');

Verified with collations:
big5_bin
eucjpms_bin
euckr_bin
gb2312_bin
gbk_bin
ujis_bin
eucjpms_japanese_ci
euckr_korean_ci
gb2312_chinese_ci
ujis_japanese_ci

TODO:
check with ucs2
[31 Jan 2008 11:54] Alexander Barkov
The same problem repeats with ENGINE=myisam.
[31 Jan 2008 12:02] Alexander Barkov
The same problem happens with cp932_bin and sjis_bin,
with character 0x8482.

Note, the character 0xA4A2 used in the previous post is not a multi-byte
character in sjis/cp932, so in order to repeat the same problem
with sjis/cp932 one can use the same test, however with different
two-byte sequence representing a single sjis/cp932 character.
For example, 0x8482.

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  a int(6) PRIMARY KEY,
  b varchar(20),
  c varchar(8),
  d varchar(8),
  KEY (b)
)  COLLATE cp932_bin;

INSERT INTO t1 (a, b) VALUES (1, repeat(0x8482,5));
INSERT INTO t1 (a, b) VALUES (2, repeat(0x8482,10));
INSERT INTO t1 (a, b) VALUES (3, repeat(0x8482,11));
INSERT INTO t1 (a, b) VALUES (4, repeat(0x8482,12));
SELECT a, hex(b),c FROM t1 WHERE b LIKE concat(repeat(0x8482,10), '%');
[31 Jan 2008 12:33] Alexander Barkov
Character 0xF1F2 is a two-byte character in all these character sets:

big5, eucjpms, euckr, gb2312, gbk, ujis, cp932, sjis

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (
  a int(6) PRIMARY KEY,
  b varchar(20),
  c varchar(8),
  d varchar(8),
  KEY (b)
)  COLLATE ujis_japanese_ci;

INSERT INTO t1 (a, b) VALUES (1, repeat(0xF1F2,5));
INSERT INTO t1 (a, b) VALUES (2, repeat(0xF1F2,10));
INSERT INTO t1 (a, b) VALUES (3, repeat(0xF1F2,11));
INSERT INTO t1 (a, b) VALUES (4, repeat(0xF1F2,12));
SELECT a, hex(b),c FROM t1 WHERE b LIKE concat(repeat(0xF1F2,10), '%');

With 0xF1F2, the same problem happens with:

- All _bin collations:

Collation    IsOk   RangeHandlerFuncName
big5_bin     No     my_like_range_mb
cp932_bin    No     my_like_range_mb
eucjpms_bin  No     my_like_range_mb
euckr_bin    No     my_like_range_mb
gb2312_bin   No     my_like_range_mb
gbk_bin      No     my_like_range_mb
sjis_bin     No     my_like_range_sjis
ujis_bin     No     my_like_range_mb

- Almost all case insensitive collations, as in this table:

Collation            IsOk    RangeHandlerFuncName
big5_chinese_ci      Yes     my_like_range_big5
cp932_japanese_ci    Yes     my_like_range_mb
eucjpms_japanese_ci  No      my_like_range_simple
euckr_korean_ci      No      my_like_range_simple
gb2312_chinese_ci    No      my_like_range_simple
gbk_chinese_ci       Yes     my_like_range_gbk
sjis_japanese_ci     Yes     my_like_range_mb
ujis_japanese_ci     No      my_like_range_simple

* "IsOk == Yes" means that SELECT works fine
* "IsOk == No"  means incorrect behavior (bug)
[31 Jan 2008 13:06] Alexander Barkov
More information for _bin collations:

This bug can be repeated with mysql-5.0.22 with these collations:

big5_bin
cp932_bin
eucjpms_bin
euckr_bin
gb2312_bin
gbk_bin
sjis_bin
ujis_bin

With mysql-5.0.52 all the above _bin collations work fine!
[31 Jan 2008 13:09] Alexander Barkov
For _ci collations, mysql-5.0.22 and mysql-5.0.52 work similar:

Collation            IsOk    RangeHandlerFuncName
big5_chinese_ci      Yes     my_like_range_big5
cp932_japanese_ci    Yes     my_like_range_mb
eucjpms_japanese_ci  No      my_like_range_simple
euckr_korean_ci      No      my_like_range_simple
gb2312_chinese_ci    No      my_like_range_simple
gbk_chinese_ci       Yes     my_like_range_gbk
sjis_japanese_ci     Yes     my_like_range_mb
ujis_japanese_ci     No      my_like_range_simple
[4 Feb 2008 7:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41621

ChangeSet@1.2568, 2008-02-04 11:10:40+04:00, bar@mysql.com +26 -0
  Bug#32510 LIKE search fails with indexed 'eucjpms' and 'ujis' char column
  Problem: some collation handlers called incorrect version
  of my_like_range_xxx(), which led to wrong min_str and max_str,
  so like range optimizer threw away good records.
  Fix: changing the wrong handlers to call proper version of
  my_like_range_xxx().
[20 Feb 2008 12:10] Alexander Barkov
Pushed into 5.0.58-rpl
[25 Mar 2008 11:22] Bugs System
Pushed into 5.0.60
[25 Mar 2008 11:22] Bugs System
Pushed into 5.1.24-rc
[26 Mar 2008 19:00] Bugs System
Pushed into 6.0.5-alpha
[31 Mar 2008 16:46] Jon Stephens
Documented bugfix in the 5.0.60, 5.1.23-ndb-6.3.11, 5.1.24, and 6.0.5 changelogs as follows:

        Queries using LIKE on tables having indexed
        CHAR columns using either of the
        eucjpms or ujis character sets did
        not return correct results.