MySQL Bugs: #3957: Wildcard character of LIKE pattern-matching doesn't match zero characters.

Bug #3957	Wildcard character of LIKE pattern-matching doesn't match zero characters.
Submitted:	1 Jun 2004 20:14	Modified:	14 Sep 2004 13:12
Reporter:	Shuichi Tamagawa	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.1.2-alpha-Max-log	OS:	Linux (Linux)
Assigned to:	Heikki Tuuri	CPU Architecture:	Any

Description:
#This is a report from Mr. Masaaki Hirose of Japan MySQL User Group

In LIKE pattern-matching, the wildcard character '%' matches any number of characters, even zero characters (empty sequence), so LIKE '0xA4A2%' should match '0xA4A2'. But it doesn't if the column is indexed and character set of the column is ujis (or utf8).

*Replace 0xA4A2 with the ujis character of the code point

How to repeat:
Run the following script (Modify the user name accordingly). 
Please refer to the attached file for result with the problem.

-----------8<-----------8<-----------8<-----------8<-----------8<-----------
#! /bin/sh
# -*- coding: euc-jp-unix; -*-

mysqlcmd='mysql -uhirose31'

client_cs=ujis
table_cs=ujis
# client_cs=utf8
# table_cs=utf8
# client_cs=latin1
# table_cs=latin1

cat <<EOSQL | ${mysqlcmd} --default-character-set=${client_cs}
use test;
select '* create table' as '';
drop table if exists t_like;
create table t_like (
  c1 varchar(8)
) default character set '${table_cs}';

select '* show default character set' as '';
show create table t_like;
select '';
show create database test;
select '';
show variables like 'character\_set%';

insert into t_like values (0xA4A2),(0xA4A2A4A4),(0xA4ABA4A2),(0xA4A2A4A6);
insert into t_like values ('a'),('abc'),('bca'),('ade');

select c1 as 'no index #1' from t_like where c1 like cast(concat(0xA4A2,'%') as char character set ujis);
select c1 as 'no index #2' from t_like where c1 like 'a%';

create index idx_c1 on t_like (c1);
select c1 as 'using index #1' from t_like where c1 like cast(concat(0xA4A2,'%') as char character set ujis);
select c1 as 'using index #2' from t_like where c1 like 'a%';
EOSQL

# Japanese Hiragana  ujis code (in hex)
# =====================================
# a                  A4A2
# i                  A4A4
# u                  A4A6
# ka                 A4AB
-----------8<-----------8<-----------8<-----------8<-----------8<-----------

Test result

Attachment: like-bug.out (application/octet-stream, text), 586 bytes.

I was able to verify this as an innodb specific problem.  The verification was done using 4.1.2 on Red 
Hat 9 2.4.21 #12 SMP.

root@localhost:bugs~> select t, hex(t) from bug3957 where t like cast(concat(0xA4A2,'%') as char 
character set ujis);
+------+----------+
| t    | hex(t)   |
+------+----------+
| Â¤   | A4A2     |
| Â¤Â¤ | A4A2A4A4 |
| Â¤Â¤ | A4A2A4A6 |
+------+----------+
3 rows in set (0.00 sec)

root@localhost:bugs~> alter table bug3957 type=innodb;
Query OK, 10 rows affected, 1 warning (0.14 sec)
Records: 10  Duplicates: 0  Warnings: 0

root@localhost:bugs~> select t, hex(t) from bug3957 where t like cast(concat(0xA4A2,'%') as char 
character set ujis);
+------+----------+
| t    | hex(t)   |
+------+----------+
| Â¤Â¤ | A4A2A4A4 |
| Â¤Â¤ | A4A2A4A6 |
+------+----------+
2 rows in set (0.00 sec)

root@localhost:bugs~>

Me too!

create table AAA (
      A_ID VARCHAR(5) NOT NULL,
      PRIMARY KEY (A_ID)
)TYPE=INNODB  DEFAULT CHARSET=gb2312      ;
INSERT INTO AAA VALUES ('ABC');
1.  SELECT * FROM AAA WHERE A_ID='ABC'
    -------------------------
    A_ID
    -------
    ABC
    ------------------------

2.  SELECT * FROM AAA WHERE A_ID LIEK 'ABC%'
    ---------------------------
    EMPTY

Right, this happens only with Innobase for non-latin1 character set.
Latin1 works fine even in Innodb. I guess this problem appeared due
to 4.1.2 change in string comparison.

I also found the following:

create table t1 (a char(10)  character set utf8 not null, key (a)) engine=innodb;
insert into t1 values ('b');
select * from t1 where a like _utf8'b%';
drop table t1;

This script doesn't return rows.

If I set a breakpoint in ha_innodb.cc:2822, i.e.
after row_sel_convert_mysql_key_to_innobase() call,
and type:

set key_buf[0]='a'

then this query returns the row:

select * from t1 where a like _utf8'b%';

Hi!

The reason apparently is this. When we insert the string 'b' to the table, MySQL pads it with spaces (0x20) at the end. When MySQL does the LIKE 'b%' search, it pads the end of the search string with null bytes (0x00).

InnoDB contains the following code where it removes spaces from non-latin1 strings to be compared:

"
        case DATA_VARMYSQL:
        case DATA_MYSQL:
        case DATA_BLOB:
                if (data_type == DATA_BLOB
                    && 0 != (type->prtype & DATA_BINARY_TYPE)) {

                        ut_print_timestamp(stderr);
                        fprintf(stderr,
"  InnoDB: Error: comparing a binary BLOB with a character set sensitive\n"
"InnoDB: comparison!\n");
                }

                /* MySQL does not pad the ends of strings with spaces in a
                comparison. That would cause a foreign key check to fail for
                non-latin1 character sets if we have different length columns.
                To prevent that we remove trailing spaces here before doing
                the comparison. NOTE that if we in the future map more MySQL
                types to DATA_MYSQL or DATA_VARMYSQL, we have to change this
                code. */

                while (a_length > 0 && a[a_length - 1] == ' ') {
                      a_length--;
                }

                while (b_length > 0 && b[b_length - 1] == ' ') {
                      b_length--;
                }

                return(innobase_mysql_cmp(
                                (int)(type->prtype & DATA_MYSQL_TYPE_MASK),
                                (uint)dtype_get_charset_coll(type->prtype),
                                a, a_length, b, b_length));
"

The question is how we can fix this without breaking FOREIGN KEY constraint handling?

I have to study this more.

Regards,

Heikki

It still have the problem if the strage engine is bdb.

mysql> select version();
+----------------+
| version()      |
+----------------+
| 4.1.3-beta-log |
+----------------+
1 row in set (0.01 sec)

mysql> create table t_like(c char(3), index(c)) engine = bdb;
Query OK, 0 rows affected (1.67 sec)

mysql> insert into t_like values ('aaa'),('a'),('abc'),('bcd');
Query OK, 4 rows affected (0.62 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from t_like where c like 'a%';
+------+
| c    |
+------+
| aaa  |
| abc  |
+------+
2 rows in set (0.01 sec)

mysql> alter table t_like drop index c;
Query OK, 4 rows affected (1.90 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from t_like where c like 'a%';
+------+
| c    |
+------+
| aaa  |
| a    |
| abc  |
+------+
3 rows in set (0.58 sec)

Hi!

This was fixed in 4.1.3. InnoDB started then to use a different comparison function.

ha_innodb.cc:

"
                /* Starting from 4.1.3, we use strnncollsp() in comparisons of
                non-latin1_swedish_ci strings. NOTE that the collation order
                changes then: 'b\0\0...' is ordered BEFORE 'b  ...'. Users
                having indexes on such data need to rebuild their tables! */

                ret = charset->coll->strnncollsp(charset,
                                  a, a_length,
                                  b, b_length);
"

Note also:
"
Important: Starting from MySQL 4.1.3, InnoDB uses the same character set comparison functions as MySQL for non-latin1_swedish_ci character strings that are not BINARY. This changes the sorting order of space and characters < ASCII(32) in those character sets. For latin1_swedish_ci character strings and BINARY strings, InnoDB uses its own pad-spaces-at-end comparison method, which stays unchanged. If you have an InnoDB table created with MySQL 4.1.2 or earlier, with an index on a non-latin1 character set (in the case of 4.1.0 and 4.1.1 with any character set) CHAR/VARCHAR/or TEXT column that is not BINARY but may contain characters < ASCII(32), then you should do ALTER TABLE or OPTIMIZE table on it to regenerate the index, after upgrading to MySQL 4.1.3 or later. 
"

A patch by Jan Lindström will fix similar LIKE 'a%' problems with UTF-8 in connection with column prefix indexes.

Regards,

Heikki