Bug #31999 REGEXP match too many string for accentued character between bracket
Submitted: 31 Oct 2007 17:38 Modified: 16 Nov 2007 11:33
Reporter: Séb Cabot Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.1.22-rc, 5.0, 4.1 BK OS:MacOS
Assigned to: CPU Architecture:Any
Tags: Characters Sets, REGEXP

[31 Oct 2007 17:38] Séb Cabot
Description:
When using REGEXP with an accentued character between bracket the expression match too many results.

For exemple REGEXP 'B[Ç]', match BÇ as expected, but also match all other accentued characters like BÉ, BÔ, Bô, BÀ and so on, which is not expected.

So, it's like a single accentued character between bracket in a reg. exp. is converted to all possibles accents.

 Ex. [Ç] become [çÇàÀéÉôÔÎÎëË...]

How to repeat:
Create a UT8 database with utf8_bin collation:

  CREATE DATABASE `Z` DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_bin;

Create a simple table with a VARCHAR field;

  CREATE TABLE `lexie` (`id` INT(5), `voc` VARCHAR(10),  PRIMARY KEY (`id`));

Fill the table with tests values:

+----+------+
| id | voc  |
+----+------+
|  1 | AÉ   | 
|  2 | BA   | 
|  3 | BE   | 
|  4 | BÉ   | 
|  5 | BÎ   | 
|  6 | BÖ   | 
|  7 | Bö   | 
+----+------+

Then, try some select.
SELECT * FROM LEXIE WHERE voc REGEXP 'B[Ç]';
+----+------+
| id | voc  |
+----+------+
|  4 | BÉ   | 
|  5 | BÎ   |      <==  No result should have been return here.
|  6 | BÖ   | 
|  7 | Bö   | 
+----+------+

SELECT * FROM LEXIE WHERE voc REGEXP 'B[É]';
+----+------+
| id | voc  |
+----+------+
|  4 | BÉ   |      <==  Only BÉ should have been return here.
|  5 | BÎ   | 
|  6 | BÖ   | 
|  7 | Bö   | 
+----+------+

SELECT * FROM LEXIE WHERE voc REGEXP 'B[AÇ]';
+----+------+
| id | voc  |
+----+------+
|  2 | BA   |      <==  Only BA should have been return here.
|  4 | BÉ   | 
|  5 | BÎ   | 
|  6 | BÖ   | 
|  7 | Bö   | 
+----+------+

Suggested fix:

REGEXP '[Ç]' should match a string containing at least one 'Ç' and nothing else.

The same apply for all the other accentued characters.
[13 Nov 2007 9:17] Sveta Smirnova
Thank you for the report.

Please provide output of SHOW VARIABLES LIKE 'char%' and SHOW VARIABLES LIKE 'coll%'
[15 Nov 2007 19:45] Séb Cabot
As requested, here the output of output of
SHOW VARIABLES LIKE 'char%' and
SHOW VARIABLES LIKE 'coll%'

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.02 sec)

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   | 
| collation_database   | utf8_bin          | 
| collation_server     | latin1_swedish_ci | 
+----------------------+-------------------+
3 rows in set (0.00 sec)
[16 Nov 2007 10:35] Sveta Smirnova
Thank you for the report.

Verified as described.
[16 Nov 2007 11:33] Sergei Golubchik
This is expected and documented here:
http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html
"Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
"

It will be ultimately fixed when we will replace the regex library in MySQL.