Bug #54576 | REGEXP symbol dot count some characters like two characters | ||
---|---|---|---|
Submitted: | 17 Jun 2010 8:05 | Modified: | 23 Aug 2010 19:18 |
Reporter: | Teo Branton | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: FULLTEXT search | Severity: | S3 (Non-critical) |
Version: | 5.0.51a | OS: | Linux (3ubuntu5.5 (Ubuntu)) |
Assigned to: | CPU Architecture: | Any |
[17 Jun 2010 8:05]
Teo Branton
[17 Jun 2010 8:29]
Valeriy Kravchuk
Please, send the results of: show variables like 'char%'; from your environment. Our manual, http://dev.mysql.com/doc/refman/5.1/en/regexp.html, says: "Warning The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal." and I wonder if this is the case here...
[21 Jun 2010 11:18]
Teo Branton
mysql> SHOW VARIABLES LIKE 'char%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec) ...they are not multi-byte safe... Yes, I think There's the rub. But regexp is useless in this way :( Any idea how to do?
[21 Jun 2010 11:19]
Teo Branton
mysql> SET NAMES utf8; mysql> SHOW VARIABLES LIKE 'char%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec) with same result...
[22 Jun 2010 7:47]
Susanne Ebrecht
Which collation is used? Show variables like '%col%'; In MySQL 5.0 utf8_general_ci included some characters that were counted as more then one byte. This is different in MySQL 5.1.
[1 Jul 2010 7:48]
Teo Branton
mysql> SHOW VARIABLES LIKE '%col%'; +---------------------------+-------------------+ | Variable_name | Value | +---------------------------+-------------------+ | collation_connection | utf8_general_ci | | collation_database | latin1_swedish_ci | | collation_server | latin1_swedish_ci | | protocol_version | 10 | | slave_compressed_protocol | OFF | +---------------------------+-------------------+ 5 rows in set (0.00 sec) SELECT 'Ç' COLLATE utf8_general_ci REGEXP '^.$'; SELECT 'Ç' COLLATE utf8_bin REGEXP '^.$'; No any positive effect. Still same wrong answer. I cannot test it in MySQL 5.1 now. Can anybody check if 5.1 works better?
[23 Aug 2010 19:18]
Sveta Smirnova
This is duplicate of bug #30241