Bug #54576 REGEXP symbol dot count some characters like two characters
Submitted: 17 Jun 2010 8:05 Modified: 23 Aug 2010 19:18
Reporter: Teo Branton Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.0.51a OS:Linux (3ubuntu5.5 (Ubuntu))
Assigned to: CPU Architecture:Any

[17 Jun 2010 8:05] Teo Branton
Description:
Some characters, for example 'č' (u'\u010d', 'LATIN SMALL LETTER C WITH CARON') are not count as one character string:

mysql> SELECT 'č' REGEXP '^.$';
+-------------------+
| 'č' REGEXP '^.$' |
+-------------------+
|                 0 | 
+-------------------+
1 row in set (0.00 sec)

But correct answer is 1. Like this:

mysql> SELECT 'c' REGEXP '^.$';
+------------------+
| 'c' REGEXP '^.$' |
+------------------+
|                1 | 
+------------------+
1 row in set (0.00 sec)

How to repeat:
SELECT 'Ç' REGEXP '^.$';
SELECT 'ö' REGEXP '^.$';
SELECT 'è' REGEXP '^.$';
and so on...
[17 Jun 2010 8:29] Valeriy Kravchuk
Please, send the results of:

show variables like 'char%';

from your environment.

Our manual, http://dev.mysql.com/doc/refman/5.1/en/regexp.html, says:

"Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal."

and I wonder if this is the case here...
[21 Jun 2010 11:18] Teo Branton
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

...they are not multi-byte safe... Yes, I think There's the rub. But regexp is useless in this way :( Any idea how to do?
[21 Jun 2010 11:19] Teo Branton
mysql> SET NAMES utf8;
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

with same result...
[22 Jun 2010 7:47] Susanne Ebrecht
Which collation is used?

Show variables like '%col%';

In MySQL 5.0 utf8_general_ci included some characters that were counted as more then one byte. This is different in MySQL 5.1.
[1 Jul 2010 7:48] Teo Branton
mysql> SHOW VARIABLES LIKE '%col%';
+---------------------------+-------------------+
| Variable_name             | Value             |
+---------------------------+-------------------+
| collation_connection      | utf8_general_ci   | 
| collation_database        | latin1_swedish_ci | 
| collation_server          | latin1_swedish_ci | 
| protocol_version          | 10                | 
| slave_compressed_protocol | OFF               | 
+---------------------------+-------------------+
5 rows in set (0.00 sec)

SELECT 'Ç' COLLATE utf8_general_ci REGEXP '^.$';
SELECT 'Ç' COLLATE utf8_bin REGEXP '^.$';

No any positive effect. Still same wrong answer. I cannot test it in MySQL 5.1 now. Can anybody check if 5.1 works better?
[23 Aug 2010 19:18] Sveta Smirnova
This is duplicate of bug #30241