Bug #61783 | Documentation issue with REGEXP operator | ||
---|---|---|---|
Submitted: | 7 Jul 2011 12:25 | Modified: | 29 Aug 2011 20:27 |
Reporter: | Peter Laursen (Basic Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Documentation | Severity: | S3 (Non-critical) |
Version: | 5.0+ | OS: | Any |
Assigned to: | Paul DuBois | CPU Architecture: | Any |
Tags: | qc |
[7 Jul 2011 12:25]
Peter Laursen
[7 Jul 2011 15:25]
Valeriy Kravchuk
I agree that "current character set" should be properly defined. As for notes about RLIKE etc, for me it does NOT work out of the box the same way as LIKE, for exmaple: mysql> select 'æøåÆØÅ' like 'æ%' as a; +---+ | a | +---+ | 1 | +---+ 1 row in set (0.00 sec) mysql> select 'æøåÆØÅ' rlike 'æ%' as a; +---+ | a | +---+ | 0 | +---+ 1 row in set (0.00 sec) mysql> select version(); +--------------+ | version() | +--------------+ | 5.5.15-debug | +--------------+ 1 row in set (0.00 sec) mysql> show variables like 'char%'; +--------------------------+---------------------------------------+ | Variable_name | Value | +--------------------------+---------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /Users/openxs/dbs/5.5/share/charsets/ | +--------------------------+---------------------------------------+ 8 rows in set (0.01 sec) So, manual still makes some sense in this area...
[7 Jul 2011 15:44]
Peter Laursen
@Valeriy .. I don't think you can use wildcards in the operand after REGEXP/RLIKE. % works as a literal here and not a wildcard.
[7 Jul 2011 16:00]
Peter Laursen
I get (with data stored as utf8); SET NAMES utf8; SELECT * FROM dansk ORDER BY id; /* id txt ------ ------ 1 æøå 2 ÆØÅ */ SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id; /* txt regexp 'Æ' --------------- 0 1 */ SELECT txt REGEXP 'æ' FROM dansk ORDER BY id; /* txt REGEXP 'æ' --------------- 1 0 */ SET character_set_client = latin1; SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id; /* txt regexp 'Æ' --------------- 0 1 */ SELECT txt REGEXP 'æ' FROM dansk ORDER BY id; /* txt REGEXP 'æ' --------------- 1 0 */ SET NAMES latin1; SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id; /* txt regexp 'Æ' --------------- 0 0 */ SELECT txt REGEXP 'æ' FROM dansk ORDER BY id; /* txt REGEXP 'æ' --------------- 0 0 */
[7 Jul 2011 16:03]
Valeriy Kravchuk
Yes, my fault (to use %), sorry. And differences like these: mysql> select 'Å' like 'A%' as a; +---+ | a | +---+ | 1 | +---+ 1 row in set (0.00 sec) mysql> select 'Å' rlike 'A.*' as a; +---+ | a | +---+ | 0 | +---+ 1 row in set (0.00 sec) are related to the fact that collation rules are ignored. In any case, manual (http://dev.mysql.com/doc/refman/5.1/en/regexp.html#operator_regexp) needs clarification :)
[7 Jul 2011 17:19]
Peter Laursen
Now what is this? -- set names utf8 or latin1 does not matter; SELECT 'a' REGEXP 'A'; -- 1 SELECT 'æ' REGEXP 'Æ'; -- 0 Looks to me like there is more than documentation issues here. Documentation states that "REGEXP is not case sensitive, except when used with binary strings.". But outside ASCII range it seem to be case sensitive.
[7 Jul 2011 18:34]
Peter Laursen
A more elaborated test case: -- SET NAMES utf8|latin1 does not matter; CREATE TABLE `regtest` ( `id` INT(11) NOT NULL AUTO_INCREMENT, `col1` BINARY(5) DEFAULT NULL, `col2` CHAR(5) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL, `col3` CHAR(5) DEFAULT NULL, `col4` CHAR(5) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL, `col5` CHAR(5) CHARACTER SET latin1 DEFAULT NULL, PRIMARY KEY (`id`) ); INSERT INTO regtest VALUES (1,'a','a','a','a','a'); INSERT INTO regtest VALUES (2,'æ','æ','æ','æ','æ'); SELECT -- I add result for each column as comment id, col1 REGEXP 'A', -- 0 col1 REGEXP 'Æ', -- 0 col2 REGEXP 'A', -- 0 col2 REGEXP 'Æ', -- 0 col3 REGEXP 'A', -- 1 col3 REGEXP 'Æ', -- 0 col4 REGEXP 'A', -- 0 col4 REGEXP 'Æ', -- 0 col5 REGEXP 'A', -- 1 col5 REGEXP 'Æ' -- 0 FROM regtest; I's simply say that if case insensivity cannot be assured in the complete Unicode range REGEXP should always be case sensitive. But that is not a documentation issue!
[29 Aug 2011 19:28]
Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products. Updated text: "REGEXP and RLIKE use the character set and collations of the arguments when deciding the type of a character and performing the comparison. If the arguments have different character sets or collations, coercibility rules apply as described in http://dev.mysql.com/doc/refman/5.5/en/charset-collation-expressions.html. " Regarding: "Clarify! If REGEXP "may produce unexpected results" it should be documented which ones and under what circumstances it may occur and how to handle it." We are not going to be more specific here. "May produce unexpected results" means "so don't use it for multibyte characters". We will not attempt to enumerate special cases or details.
[29 Aug 2011 20:27]
Peter Laursen
Maybe I am missing something but I do not understand how a (solely) documentation update can fix this: SELECT 'a' REGEXP 'A'; -- 1 SELECT 'æ' REGEXP 'Æ'; -- 0