MySQL Bugs: #61783: Documentation issue with REGEXP operator

Bug #61783	Documentation issue with REGEXP operator
Submitted:	7 Jul 2011 12:25	Modified:	29 Aug 2011 20:27
Reporter:	Peter Laursen (Basic Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	5.0+	OS:	Any
Assigned to:	Paul DuBois	CPU Architecture:	Any
Tags:	qc

Description:
http://dev.mysql.com/doc/refman/5.1/en/regexp.html

"REGEXP and RLIKE use the current character set when deciding the type of a character. The default is latin1 (cp1252 West European)." What is the current character set? There is a character_set_server, .._connection, .._database, .. client etc. as well as the (column-wise) character set used for storage. Which one applies? The term "current character set" does not make much sense from MySQL 4.1 IMHO.

Also I think this should be elaborated/clarified

"The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets." I have not had any issues with Danish accented characters ('æøåÆØÅ') and the REGEXP operator in neither latin1 or utf8.

How to repeat:
I have reported a similar issue before (I think 1-2 years ago I believe) where I complained about the use of the term "current character set" in documentation sosme place also dealing with string comparisons. I don't remember full details except that at the time a MySQL developer explained that before comparison both strings were cast to UTF8 internally.

Maybe same applies when using REGEXP today?

Suggested fix:
Clarify! If REGEXP "may produce unexpected results" it should be documented which ones and under what circumstances it may occur and how to handle it.

But it more looks to me that this is remains of 4.0 docs that went unchanged through till today (with the excpetion of the note about collation, what is an OK clarification!)

I agree that "current character set" should be properly defined.

As for notes about RLIKE etc, for me it does NOT work out of the box the same way as LIKE, for exmaple:

mysql> select 'æøåÆØÅ' like 'æ%' as a;
+---+
| a |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> select 'æøåÆØÅ' rlike 'æ%' as a;
+---+
| a |
+---+
| 0 |
+---+
1 row in set (0.00 sec)

mysql> select version();
+--------------+
| version()    |
+--------------+
| 5.5.15-debug |
+--------------+
1 row in set (0.00 sec)

mysql> show variables like 'char%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | utf8                                  |
| character_set_connection | utf8                                  |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | utf8                                  |
| character_set_server     | latin1                                |
| character_set_system     | utf8                                  |
| character_sets_dir       | /Users/openxs/dbs/5.5/share/charsets/ |
+--------------------------+---------------------------------------+
8 rows in set (0.01 sec)

So, manual still makes some sense in this area...

@Valeriy .. I don't think you can use wildcards in the operand after REGEXP/RLIKE. % works as a literal here and not a wildcard.

I get (with data stored as utf8);

SET NAMES utf8;

SELECT * FROM dansk ORDER BY id;
/*
    id  txt   
------  ------
     1  æøå
     2  ÆØÅ
*/

SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id;
/*
txt regexp 'Æ'
---------------
              0
              1
*/              
              
SELECT txt REGEXP 'æ' FROM dansk ORDER BY id;
/*
txt REGEXP 'æ'
---------------
              1
              0
*/

SET character_set_client = latin1;

SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id;
/*
txt regexp 'Æ'
---------------
              0
              1
*/  

SELECT txt REGEXP 'æ' FROM dansk ORDER BY id;
/*
txt REGEXP 'æ'
---------------
              1
              0
*/

SET NAMES latin1;

SELECT txt REGEXP 'Æ' FROM dansk ORDER BY id;
/*
txt regexp 'Æ'
---------------
              0
              0
*/  

SELECT txt REGEXP 'æ' FROM dansk ORDER BY id;
/*
txt REGEXP 'æ'
---------------
              0
              0
*/

Yes, my fault (to use %), sorry. And differences like these:

mysql> select 'Å' like 'A%' as a;
+---+
| a |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

mysql> select 'Å' rlike 'A.*' as a;
+---+
| a |
+---+
| 0 |
+---+
1 row in set (0.00 sec)

are related to the fact that collation rules are ignored.

In any case, manual (http://dev.mysql.com/doc/refman/5.1/en/regexp.html#operator_regexp) needs clarification :)

Now what is this?

-- set names utf8 or latin1 does not matter;
SELECT 'a' REGEXP 'A'; -- 1
SELECT 'æ' REGEXP 'Æ'; -- 0

Looks to me like there is more than documentation issues here. Documentation states that "REGEXP is not case sensitive, except when used with binary strings.".  But outside ASCII range it seem to be case sensitive.

A more elaborated test case: 

-- SET NAMES utf8|latin1 does not matter;

CREATE TABLE `regtest` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `col1` BINARY(5) DEFAULT NULL,
  `col2` CHAR(5) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `col3` CHAR(5) DEFAULT NULL,
  `col4` CHAR(5) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `col5` CHAR(5) CHARACTER SET latin1 DEFAULT NULL,
  PRIMARY KEY (`id`)
);
 
INSERT INTO regtest VALUES (1,'a','a','a','a','a');
INSERT INTO regtest VALUES (2,'æ','æ','æ','æ','æ');

SELECT -- I add result for each column as comment
  id,
  col1 REGEXP 'A', -- 0 
  col1 REGEXP 'Æ', -- 0
  col2 REGEXP 'A', -- 0
  col2 REGEXP 'Æ', -- 0
  col3 REGEXP 'A', -- 1
  col3 REGEXP 'Æ', -- 0
  col4 REGEXP 'A', -- 0
  col4 REGEXP 'Æ', -- 0
  col5 REGEXP 'A', -- 1
  col5 REGEXP 'Æ'  -- 0
FROM
  regtest;

I's simply say that if case insensivity cannot be assured in the complete Unicode range REGEXP should always be case sensitive.  But that is not a documentation issue!

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

Updated text:

"REGEXP and RLIKE use the character set and collations of the
arguments when deciding the type of a character and performing the
comparison. If the arguments have different character sets or
collations, coercibility rules apply as described in
http://dev.mysql.com/doc/refman/5.5/en/charset-collation-expressions.html.
"

Regarding:

"Clarify! If REGEXP "may produce unexpected results" it should be documented which ones
and under what circumstances it may occur and how to handle it."

We are not going to be more specific here. "May produce unexpected results" means "so don't use it for multibyte characters". We will not attempt to enumerate special cases or details.

Maybe I am missing something but I do not understand how a (solely) documentation update can fix this:

SELECT 'a' REGEXP 'A'; -- 1
SELECT 'æ' REGEXP 'Æ'; -- 0