MySQL Bugs: #3928: regexp [[:>:]] and UTF-8

Bug #3928	regexp [[:>:]] and UTF-8
Submitted:	28 May 2004 18:03	Modified:	7 Jun 2004 9:52
Reporter:	Timothy Smith
Status:	Closed
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	4.1	OS:	Any (any)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
The manual states that REGEXP doesn't work for multi-byte character sets.

For UTF-8, however, it seems that the word-boundary tests ([[:<:]] and [[:>:]]) should still work if the strings are simply treated as octets.

The reasoning is this:

1) If the octets are all < 128, then it's just an ASCII string - no difference.

2) Any octets >= 128 can be treated as in-word characters.  That is, a single non-ASCII Unicode character is encoded as two or more octets, all of them >= 128.  None of those octets should confuse the word-boundary matching.
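
The two points above can be illustrated with a small sketch. This is not MySQL's regex code and not the attached Perl program; the helper names (is_word_byte, word_end_offsets) and the sample string are made up for illustration. It assumes a byte-wise matcher that counts ASCII letters, digits, underscore, and every octet >= 128 as in-word, and reports the offsets where [[:>:]] would be expected to match.

```python
def is_word_byte(b):
    """Hypothetical rule: ASCII alphanumerics, '_', and any octet >= 128 are in-word."""
    return b >= 0x80 or b == ord("_") or chr(b).isalnum()


def word_end_offsets(data):
    """Return byte offsets where a word ends, i.e. where [[:>:]] should match."""
    ends = []
    for i in range(1, len(data) + 1):
        prev_in_word = is_word_byte(data[i - 1])
        next_in_word = i < len(data) and is_word_byte(data[i])
        if prev_in_word and not next_in_word:
            ends.append(i)
    return ends


text = "naïve café"           # sample input, chosen only for this sketch
encoded = text.encode("utf-8")

# Point 2: every octet of a multi-byte UTF-8 sequence is >= 128.
multi_byte_octets = [
    b
    for ch in text
    if len(ch.encode("utf-8")) > 1
    for b in ch.encode("utf-8")
]
all_high = all(b >= 0x80 for b in multi_byte_octets)

# Word ends found on raw octets still fall on character boundaries.
ends = word_end_offsets(encoded)
words = [encoded[:ends[0]].decode("utf-8"), encoded[ends[0] + 1:ends[1]].decode("utf-8")]
print(all_high, ends, words)
```

One caveat of this rule: non-ASCII punctuation or spaces (for example a non-breaking space) would also count as in-word, since their octets are >= 128 too.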

So, this is a cross between a bug report and a feature request - it seems like it *should* work without any modifications, which is why I'm calling it a bug report.

How to repeat:
I am attaching a Perl program to demonstrate this.

Suggested fix:
I'm not sure; I looked into the code a bit, but didn't get past the regcomp function.  It seems to care about the character set - maybe simply setting the character set to the binary pseudo-character set if it's UTF-8 would solve the problem.

Perl program which demonstrates the problem (CGI so you can view things with a web browser)

Attachment: charset.pl (application/octet-stream, text), 1.77 KiB.

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html