Bug #3928 regexp [[:>:]] and UTF-8
Submitted: 28 May 2004 18:03 Modified: 7 Jun 2004 9:52
Reporter: Timothy Smith
Status: Closed
Category: MySQL Server Severity: S3 (Non-critical)
Version: 4.1 OS: Any (any)
Assigned to: Alexander Barkov CPU Architecture: Any

[28 May 2004 18:03] Timothy Smith
Description:
The manual states that REGEXP doesn't work for multi-byte character sets.

But, for UTF-8, it seems that (if the strings are just treated as octets) the word-boundary tests should still work.

The reasoning is this:

1) If the octets are all < 128, then it's just an ASCII string - no difference.

2) Any octets >= 128 can be treated like in-word characters.  In UTF-8, a non-ASCII Unicode character is encoded as two (or more) octets, all of them >= 128, so none of those octets should confuse the word-boundary matching.
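
The two points above can be sketched in Python. This is a minimal illustration of the octet-level argument, not MySQL's regex code: the helper names is_word_octet and word_end_offsets and the sample string are hypothetical, and the rule "every octet >= 128 is an in-word character" is the assumption made in this report. Under that assumption, non-ASCII punctuation or spaces would also count as in-word.

```python
def is_word_octet(b):
    """Assumed rule: ASCII alphanumerics, '_' and every octet >= 128 are in-word."""
    return b >= 128 or chr(b).isalnum() or b == ord("_")


def word_end_offsets(data):
    """Return the octet offsets at which a word ends (where [[:>:]] would match)."""
    ends = []
    for i in range(1, len(data) + 1):
        prev_is_word = is_word_octet(data[i - 1])
        next_is_word = i < len(data) and is_word_octet(data[i])
        if prev_is_word and not next_is_word:
            ends.append(i)
    return ends


text = "naïve café au lait"      # hypothetical sample string
data = text.encode("utf-8")      # 'ï' and 'é' each become two octets >= 128

# Every octet of a non-ASCII character is >= 128, so it can never look like
# an ASCII space or punctuation mark to an octet-based matcher.
assert all(b >= 128 for b in "ï".encode("utf-8"))

ends = word_end_offsets(data)
print(ends)  # [6, 12, 15, 20]

# Each boundary falls between complete characters: every prefix decodes cleanly.
print([data[:e].decode("utf-8") for e in ends])
```

No boundary lands inside a multi-byte sequence, which is why the octet-level test gives the same word ends as a character-level test for this string.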

So, this is a cross between a bug report and a feature request - it seems like it *should* work without any modifications, which is why I'm calling it a bug report.

How to repeat:
I am attaching a Perl program to demonstrate this.

Suggested fix:
I'm not sure; I looked into the code a bit, but didn't get past the regcomp function.  It seems to care about the character set - maybe simply setting the character set to the binary pseudo-character set if it's UTF-8 would solve the problem.
[28 May 2004 18:08] Timothy Smith
Perl program which demonstrates the problem (CGI so you can view things with a web browser)

Attachment: charset.pl (application/octet-stream, text), 1.77 KiB.

[7 Jun 2004 9:52] Alexander Barkov
Thank you for your bug report. A fix for this issue has been committed
to our source repository for that product and will be incorporated into
the next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html