Bug #81027 RLIKE is not multibyte safe - i.e. it's broken and useless
Submitted: 10 Apr 2016 16:03 Modified: 11 Apr 2016 19:37
Reporter: teo teo
Status: Verified
Category: MySQL Server: Charsets Severity: S4 (Feature request)
Version: 5.7 OS: Any
Assigned to: CPU Architecture: Any

[10 Apr 2016 16:03] teo teo
Description:
This is even documented, but there's no way this is acceptable.

From the very documentation:

"""
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
"""
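
To make the quoted warning concrete, here is a minimal sketch in Python (not MySQL code) of what "byte-wise" means for a multibyte character. It assumes UTF-8 encoding and uses Python's `re` module purely as a stand-in for a byte-oriented regex engine; the variable names are illustrative only.

```python
import re

text = "\u00ed"                 # the single character "í" (U+00ED)
raw = text.encode("utf-8")      # the same character as UTF-8: two bytes, 0xC3 0xAD

# Character-wise matching: "." consumes the whole character.
char_match = re.fullmatch(r".", text) is not None

# Byte-wise matching: "." consumes one byte, so a single "." cannot cover "í".
byte_match = re.fullmatch(rb".", raw) is not None

# Compared byte by byte, "diaz" and "díaz" are simply different sequences.
same_bytes = "diaz".encode("utf-8") == "d\u00edaz".encode("utf-8")

print(len(text), len(raw))      # 1 2
print(char_match, byte_match)   # True False
print(same_bytes)               # False
```

The same mismatch between character count and byte count is what makes constructs such as `.`, bracket expressions and repetition counts unreliable on multibyte data in a byte-oriented engine.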

Then what the hell is it good for??

None of that is acceptable. 

By the way, that statement contradicts this one:
"""
REGEXP and RLIKE use the character set and collations of the arguments when deciding the type of a character and performing the comparison
"""
If they don't handle multibyte encoding correctly, it can't be stated that they "use the character set and collations of the arguments". In fact, they don't use any meaningful or consistent character set and collation.

How to repeat:
SELECT 'diaz' RLIKE 'díaz'

Expected result: 1
Observed result: 0
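
The expected result corresponds to an accent-insensitive comparison, as an accent-insensitive collation would give for plain equality. As a rough sketch of that behavior, assuming Unicode NFD decomposition is an acceptable approximation of such a collation (it is not how MySQL implements collations), both sides can be folded before matching. The helpers `fold_accents` and `accent_insensitive_search` are hypothetical names introduced for this example.

```python
import re
import unicodedata


def fold_accents(s: str) -> str:
    """Decompose to NFD and drop combining marks, so 'í' becomes 'i'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_search(pattern: str, subject: str) -> bool:
    # Fold both the pattern and the subject, then match on characters,
    # ignoring case as a case-insensitive collation would.
    folded_pattern = fold_accents(pattern)
    folded_subject = fold_accents(subject)
    return re.search(folded_pattern, folded_subject, re.IGNORECASE) is not None


# Mirrors the report: subject 'diaz', pattern 'díaz'.
print(accent_insensitive_search("d\u00edaz", "diaz"))   # True
```

Folding only approximates a collation: it handles characters that decompose into a base letter plus combining marks, but not language-specific rules.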

Suggested fix:
Look at how any other open source software does multibyte-safe regular expressions and copy from that.

It's unbelievable that in 2016 we still have to worry about stuff not dealing with utf-8 character encoding seamlessly.

[11 Apr 2016 14:10] MySQL Verification Team
Hi Teo,

Those two functions are designed for single-byte character sets only.

However, having similar (or same) functions that cover multi-byte character sets is a totally valid feature request.

Verified as a feature request.

[11 Apr 2016 19:37] teo teo
> Those two functions are designed for single-byte character sets only.

Then they are wrongly designed. "For single-byte character sets only" means "useless".

When something is "by design" but the design is stupid, it's a bug.

> However, having similar (or same) functions that cover multi-byte character sets is a totally valid feature request.

One that apparently you have been "wanting" to implement for almost nine years:
http://bugs.mysql.com/bug.php?id=30241 (see comment [7 Aug 2007 21:31])