Bug #21560 RLIKE resp. REGEXP throw ERROR 2013 when used with utf8
Submitted: 10 Aug 2006 9:04 Modified: 26 Sep 2006 11:46
Reporter: Ralf Wohner Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:3.23, 4.1, 5.0 OS:Windows (Win9x, XP, Linux)
Assigned to: CPU Architecture:Any
Tags: ERROR 2013, REGEXP, RLIKE, utf8

[10 Aug 2006 9:04] Ralf Wohner
Description:
I'll get an "ERROR 2013 (HY000): Lost connection to mySQL server during query when using RLIKE resp. REGEXP with unicode columns. My DB uses utf8_unicode_ci columns.

I found in the web that many, many people who don't want to use country specific code pages have this problem, obviously with mySQL versions from 3.23 to 5.1.

The reportet ERROR 2013 isn't true. Even when running the mysql command line client on the mySQL server this reply happens within milliseconds (so no timeout problem). Even with a table holding only one record.

How to repeat:
When I use 
SELECT * FROM table WHERE utf8field RLIKE _utf8'.*text.*';
the error appears.

Suggested fix:
Workaround: Convert the character sets first. Perhaps this doesn't help in all languages that use Unicode, but at least in some that are similar to latin1.

SELECT * FROM table WHERE CONVERT(utf8field USING latin1) RLIKE _latin1 '.*text.*';

As this query is even more complex than the original one, we see that the 2013 can't have happened because of a too complex query or server timeout.
[10 Aug 2006 11:42] MySQL Verification Team
Thank you for the bug report. Could you please provide a small test case
with create table and some insert commands as well as your setup regarding
character set fro server and client.

Thanks in advance.
[11 Aug 2006 7:42] Ralf Wohner
Here a script to create the test case. I think it will reproduce the error no matter what default charset/collation the client and the server use, as I pin it to utf8_unicode_ci. Currently I use mySQL 4.1.5-gamma, but as I said, I confirmed or read about this on various versions. Server and client use latin1 by default, I think this is the swedish factory setting still. I didn't want to rely on especially configured servers with my project, so my DBs, tables and columns are "forced" to use utf8_unicode_ci.

The script:

CREATE TABLE `testtable` (
  `id` int(11) NOT NULL auto_increment,
  `utf8column` varchar(64) collate utf8_unicode_ci default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

INSERT INTO testtable VALUES (null,'this is utf8.'),(null,'this, too. No special characters used, by the way.');

SELECT * FROM testtable WHERE utf8column LIKE '%too%'; -- Works

SELECT * FROM testtable WHERE utf8column REGEXP '.*too.*'; -- Doesn't. Throws the error 2013

SELECT * FROM testtable WHERE CONVERT(utf8column USING latin1) REGEXP _latin1 '.*too.*';
-- Works, but a test case for non-latin1-characters failed:
-- I added a 3rd row containing text with an german a-umlaut (see HTML-Entity auml)
-- Although the german a-umlaut in the column isn't changed when using
SELECT CONVERT(utf8column USING latin1) FROM testtable;
-- , the 3rd SELECT doesn't find anything when searching for it instead of 'too',
-- neither when looking for a simple 'a' (of course it finds record #2 then, as
-- there are some 'a's in there.)
-- Perhaps someone can recommend a better alternative to latin1.
[26 Aug 2006 11:46] Valeriy Kravchuk
Sorry, but I was not able to repeat the behaviour described:

openxs@suse:~/dbs/5.1> bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 5.0.25-debug

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> CREATE TABLE `testtable` (
    ->   `id` int(11) NOT NULL auto_increment,
    ->   `utf8column` varchar(64) collate utf8_unicode_ci default NULL,
    ->   PRIMARY KEY  (`id`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO testtable VALUES (null,'this is utf8.'),(null,'this, too. No
 special
    '> characters used, by the way.');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM testtable WHERE utf8column LIKE '%too%';
+----+----------------------------------------------------+
| id | utf8column                                         |
+----+----------------------------------------------------+
|  2 | this, too. No special
characters used, by the way. |
+----+----------------------------------------------------+
1 row in set (0.02 sec)

mysql> SELECT * FROM testtable WHERE utf8column REGEXP '.*too.*';
+----+----------------------------------------------------+
| id | utf8column                                         |
+----+----------------------------------------------------+
|  2 | this, too. No special
characters used, by the way. |
+----+----------------------------------------------------+
1 row in set (0.01 sec)

mysql> select version();
+--------------+
| version()    |
+--------------+
| 5.0.25-debug |
+--------------+
1 row in set (0.01 sec)

So, please, try to repeat with a newer version (5.0.24, 5.1.11 or 4.1.21), and inform about the results.
[26 Sep 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".