Bug #112787 regexp/rlike function in routines extreme regression mysql 5.7 to mysql 8
Submitted: 21 Oct 2023 0:01 Modified: 5 Feb 18:07
Reporter: John Scott Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:8.0.34, 8.1.0 OS:Linux
Assigned to: CPU Architecture:Any
Tags: regexp rlike, regression

[21 Oct 2023 0:01] John Scott
Description:
On MySQL 5.7.38, this stored procedure runs in 3-5 seconds.
On MySQL 8.0.34, this stored procedure takes 27 minutes.

This stored procedure (below) is simply using regexp and rlike to extract alpha characters from UUIDs.  I provided this as the example to avoid having to load a dataset to reproduce.  A similar technique was being used by a customer of mine to extract alpha characters as a part of a data masking function.  It is being refactored before moving to MySQL 8 to eliminate character-by-character evaluation.

Submitting this because the regression is so severe, and because super-inefficient use cases like this may not be the only use cases affected.

stored procedure:

----
DELIMITER $$
CREATE DEFINER=`user`@`%` PROCEDURE return_alpha_from_10k_uuids()
SQL SECURITY INVOKER
BEGIN
  DECLARE STRLEN int;
  DECLARE p1 int default 10000;
  DECLARE p2 int default 0;
  DECLARE retval varchar(50) default '';
  DECLARE myuuid char(50) default '';

  label1: WHILE p1 > 0 DO
    SET p1 = p1 - 1;
    select uuid() into myuuid;
    set retval='';
    set p2 = 0;
    label2: WHILE p2 < 36 DO
      set p2 = p2 + 1;
      set retval = concat(retval,if(substring(myuuid, p2, 1) REGEXP '[[:alpha:]]', substring(myuuid, p2, 1), ''));
    END WHILE label2;
    select retval;
  END WHILE label1;
END$$
DELIMITER ;
----

How to repeat:
Install the above stored procedure into a schema on a MySQL 5.7 version (or presumably an 8.0 version prior to 8.0.4 when the ICU regexp library began being used.

Run: 
mysql> call return_alpha_from_10k_uuids();

On 5.7 versions this returns in 3.5 seconds.
On 8.x versions this returns in several minutes (27 for me).
[21 Oct 2023 0:02] John Scott
changing title to be clearer
[23 Oct 2023 5:14] MySQL Verification Team
Hello John Scott,

Thank you for the report and test case.

regards,
Umesh
[23 Oct 2023 6:31] MySQL Verification Team
On a moderate/large instance, I observed :

#################### 5.7.43
mysql> call return_alpha_from_10k_uuids();
.
.
Query OK, 0 rows affected (5.06 sec)
Query OK, 0 rows affected (3.27 sec)
Query OK, 0 rows affected (3.41 sec)
Query OK, 0 rows affected (3.19 sec)
Query OK, 0 rows affected (3.16 sec)

#################### 8.0.34
mysql> call return_alpha_from_10k_uuids();
.

Query OK, 0 rows affected (1 min 57.89 sec)
Query OK, 0 rows affected (1 min 57.64 sec)
Query OK, 0 rows affected (1 min 57.75 sec)
Query OK, 0 rows affected (1 min 58.49 sec)
Query OK, 0 rows affected (1 min 58.42 sec)
Query OK, 0 rows affected (1 min 58.48 sec)

#################### 8.1.0
mysql> call return_alpha_from_10k_uuids();
.
Query OK, 0 rows affected (1 min 55.42 sec)
Query OK, 0 rows affected (1 min 55.49 sec)
Query OK, 0 rows affected (1 min 55.75 sec)
Query OK, 0 rows affected (1 min 55.82 sec)
Query OK, 0 rows affected (1 min 56.09 sec)
[23 Oct 2023 6:32] MySQL Verification Team
Test results - 8.0.34,  5.7.43, 8.1.0, 5.6.51

Attachment: 112787_5.6.51_5.7.43_8.0.34_8.1.0.results (application/octet-stream, text), 9.57 KiB.

[23 Oct 2023 16:54] John Scott
Additional information:

The regression seems limited to use cases where :alpha: and :digit: are being used instead of their regexp equivalents A-Za-z and 0-9, respectively.
[14 Nov 2023 9:19] Tor Didriksen
Posted by developer:
 
ICU regexp is documented here:
https://unicode-org.github.io/icu/userguide/strings/regexp.html
With some performance hints here:
https://unicode-org.github.io/icu/userguide/strings/regexp.html#performance-tips

This is what it says about Unicode properties:
"ICU recognizes all Unicode properties known to ICU, which is all of them."

The regexp matching engine seems to generate a *much* faster state machine for
[A-Za-z] than it does for [[:alpha:]]

Here are results from running the Linux 'perf' tool for the two cases.

 [A-Za-z]
   6,21%  connection       mysqld                                [.] icu_73::UnicodeSet::findCodePoint
   5,28%  swapper          [kernel.kallsyms]                     [k] 0xffffffff8c912188
   5,27%  connection       [kernel.kallsyms]                     [k] 0xffffffff8cf5c87b
   3,38%  connection       mysqld                                [.] icu_73::UnicodeSet::contains
   2,94%  connection       [kernel.kallsyms]                     [k] 0xffffffff8d00012d
   2,57%  connection       [kernel.kallsyms]                     [k] 0xffffffff8d00008f
   1,94%  connection       [kernel.kallsyms]                     [k] 0xffffffff8d000199
   1,92%  connection       [kernel.kallsyms]                     [k] 0xffffffff8d00001d
   1,74%  connection       mysqld                                [.] icu_73::RegexCompile::compile
   1,72%  connection       mysqld                                [.] pfs_end_statement_vc
   1,68%  connection       mysqld                                [.] ucase_addCaseClosure_73
   1,54%  connection       libc.so.6                             [.] malloc
   1,37%  connection       mysqld                                [.] sp_head::execute
   1,31%  connection       mysqld                                [.] icu_73::RegexCompile::matchStartType
   1,06%  connection       mysqld                                [.] pfs_start_statement_vc
   1,04%  connection       libc.so.6                             [.] _int_free
   1,00%  connection       mysqld                                [.] insert_events_statements_history_long
   0,94%  connection       [kernel.kallsyms]                     [k] 0xffffffff8c1881f2
   0,92%  connection       [vdso]                                [.] __vdso_clock_gettime
   0,82%  connection       mysqld                                [.] bind_fields
   0,79%  connection       libc.so.6                             [.] __memmove_avx_unaligned_erms
   0,79%  connection       [vdso]                                [.] 0x00000000000006b5
   0,77%  connection       mysqld                                [.] icu_73::UnicodeSet::add
   0,76%  connection       mysqld                                [.] sp_lex_instr::reset_lex_and_exec_core
   0,70%  connection       libc.so.6                             [.] cfree@GLIBC_2.2.5
   0,69%  connection       mysqld                                [.] icu_73::BMPSet::contains
   0,53%  connection       mysqld                                [.] pfs_start_stage_v1
   0,51%  connection       mysqld                                [.] icu_73::UnicodeSet::retain
###############
 [[:alpha:]]
  23,47%  connection       mysqld                                [.] icu_73::UnicodeSet::findCodePoint
  16,29%  connection       mysqld                                [.] ucase_addCaseClosure_73
   7,33%  connection       mysqld                                [.] icu_73::UnicodeSet::add
   5,55%  swapper          [kernel.kallsyms]                     [k] mwait_idle_with_hints.constprop.0
   5,38%  connection       libc.so.6                             [.] __memcmp_avx2_movbe
   5,05%  connection       mysqld                                [.] uhash_compareUnicodeString_73
   2,67%  connection       mysqld                                [.] icu_73::UVector::indexOf
   2,23%  connection       mysqld                                [.] icu_73::UnicodeString::doEquals
   1,80%  connection       libc.so.6                             [.] _int_malloc
   1,64%  connection       libc.so.6                             [.] malloc
   1,61%  connection       mysqld                                [.] icu_73::UnicodeSet::closeOverCaseInsensitive
   1,47%  connection       mysqld                                [.] icu_73::UnicodeString::doCompare
   1,36%  connection       libc.so.6                             [.] _int_free
   1,25%  connection       mysqld                                [.] icu_73::UnicodeSet::size
   1,24%  connection       mysqld                                [.] icu_73::UnicodeSet::add
   1,08%  connection       mysqld                                [.] icu_73::UVector::sortedInsert
   0,88%  connection       mysqld                                [.] icu_73::UnicodeSet::retain
   0,73%  connection       mysqld                                [.] icu_73::UnicodeSet::getRangeStart
   0,71%  connection       libc.so.6                             [.] malloc_consolidate
   0,62%  connection       libc.so.6                             [.] __memmove_avx_unaligned_erms
   0,61%  connection       mysqld                                [.] icu_73::UnicodeSet::getRangeEnd
   0,60%  connection       mysqld                                [.] icu_73::compareUnicodeString
   0,56%  connection       mysqld                                [.] icu_73::_set_add
[13 Dec 2023 10:45] MySQL Verification Team
HI Mr. Scott,

We have received a full reply from our Development team.

In short, this is not a bug in the code , but a change in how MySQL 8.0 deals with regexp and rlike in 8.0. This has not been documented in the Manual, so we are converting this verified bug as something that requires proper documentation.

Here is the explanation of why this is not a code bug:

There was a significant change in 8.0 regarding handling of the regular expressions, so all users  need to rewrite his regexp pattern.

The regexp matching engine is changed in 8.0, we use ICU.

Here are additional comments .....:
==================

The regexp matching engine seems to generate a *much* faster state machine for [A-Za-z] than it does  for [[:alpha:]]

This happoens because :alpha: is much more general, it matches any alphabetic character in  utf16_general_ci which is the charset used internally by ICU.

Hence, you should use ICU, which will be fully explained when the Reference Manual is updated. 

When the Manual is updated, you will be immediately informed, here in this bug report.

Thank you.
[5 Feb 18:07] Jon Stephens
Fixed in mysqldoc rev 77771, in 8.0 and later versions of the MySQL Manual.

Closed.