Bug #112787 | regexp/rlike function in routines extreme regression mysql 5.7 to mysql 8 | ||
---|---|---|---|
Submitted: | 21 Oct 2023 0:01 | Modified: | 5 Feb 2024 18:07 |
Reporter: | John Scott | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Documentation | Severity: | S3 (Non-critical) |
Version: | 8.0.34, 8.1.0 | OS: | Linux |
Assigned to: | CPU Architecture: | Any | |
Tags: | regexp rlike, regression |
[21 Oct 2023 0:01]
John Scott
[21 Oct 2023 0:02]
John Scott
changing title to be clearer
[23 Oct 2023 5:14]
MySQL Verification Team
Hello John Scott, Thank you for the report and test case. regards, Umesh
[23 Oct 2023 6:31]
MySQL Verification Team
On a moderate/large instance, I observed : #################### 5.7.43 mysql> call return_alpha_from_10k_uuids(); . . Query OK, 0 rows affected (5.06 sec) Query OK, 0 rows affected (3.27 sec) Query OK, 0 rows affected (3.41 sec) Query OK, 0 rows affected (3.19 sec) Query OK, 0 rows affected (3.16 sec) #################### 8.0.34 mysql> call return_alpha_from_10k_uuids(); . Query OK, 0 rows affected (1 min 57.89 sec) Query OK, 0 rows affected (1 min 57.64 sec) Query OK, 0 rows affected (1 min 57.75 sec) Query OK, 0 rows affected (1 min 58.49 sec) Query OK, 0 rows affected (1 min 58.42 sec) Query OK, 0 rows affected (1 min 58.48 sec) #################### 8.1.0 mysql> call return_alpha_from_10k_uuids(); . Query OK, 0 rows affected (1 min 55.42 sec) Query OK, 0 rows affected (1 min 55.49 sec) Query OK, 0 rows affected (1 min 55.75 sec) Query OK, 0 rows affected (1 min 55.82 sec) Query OK, 0 rows affected (1 min 56.09 sec)
[23 Oct 2023 6:32]
MySQL Verification Team
Test results - 8.0.34, 5.7.43, 8.1.0, 5.6.51
Attachment: 112787_5.6.51_5.7.43_8.0.34_8.1.0.results (application/octet-stream, text), 9.57 KiB.
[23 Oct 2023 16:54]
John Scott
Additional information: The regression seems limited to use cases where :alpha: and :digit: are being used instead of their regexp equivalents A-Za-z and 0-9, respectively.
[14 Nov 2023 9:19]
Tor Didriksen
Posted by developer: ICU regexp is documented here: https://unicode-org.github.io/icu/userguide/strings/regexp.html With some performance hints here: https://unicode-org.github.io/icu/userguide/strings/regexp.html#performance-tips This is what it says about Unicode properties: "ICU recognizes all Unicode properties known to ICU, which is all of them." The regexp matching engine seems to generate a *much* faster state machine for [A-Za-z] than it does for [[:alpha:]] Here are results from running the Linux 'perf' tool for the two cases. [A-Za-z] 6,21% connection mysqld [.] icu_73::UnicodeSet::findCodePoint 5,28% swapper [kernel.kallsyms] [k] 0xffffffff8c912188 5,27% connection [kernel.kallsyms] [k] 0xffffffff8cf5c87b 3,38% connection mysqld [.] icu_73::UnicodeSet::contains 2,94% connection [kernel.kallsyms] [k] 0xffffffff8d00012d 2,57% connection [kernel.kallsyms] [k] 0xffffffff8d00008f 1,94% connection [kernel.kallsyms] [k] 0xffffffff8d000199 1,92% connection [kernel.kallsyms] [k] 0xffffffff8d00001d 1,74% connection mysqld [.] icu_73::RegexCompile::compile 1,72% connection mysqld [.] pfs_end_statement_vc 1,68% connection mysqld [.] ucase_addCaseClosure_73 1,54% connection libc.so.6 [.] malloc 1,37% connection mysqld [.] sp_head::execute 1,31% connection mysqld [.] icu_73::RegexCompile::matchStartType 1,06% connection mysqld [.] pfs_start_statement_vc 1,04% connection libc.so.6 [.] _int_free 1,00% connection mysqld [.] insert_events_statements_history_long 0,94% connection [kernel.kallsyms] [k] 0xffffffff8c1881f2 0,92% connection [vdso] [.] __vdso_clock_gettime 0,82% connection mysqld [.] bind_fields 0,79% connection libc.so.6 [.] __memmove_avx_unaligned_erms 0,79% connection [vdso] [.] 0x00000000000006b5 0,77% connection mysqld [.] icu_73::UnicodeSet::add 0,76% connection mysqld [.] sp_lex_instr::reset_lex_and_exec_core 0,70% connection libc.so.6 [.] cfree@GLIBC_2.2.5 0,69% connection mysqld [.] icu_73::BMPSet::contains 0,53% connection mysqld [.] pfs_start_stage_v1 0,51% connection mysqld [.] icu_73::UnicodeSet::retain ############### [[:alpha:]] 23,47% connection mysqld [.] icu_73::UnicodeSet::findCodePoint 16,29% connection mysqld [.] ucase_addCaseClosure_73 7,33% connection mysqld [.] icu_73::UnicodeSet::add 5,55% swapper [kernel.kallsyms] [k] mwait_idle_with_hints.constprop.0 5,38% connection libc.so.6 [.] __memcmp_avx2_movbe 5,05% connection mysqld [.] uhash_compareUnicodeString_73 2,67% connection mysqld [.] icu_73::UVector::indexOf 2,23% connection mysqld [.] icu_73::UnicodeString::doEquals 1,80% connection libc.so.6 [.] _int_malloc 1,64% connection libc.so.6 [.] malloc 1,61% connection mysqld [.] icu_73::UnicodeSet::closeOverCaseInsensitive 1,47% connection mysqld [.] icu_73::UnicodeString::doCompare 1,36% connection libc.so.6 [.] _int_free 1,25% connection mysqld [.] icu_73::UnicodeSet::size 1,24% connection mysqld [.] icu_73::UnicodeSet::add 1,08% connection mysqld [.] icu_73::UVector::sortedInsert 0,88% connection mysqld [.] icu_73::UnicodeSet::retain 0,73% connection mysqld [.] icu_73::UnicodeSet::getRangeStart 0,71% connection libc.so.6 [.] malloc_consolidate 0,62% connection libc.so.6 [.] __memmove_avx_unaligned_erms 0,61% connection mysqld [.] icu_73::UnicodeSet::getRangeEnd 0,60% connection mysqld [.] icu_73::compareUnicodeString 0,56% connection mysqld [.] icu_73::_set_add
[13 Dec 2023 10:45]
MySQL Verification Team
HI Mr. Scott, We have received a full reply from our Development team. In short, this is not a bug in the code , but a change in how MySQL 8.0 deals with regexp and rlike in 8.0. This has not been documented in the Manual, so we are converting this verified bug as something that requires proper documentation. Here is the explanation of why this is not a code bug: There was a significant change in 8.0 regarding handling of the regular expressions, so all users need to rewrite his regexp pattern. The regexp matching engine is changed in 8.0, we use ICU. Here are additional comments .....: ================== The regexp matching engine seems to generate a *much* faster state machine for [A-Za-z] than it does for [[:alpha:]] This happoens because :alpha: is much more general, it matches any alphabetic character in utf16_general_ci which is the charset used internally by ICU. Hence, you should use ICU, which will be fully explained when the Reference Manual is updated. When the Manual is updated, you will be immediately informed, here in this bug report. Thank you.
[5 Feb 2024 18:07]
Jon Stephens
Fixed in mysqldoc rev 77771, in 8.0 and later versions of the MySQL Manual. Closed.