Bug #98305 | hp_rec_key_cmp suboptimal comparison | ||
---|---|---|---|
Submitted: | 21 Jan 2020 10:57 | Modified: | 29 Jan 2020 13:03 |
Reporter: | Georgiy Kirichenko | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Data Types | Severity: | S5 (Performance) |
Version: | 5.7, 8.0 | OS: | Linux |
Assigned to: | CPU Architecture: | Any |
[21 Jan 2020 10:57]
Georgiy Kirichenko
[21 Jan 2020 13:50]
MySQL Verification Team
Hi Mr. Kirichenko, Thank you for your performance enhancement request. 5.7 is in very late phase and all performance enhancement would go to 8.0. However, this function that you are writing about is already significantly changed in 8.0 and we are happy with its performance. Last, but not least, MEMORY storage engine is being replaced by a more efficient one. So far only for intrinsic tables, but any further development will go into that direction.
[28 Jan 2020 8:33]
Georgiy Kirichenko
Hi Mr. Milivojevic, Thank you for you response. Let me provide a more detailed explanation of the issue. I understand that MySQL 5.7 is in a very late phase, and unlikely to receive any performance fixes. However, MySQL 8.0 is also affected, even though not with default settings. As soon as the internal_tmp_mem_storage_engine variable is changed from its default value 'TempTable' to 'MEMORY' (which is a permitted and supported configuration change at the time of this writing), MySQL 8.0 will have the same performance issue I described originally. Additionally, a user may want to create an explicit MEMORY table with a hash index for whatever purposes in MySQL 8.0, which would still be a supported functionaly that might benefit from the optimization proposed here. On top of that, even the TempTable storage engine has absolutely the same performance issue. If you look at the Cell_calculator::compare() function in MySQL 8.0, you'll note that the entire strings to be compared are first fully scanned before they are actually compared. Which is clearly suboptimal, especially if the strings differ in the first few characters. Finally, let me explain the optimization with an example. Suppose we should compare two UTF-8 strings "a"*50 times and "b"*50 times. The current inefficient way to do it (regardless of the server version or the storage engine), due to the collation API, is evaluating the octet length of both strings using my_charpos() and then using strncollsp() and passing the previusly calculated octet lengths. Obviously, in this specific case, decoding both strings until their ends is inefficient because comparison will stop on the first character. One way to optimize that would be decoding and comparing multi-byte strings in some fixed-length chunks of 1 or more characters. A better approach would be extending the collation API with a function that implements comparison logic in both MEMORY and TempTable storage engines to avoid code duplication. With best regards, Georgy.
[28 Jan 2020 9:02]
Georgiy Kirichenko
Changing the status
[28 Jan 2020 13:46]
MySQL Verification Team
Hi Mr. Kirichenko, Thank you for your response .... Our string comparison code stops at the first character, which differs from the compared string. This is the same logic everywhere. Also, scanning is necessary in order to determine whether the string conforms fully to the character set standard that it belongs to. If you have a better idea on how to write all those string comparison functions, for each character set, you are welcome to send us a patch that will perform significantly better.
[28 Jan 2020 19:14]
Alexey Kopytov
Sinisa, my dear friend!!!11 Let me explain it in even more detail than my colleague Georgy. The snippet below is from the hp_rec_key_cmp() function in MySQL 8.0.19, storage/heap/hp_hash.cc: 375 if (seg->type == HA_KEYTYPE_TEXT) { ... 381 if (cs->mbmaxlen > 1 && (seg->flag & HA_PART_KEY_SEG)) { ... 383 char_length1 = my_charpos(cs, pos1, pos1 + seg->length, char_length); // <-- scans the entire first argument 384 char_length1 = std::min(char_length1, size_t(seg->length)); 385 char_length2 = my_charpos(cs, pos2, pos2 + seg->length, char_length); // <-- scans the entire second argument 386 char_length2 = std::min(char_length2, size_t(seg->length)); ... 403 if (cs->coll->strnncollsp(cs, pos1, char_length1, pos2, char_length2)) // <-- compares arguments character-by-character return 1; In fact, the above code path and comments hold for any multi-byte encoding. The only reason we do a full scan of both arguments before comparing them is that strnncollsp() API call expects lengths of arguments in characters, rather than octets. But converting octets to characters is an expensive operation for multi-byte strings, and is also a high price to pay if the arguments differ in the first few characters! TempTable code has absolutely the same problem in cell_calculator::compare(), storage/temptable/include/cell_calculator.h: 259 } else if (m_mode == Mode::CHARSET_AND_CHAR_LENGTH) { 260 lhs_length = std::min( 261 static_cast<size_t>(lhs_data_length), 262 my_charpos(m_cs, lhs_data, lhs_data + lhs_data_length, m_char_length)); 263 rhs_length = std::min( 264 static_cast<size_t>(rhs_data_length), 265 my_charpos(m_cs, rhs_data, rhs_data + rhs_data_length, m_char_length)); ... 280 return m_cs->coll->strnncollsp(m_cs, lhs_data, lhs_length, rhs_data, 281 rhs_length); There's certainly some room for optimization here, right? That's what this bug report is about. There may be multiple ways to optimize that. Georgy has outlined possible approaches in his previous comment. Even with the most straightforward approach we see about 15% speedup in CPU-bound sysbench OLTP_RO.
[29 Jan 2020 13:03]
MySQL Verification Team
My dear friend Kaamos, There is no code in this world that could not be optimised further. Next, we can't just do byte comparison, because you have some characters that are equal, although byte strings are different. Do I have to mention case and accent (and other) insensitive comparisons ???? We had added code in those functions, in order to avoid some pitfalls and bugs. We can not optimise code and cause some old bugs creeping in. However, as this is a very low priority request, I will verify it. This report would get more weight if there would be a patch that would increase performance, while not introducing any regression bugs.
[29 Jan 2020 13:46]
Alexey Kopytov
Dear Sinisa, Thank you for verifying this optimization request. We would love to contribute the patch, but code contributions are a more complicated then reporting a problem. I've been told even on the Oracle side it is sometimes easier to deal with just a problem description rather than a contributed patch. But yes, the patch passes MTR of course, so hopefully we are not reintroducing any old bugs.
[27 Jul 2020 11:31]
Georgy Kirichenko
Despite the fact that MEMORY engine is now deprecated for 8.0 there are still Field_string::cmp function which uses excessive mb_charpos function. So I would like to propose a patch which introduces two new functions for collation API and changes Field_string::cmp implementation as well as hp_hash.cc. I compared performance of two queries: select count(distinct c) from sbtest1 and select count(distinct pad) from sbtest1 using default sysbench data and have from 6x (for shorter pad column) up to 9.5x (for larger c) performance gain on my server: After mysql> select count(distinct c) from sbtest1; +-------------------+ | count(distinct c) | +-------------------+ | 1000000 | +-------------------+ 1 row in set (2.24 sec) Before: mysql> select count(distinct c) from sbtest1; +-------------------+ | count(distinct c) | +-------------------+ | 1000000 | +-------------------+ 1 row in set (21.44 sec)
[27 Jul 2020 11:31]
Georgy Kirichenko
Collation API change proposal
Attachment: 131.patch (application/octet-stream, text), 39.47 KiB.
[27 Jul 2020 12:48]
MySQL Verification Team
Hi Mr. Kirchenko, Thank you for your patch. Let me just ask one question. As far as I can see, this is a patch for 8.0. Please, confirm. Also, do all the tests from our MTR suite pass without problems ??? Last, but not least, we can not accept the patch, until you sign OCA agreement with us. Further information follow below.
[27 Jul 2020 12:49]
MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it! In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well. Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future. Please let me know, if you have any questions. Thank you for your interest in MySQL.
[30 Jul 2020 16:06]
Georgy Kirichenko
Thanks for your answer! I rebased the patch against latest 8.0 master and then MTR passed fine. Also I thought that I already have an OCA signed and have an allowance to submit code to Oracle. If you don't have my OCA signed I could sign it again in 2-3 days. Please found new version attached bellow.
[30 Jul 2020 16:07]
Georgy Kirichenko
Rebased patch
Attachment: 131.patch (application/octet-stream, text), 38.76 KiB.
[31 Jul 2020 12:18]
MySQL Verification Team
Hi Mr. Kirichenko, We would be grateful if you would sign the OCA agreement, while we shall check whether there is any such agreement with you or the company that you work for. Thanks again !!!
[3 Aug 2020 14:01]
MySQL Verification Team
Hi Georgiy, We would like, very much, to use your patch. However, we are not able to do that, so far ..... You either have to submit your patch, by using your company account, or you can obtain the OCA agreement by using your GMail account. Please, choose one of the two choices. Thanks in advance.
[3 Aug 2020 14:12]
Georgy Kirichenko
Reuploading the patch using company account
Attachment: 131.patch (application/octet-stream, text), 38.76 KiB.
[3 Aug 2020 14:18]
Georgy Kirichenko
I logged out and then logged in using my Huawei company account `georgy.kirichenko@huawei.com` and then uploaded the patch one time again. This instance of the patch is signed by `Georgy Kirichenko` instead of my `GeorgIy Kirichenko` gmail account which I used when created this issue. I hope this would be Ok, if not then I would ask someone of my team to assist me with submitting the patch.
[3 Aug 2020 14:27]
MySQL Verification Team
Thank you, Mr. Kirichenko, We shall enquire whether this will do the trick and we will let you know.
[3 Aug 2020 15:43]
MySQL Verification Team
Hi Mr. Kirichenko, These legal matters are complex for everybody. Anyway, we found out that you should submit it in the contribution tab *where you will confirm that its under OCA. Simple attachment, unfortunately, does not do the job. Thank you !!!!
[4 Aug 2020 12:41]
MySQL Verification Team
Thank you, Mr. Kirichenko.