Bug #118133 8.0.27 DDL performance degradation - Scan PK
Submitted: 8 May 8:40 Modified: 8 May 9:59
Reporter: Xizhe Zhang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DDL Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[8 May 8:40] Xizhe Zhang
Description:
In Bug #116738, we can see that the DDL Scan PK performance has dropped from 8.0.26 to 8.0.27. This is mainly due to three problems, which will be introduced below.

How to repeat:
See Bug #116738
[8 May 8:53] Xizhe Zhang
Problem 1

After fixing problems introduced in Bug #118130, The execution time of Merge Sort in 8.0.27 and 8.0.26 is unified, but there is still a significant difference in the execution time of Scan PK.

After another statistical analysis, it was found that the 'row_merge_buf_sort' function in 8.0.26 and the 'Key_sort_buffer::serialize' function in 8.0.27 were both called 258 times, indicating that 258 secondary index blocks were serialized. However, the 'Key_sort_buffer::sort' function in 8.0.27 was called 1575 times, far exceeding the number of 'Key_sort_buffer::serialize'; in contrast, the 'row_merge_buf_sort' function in 8.0.26 was only called 258 times:

8.0.26:
                                             mean            std                min                max      count                sum
row_merge_read_clustered_index 41382720376.000000            nan 41382720376.000000 41382720376.000000   1.000000 41382720376.000000
row_merge_buf_sort                49462751.410853 3693709.059359    21740823.000000    80207136.000000 258.000000 12761389864.000000
row_merge_buf_write                7441377.236434 3627104.992159     3059449.000000    25288639.000000 258.000000  1919875327.000000

8.0.27:
                                   mean             std                min                max       count                sum
ddl_scan             55336084466.000000             nan 55336084466.000000 55336084466.000000    1.000000 55336084466.000000
traverse               621370612.955056 54452729.661022   248581529.000000   752919344.000000   89.000000 55301984553.000000
traverse_recs          621359160.280899 54452767.352574   248569598.000000   752907768.000000   89.000000 55300965265.000000
key_buffer_sort         10760531.013968 16842892.023687      192286.000000    73626663.000000 1575.000000 16947836347.000000
key_buffer_serialize    10777398.511628  2450400.547623     4572400.000000    21308115.000000  258.000000  2780568816.000000

Why does 8.0.27 have so many sorts? After reading the code, I found that there is only one entry for the 'Key_sort_buffer::sort' function. The high number of calls is because 8.0.27 also sorts the PK that does not need to be sorted (also mentioned in Bug# 115499). After skipping 'key_buffer_sort' for the PK that does not need to be sorted, the total scan time is reduced to 50s, and the time (12.5s) and number (258) of sorting are aligned with 8.0.26:
                                 mean             std              min              max      count                sum
traverse_recs        562168393.044944 50876427.781493 225171406.000000 730896629.000000  89.000000 50032986981.000000
key_buffer_sort       48573790.631783  3800736.935213  20883825.000000  92327768.000000 258.000000 12532037983.000000
key_buffer_serialize  10410354.674419  2072421.397148   4344254.000000  23451838.000000 258.000000  2685871506.000000
[8 May 9:15] Xizhe Zhang
Hello, Verification Team, this is my patch based on 8.0.42 for Problem 1.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ddl_scan_pk_problem_1.diff (application/octet-stream, text), 1.50 KiB.

[8 May 9:29] Xizhe Zhang
Problem 2 Perf result

Attachment: 8.0.26.png (image/png, text), 226.63 KiB.

[8 May 9:30] Xizhe Zhang
Problem 2 Perf result

Attachment: 8.0.27.png (image/png, text), 319.95 KiB.

[8 May 9:31] Xizhe Zhang
Problem 2

After solving Problem 1, I conducted several statistical analyses and found that there was a significant difference in the time it took to build a B+ tree for PK (row_merge_insert_index_tuples vs insert_direct):

8.0.26:
                                             mean            std                min                max       count                sum
row_merge_read_clustered_index 41437766901.000000            nan 41437766901.000000 41437766901.000000    1.000000 41437766901.000000
row_merge_insert_index_tuples     10919227.215642 6169444.760858      598532.000000   116644576.000000 1317.000000 14380622243.000000
row_merge_buf_sort                50043518.081395 4001944.237946    21731955.000000    80215808.000000  258.000000 12911227665.000000
row_merge_buf_write                7018112.724806 1714566.844070     3112412.000000    28037227.000000  258.000000  1810673083.000000
row_merge_write                    2952340.263566  542853.877538     2327920.000000     6181485.000000  258.000000   761703788.000000

8.0.27:
                                   mean            std                min                max       count                sum
traverse_recs        49859621441.000000            nan 49859621441.000000 49859621441.000000    1.000000 49859621441.000000
insert_direct           14322413.707669 5810964.895471     1000409.000000   119392136.000000 1317.000000 18862618853.000000
key_buffer_sort         48072937.794574 2435081.179260    20931494.000000    63897465.000000  258.000000 12402817951.000000
key_buffer_serialize    10241637.790698 1945422.596186     4510307.000000    21974752.000000  258.000000  2642342550.000000

Using perf to show the CPU hotspot distribution of the two versions (see the two pictures uploaded earlier), I found that the 'insert_direct' function is hotter in 8.0.27. After checking the source code, I found that the original 'ut_ad(page_validate(m_page, m_index))' in 8.0.26 has become 'ut_a(page_validate(m_page, m_index))' in 8.0.27, which makes each page is validated in Release mode.
[8 May 9:34] Xizhe Zhang
Hello, Verification Team, this is my patch based on 8.0.42 for Problem 2.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ddl_scan_pk_problem_2.diff (application/octet-stream, text), 546 bytes.

[8 May 9:38] Xizhe Zhang
Problem 3 Perf result

Attachment: 8.0.26 problem 3.png (image/png, text), 145.83 KiB.

[8 May 9:38] Xizhe Zhang
Problem 3 Perf result

Attachment: 8.0.27 problem 3.png (image/png, text), 192.56 KiB.

[8 May 9:44] Xizhe Zhang
Problem 3

After fixing Problem 2, the time consumption of 'insert_direct' was reduced to 14.6s, which is consistent with 8.0.26, and the scanning time consumption was reduced to 45.34s. There is still a small gap between 8.0.27 and 8.0.26. By checking the Perf sampling results (see the two pictures uploaded earlier), it is found that 8.0.27 has two calls to 'rec_get_offsets', while 8.0.26 has only one.

After analyzing the code, it is found that the extra 'rec_get_offsets' call in 8.0.27 is redundant. The previous results can be used to delete the redundant 'rec_get_offsets' call, and the scanning time consumption is reduced to 44.1s. I think the remaining gap is caused by the inability to inline many functions, and the impact is relatively small.
[8 May 9:51] Xizhe Zhang
Hello, Verification Team, this is my patch based on 8.0.42 for Problem 3.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ddl_scan_pk_problem_3.diff (application/octet-stream, text), 643 bytes.

[8 May 9:59] MySQL Verification Team
Hello Xizhe Zhang ,

Thank you for the report and contribution.

regards,
Umesh