Bug #99695 | Remove bloat caused by InnoDB logger class | ||
---|---|---|---|
Submitted: | 26 May 2020 13:38 | Modified: | 9 Jun 2020 12:06 |
Reporter: | Dmitriy Philimonov | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
Version: | 8.0 | OS: | Any |
Assigned to: | Jakub Lopuszanski | CPU Architecture: | Any |
Tags: | compiler |
[26 May 2020 13:38]
Dmitriy Philimonov
[26 May 2020 13:42]
Dmitriy Philimonov
Fix: innobase storage engine -> InnoDB storage engine
[2 Jun 2020 12:12]
MySQL Verification Team
Hello Mr. Philimonov,

We have made a patch and have run dozens of OLTP benchmarks on both patched and unpatched sources. The variance between the tests was so large that more than half of the runs with the unpatched source were faster than those with the source patched according to your ideas. In short, these benchmarks cannot be relied upon. Hence, we need some additional feedback from you:

1. Could you let us know how many benchmarks you have run on the original and the changed sources? Which benchmarks did you run? What was the average gain for the patched source?
2. Both links that you sent us on godbolt.org show that you used only ARM64. We do need benchmarks for x64, which is the CPU present in 99% of the machines running our server.
3. Your testing environment could be more powerful and thus allow for more queries per second, which might help to smooth out the variance due to the "law of large numbers". To put it the other way round: testing a smaller number of queries on a slower machine could cause more noise.
4. Please could you also rerun the test and provide all the numbers from both experiments that you have; at the minimum: what the architecture was, what the duration was, and what average TPS you got.

We cannot proceed further without this information.

Thanks in advance.
[2 Jun 2020 14:11]
Dmitriy Philimonov
Dear MySQL Verification Team,

1. We use sysbench (~40min per patch):
   * test data: 10 tables with 1M records each
   * workload: OLTP_PS/OLTP_RO/OLTP_RW
   * duration: 60s for each of 1/4/16.../1024 threads, for each workload

   We had 4 runs in total, all gave positive results.
2. Unfortunately, I'm not authorized to share absolute numbers with you, so the table shows only relative performance improvements from the patch.
3. The benchmarks were indeed performed on an ARM machine, which is an officially supported architecture for MySQL 8.0. On X86 machines there might be a similar or lower effect from these optimizations, but I did not have an opportunity to verify that. However, reduction in code size for infrequently executed code branches is a reasonable optimization that might lead to better cache locality and better performance even in new code that involves checks for error conditions and error logging.
4. As to the hardware configuration, it was a 128-core Kunpeng 920 machine. mysqld uses 64 cores from this configuration: https://e.huawei.com/en/products/servers/taishan-server/taishan-2280-v2

| test    | threads | cold_noinline_diff |
|:-------:|--------:|-------------------:|
| OLTP_PS |       1 |              2.59% |
| OLTP_PS |       4 |              3.98% |
| OLTP_PS |      16 |              4.38% |
| OLTP_PS |      24 |              3.48% |
| OLTP_PS |      32 |              4.25% |
| OLTP_PS |      48 |              3.28% |
| OLTP_PS |      64 |              2.82% |
| OLTP_PS |      96 |              3.37% |
| OLTP_PS |     128 |              5.41% |
| OLTP_PS |     256 |              5.00% |
| OLTP_PS |     512 |              4.83% |
| OLTP_PS |    1024 |              4.13% |
| OLTP_RO |       1 |              0.09% |
| OLTP_RO |       4 |              2.00% |
| OLTP_RO |      16 |              1.49% |
| OLTP_RO |      24 |              1.79% |
| OLTP_RO |      32 |              1.61% |
| OLTP_RO |      48 |              1.35% |
| OLTP_RO |      64 |              1.74% |
| OLTP_RO |      96 |              1.41% |
| OLTP_RO |     128 |              1.98% |
| OLTP_RO |     256 |              1.79% |
| OLTP_RO |     512 |              2.13% |
| OLTP_RO |    1024 |              1.64% |
| OLTP_RW |       1 |              7.14% |
| OLTP_RW |       4 |              0.91% |
| OLTP_RW |      16 |              2.23% |
| OLTP_RW |      24 |              1.68% |
| OLTP_RW |      32 |              1.92% |
| OLTP_RW |      48 |              1.74% |
| OLTP_RW |      64 |              0.42% |
| OLTP_RW |      96 |              0.73% |
| OLTP_RW |     128 |              0.62% |
| OLTP_RW |     256 |              1.17% |
| OLTP_RW |     512 |              1.11% |
| OLTP_RW |    1024 |              0.67% |

P.S. diff is calculated from TPS: (patched.tps-original.tps)/original.tps*100%

Sincerely yours,
Dmitriy Philimonov
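The diff formula from the P.S. can be written out as a small Python sketch. The function name and the TPS values below are hypothetical and serve only to illustrate the arithmetic; the absolute numbers behind the table were not shared.

```python
def relative_tps_diff(original_tps: float, patched_tps: float) -> float:
    """Relative TPS change of the patched build versus the original, in percent.

    Implements (patched.tps - original.tps) / original.tps * 100%.
    """
    if original_tps <= 0:
        raise ValueError("original_tps must be positive")
    return (patched_tps - original_tps) / original_tps * 100.0


if __name__ == "__main__":
    # HYPOTHETICAL TPS values, not taken from any run in this report.
    original, patched = 10000.0, 10259.0
    print(f"cold_noinline_diff = {relative_tps_diff(original, patched):.2f}%")
```

A positive value means the patched build was faster; a negative value means it was slower.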
[2 Jun 2020 14:13]
Dmitriy Philimonov
P.P.S. test data: 10 tables with 1M records each, fully cached in the buffer pool.
[2 Jun 2020 14:16]
MySQL Verification Team
Hi, I would like to inform you that we have run tests that are 99% identical to yours, only on an Intel platform. The results are totally inconclusive. Hence, can you provide the other data and run the additional tests that we asked for in our previous comment?
[3 Jun 2020 9:26]
Jakub Lopuszanski
Hi,

First of all, I'd like to say that I really appreciate the patch and the testing effort. It's just that we want to avoid pushing changes to code which *WE* cannot prove to have a performance impact. And this is why we are trying to replicate your results.

Yesterday, I took an ARM machine (ellex04, ARM64, 2S, 64cores, 4TH per core = 256 vcpu in total) and conducted {pareto,uniform}x{128,1024}users sysbench oltp-rw tests on it, running each version of the code 9 times for each of these 2x2=4 scenarios for 300 seconds with 60 seconds of warmup (--warmup-time=60 --time=300). Here are my results:

[mysql@ellex04 q-test-root]$ for u in 128 1024;do for d in uniform pareto;do echo $d $u; cat links/logs/$u-$d-univ_colder.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10-;done;done
uniform 128
9 20844 < 21041.56 < 21141 "mysql-trunk@43a86444"
9 20870 < 21006.78 < 21156 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
pareto 128
9 20775 < 21038.33 < 21183 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 20753 < 21012.22 < 21144 "mysql-trunk@43a86444"
uniform 1024
9 23324 < 23648.67 < 23888 "mysql-trunk@43a86444"
9 23247 < 23623.56 < 23973 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
pareto 1024
9 19607 < 19724.56 < 19830 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 19551 < 19667.89 < 19779 "mysql-trunk@43a86444"

Each line has the format:
[number of repetitions] [minimum for 9 runs of run's avg TPS] < [avg for 9 runs of run's avg TPS] < [maximum for 9 runs of run's avg TPS] [version name]

As you can see, overall there is not much difference in the observed TPS. And the binary is indeed smaller (by 1MB), so I can rule out the possibility that I was mistakenly comparing the same version to itself, etc.

Please note how large the [min,max] spread is. It reflects quite large noise and variance from run to run, which in turn might lead us to wrong conclusions if the sample is too small. For example, if instead of looking at all 9 runs I had focused only on the first 3, then it would look like for "uniform 1024" the trunk is faster by 0.6%. For the second three runs the patched version is faster by 1%. For the last triple the trunk is faster by 0.7%. Only by aggregating all 9 runs does it become obvious that there is no relevant difference. This is why I think it would be great if you could repeat the experiments and confirm that the results replicate.

Also, it is just a coincidence that I usually test oltp-rw for 128 and 1024 users, which unfortunately are among the cases with the smallest differences in your report. Sorry, this wasn't intentional. I'll run oltp-ro for 4 and 512 users today.
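The effect described above, where triples of runs point in different directions while the full set of 9 shows almost no difference, can be reproduced with a short Python sketch. The per-run TPS values are HYPOTHETICAL (the thread reports only min/avg/max per scenario), and `summarize` here is an illustrative stand-in, not the `summarize.sh` script used above.

```python
from statistics import mean


def summarize(runs):
    """Return (count, min, mean, max) of per-run average TPS values."""
    return len(runs), min(runs), mean(runs), max(runs)


def relative_diff(base, other):
    """Percent change of mean(other) relative to mean(base)."""
    return (mean(other) - mean(base)) / mean(base) * 100.0


# HYPOTHETICAL per-run average TPS, invented to illustrate run-to-run noise.
trunk = [23800, 23750, 23700, 23400, 23500, 23450, 23850, 23700, 23650]
patched = [23600, 23650, 23550, 23700, 23650, 23700, 23600, 23550, 23600]

if __name__ == "__main__":
    for label, runs in (("trunk", trunk), ("patched", patched)):
        n, lo, avg, hi = summarize(runs)
        print(f"{n} {lo} < {avg:.2f} < {hi} \"{label}\"")

    # Compare the two builds on each consecutive triple of runs...
    for start in range(0, len(trunk), 3):
        t, p = trunk[start:start + 3], patched[start:start + 3]
        print(f"runs {start + 1}-{start + 3}: {relative_diff(t, p):+.2f}%")

    # ...and on all 9 runs together.
    print(f"all runs: {relative_diff(trunk, patched):+.2f}%")
```

With these made-up numbers the first and last triples favour trunk, the middle triple favours the patched build, and the aggregate difference is close to zero, which is why a sample of two or three runs per build is easy to misread.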
[3 Jun 2020 11:18]
Dmitriy Philimonov
Dear Jakub and MySQL Verification Team,

First of all, thank you for your efforts in testing our patch. We appreciate it and fully understand your concern.

Secondly, I've managed to obtain an x86_64 machine (CPU: https://ark.intel.com/content/www/us/en/ark/products/120490/intel-xeon-gold-6150-processor..., 2 socket configuration, powered by EulerOS 2.5SP (kernel 3.10), compiler GCC-10.1) and ran 4 tests (2 without the patch, 2 with the patch, then compared each with one of the original results). Please note that I used our internal modified fork of MySQL 8.0.17, and due to security restrictions I can't share real TPS numbers, sorry about that.

I have to admit that the gain on the x86_64 architecture isn't as obvious as on Kunpeng 920; however, it's still noticeable (especially for OLTP_PS):

| test    | threads |   COLD |  COLD2 |  ORIG2 |
|:-------:|--------:|-------:|-------:|-------:|
| OLTP_PS |       1 |  0.72% |  0.02% | -0.01% |
| OLTP_PS |       4 |  1.55% |  1.55% |  0.76% |
| OLTP_PS |      16 |  0.96% | -0.61% | -1.12% |
| OLTP_PS |      24 |  1.33% |  0.82% | -0.39% |
| OLTP_PS |      32 |  1.70% |  0.92% | -0.05% |
| OLTP_PS |      48 |  2.09% |  1.01% | -0.63% |
| OLTP_PS |      64 |  1.63% |  1.38% | -0.25% |
| OLTP_PS |      96 |  1.84% |  1.62% | -0.22% |
| OLTP_PS |     128 |  2.01% |  1.84% | -0.10% |
| OLTP_PS |     256 |  2.28% |  2.06% |  0.41% |
| OLTP_PS |     512 |  3.68% |  1.74% |  2.03% |
| OLTP_PS |    1024 |  6.84% |  1.21% |  3.68% |
| OLTP_RO |       1 |  2.85% |  2.84% |  0.32% |
| OLTP_RO |       4 |  1.37% |  1.53% | -0.36% |
| OLTP_RO |      16 |  1.66% |  1.91% | -0.15% |
| OLTP_RO |      24 |  1.14% |  1.14% | -0.26% |
| OLTP_RO |      32 |  0.62% |  0.63% | -0.14% |
| OLTP_RO |      48 |  0.44% |  0.36% | -0.23% |
| OLTP_RO |      64 |  0.33% |  0.50% |  0.16% |
| OLTP_RO |      96 |  0.26% |  0.31% |  0.05% |
| OLTP_RO |     128 |  0.33% |  0.34% |  0.04% |
| OLTP_RO |     256 |  0.46% | -0.00% | -0.20% |
| OLTP_RO |     512 |  0.90% |  0.17% | -0.21% |
| OLTP_RO |    1024 | -0.38% | -0.76% |  2.40% |
| OLTP_RW |       1 |  0.04% |  0.13% | -0.32% |
| OLTP_RW |       4 |  1.50% |  1.02% | -0.04% |
| OLTP_RW |      16 |  1.25% |  1.15% | -0.20% |
| OLTP_RW |      24 |  0.70% |  0.74% | -0.01% |
| OLTP_RW |      32 |  0.38% |  0.32% |  0.89% |
| OLTP_RW |      48 |  1.15% |  0.68% |  0.32% |
| OLTP_RW |      64 |  0.89% |  0.60% | -0.03% |
| OLTP_RW |      96 |  0.67% |  0.23% |  0.24% |
| OLTP_RW |     128 |  0.90% |  0.19% |  0.16% |
| OLTP_RW |     256 |  1.18% |  0.01% |  0.87% |
| OLTP_RW |     512 |  1.11% | -0.08% |  0.39% |
| OLTP_RW |    1024 |  1.54% |  0.44% |  1.09% |

We hope that you will be able to reproduce our results and confirm the gain from our patch.
[3 Jun 2020 11:22]
Dmitriy Philimonov
P.S. Legend of the table published above:

* COLD - first run with patch
* COLD2 - second run with patch
* ORIG2 - second run without patch

All data was compared with ORIG (first run without patch, not presented in the table).
[3 Jun 2020 12:58]
MySQL Verification Team
Hi Mr. Philimonov,

Thank you for sharing your findings. I do have to inform you that we can verify only those performance improvement patches that we can fully reproduce on our original version of the server. We believe that your version of MySQL server has that info, but this is a forum ONLY for the unchanged version of our current GA server. We simply cannot accept patches that bring benefit only to some clone of our server. We hope that you can understand this.
[4 Jun 2020 8:55]
Jakub Lopuszanski
Here are the results for the same ARM machine as before, but this time for oltp-ro {uniform,pareto}x{4,512}users:

```
[mysql@ellex04 q-test-root]$ for u in 4 512;do for d in uniform pareto;do echo $d $u; cat links/logs/$u-$d-RO-univ_colder.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10-;done;done
uniform 4
9 1707 < 1718.56 < 1727 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 1706 < 1717.22 < 1726 "mysql-trunk@43a86444"
pareto 4
9 1681 < 1691.67 < 1703 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 1679 < 1690.22 < 1701 "mysql-trunk@43a86444"
uniform 512
9 30279 < 30686.67 < 31145 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 30431 < 30686.44 < 30943 "mysql-trunk@43a86444"
pareto 512
9 30992 < 31228.67 < 31644 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 30106 < 30710.33 < 31030 "mysql-trunk@43a86444"
```

It looks like there is no difference for scenarios other than pareto 512. Actually, the numbers look so close that it seems as if there is some bug in the procedure, but I've checked manually that everything makes sense.

Note that in my tests I applied the patch to the latest trunk, and if I understand correctly, you've applied it to 8.0.17. There were some bug fixes targeting ARM after 8.0.17 was released, among them:

Bug #30401416 RWLOCK:REFINE LOCK->RECURSIVE WITH C11 ATOMICS
Bug #30694177 RW_LOCK_X_LOCK_LOW: CONDITIONAL JUMP OR MOVE...
Bug #30837136 RW_LOCK_X_LOCK_LOW: CONDITIONAL JUMP OR MOVE
Bug #30819167 INNORWLOCKTEST DEADLOCKS ON ARM BECAUSE OF BARRIERS MISSING IN SYNC0DEBUG.CC

The last three of them are AFAIU debug-only, but the first one might affect the release build, too. Also, there were obviously many other speed improvements and bug fixes.

Therefore, let me try to compare mysql-8.0.17 with mysql-8.0.17+contrib.patch.

If I see a gain from the patch, that would mean that something else, probably more important, was fixed since 8.0.17, and the patch does not provide big value for the current trunk, but at least we could verify that all this makes sense. Maybe we could even git bisect to get to the root cause of the problem and what fixed it?

If I see no gain, that would mean I am unable to reproduce the problem on this machine. At which point I'd say I've run out of sane ideas.

As you see, "gain" and "no-gain" both lead to the same result for the patch: I see no compelling reason to include the patch in mysql-trunk, other than the sunk cost fallacy, mysqld's size reduction and appreciating a contribution. (I'm really torn about this, as these are not very bad reasons.)

However, the next experiment would at least lead to some knowledge about what is going on, which might be helpful at least for my future work, and perhaps could inspire someone to upgrade. And there are other possible outcomes beyond "gain" and "no-gain", such as "patch seems to make 8.0.17 run slower" or "results are completely chaotic", which would also provide some info. So, let's see what happens...
[5 Jun 2020 15:27]
Jakub Lopuszanski
OK, so the results I got for 8.0.17 look like there is no difference (or maybe some small degradation), have a look:

[mysql@ellex04 q-test-root]$ for u in 4 512;do for d in pareto uniform;do echo $d $u;cat links/logs/$u-$d-RO-univ_colder.8017.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10- ;done;done
pareto 4
9 1507 < 1510.56 < 1513 "mysql-8.0.17"
9 1498 < 1504.22 < 1508 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 4
9 1522 < 1527.44 < 1531 "mysql-8.0.17"
9 1514 < 1519.11 < 1523 "mysql-8.0.17 + contrib.patch UNIV_COLD"
pareto 512
9 27830 < 28185.00 < 28478 "mysql-8.0.17"
9 27067 < 27870.00 < 28223 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 512
9 27639 < 27962.00 < 28452 "mysql-8.0.17"
9 27167 < 27696.11 < 28040 "mysql-8.0.17 + contrib.patch UNIV_COLD"

[mysql@ellex04 q-test-root]$ for u in 128 1024;do for d in pareto uniform;do echo $d $u;cat links/logs/$u-$d-RW-univ_colder.8017.* | ./summarize.sh |cut -d' ' -f 2,4,5-8,10- ;done;done
pareto 128
9 19267 < 19346.33 < 19398 "mysql-8.0.17"
9 19140 < 19252.22 < 19370 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 128
9 19292 < 19428.00 < 19573 "mysql-8.0.17"
9 19202 < 19302.44 < 19433 "mysql-8.0.17 + contrib.patch UNIV_COLD"
pareto 1024
9 10531 < 10636.89 < 10755 "mysql-8.0.17"
9 10248 < 10536.00 < 10712 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 1024
9 21185 < 21334.00 < 21553 "mysql-8.0.17"
9 21037 < 21205.56 < 21389 "mysql-8.0.17 + contrib.patch UNIV_COLD"

So, the only way to reconcile this with your results is that "it depends on machine/build environment/testing procedure/phase of moon", but it cannot be adequately described as a "clear win!".
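As a rough cross-check of the "no difference (or maybe some small degradation)" reading, the following Python sketch applies the relative-diff formula to the 9-run averages in the 8.0.17 output above and tests whether the [min, max] ranges of the two builds overlap. The numbers are copied from that output; the dictionary layout and function names are assumptions made for illustration, and range overlap is only a crude heuristic, not a statistical test.

```python
# (min, avg, max) of per-run average TPS over 9 runs, copied from the
# mysql-8.0.17 output above: (unpatched, with contrib.patch UNIV_COLD).
RESULTS = {
    "RO pareto 4": ((1507, 1510.56, 1513), (1498, 1504.22, 1508)),
    "RO uniform 4": ((1522, 1527.44, 1531), (1514, 1519.11, 1523)),
    "RO pareto 512": ((27830, 28185.00, 28478), (27067, 27870.00, 28223)),
    "RO uniform 512": ((27639, 27962.00, 28452), (27167, 27696.11, 28040)),
    "RW pareto 128": ((19267, 19346.33, 19398), (19140, 19252.22, 19370)),
    "RW uniform 128": ((19292, 19428.00, 19573), (19202, 19302.44, 19433)),
    "RW pareto 1024": ((10531, 10636.89, 10755), (10248, 10536.00, 10712)),
    "RW uniform 1024": ((21185, 21334.00, 21553), (21037, 21205.56, 21389)),
}


def relative_diff(original_avg, patched_avg):
    """Percent change of the patched average relative to the original."""
    return (patched_avg - original_avg) / original_avg * 100.0


def ranges_overlap(a, b):
    """True if the [min, max] intervals of two (min, avg, max) tuples intersect."""
    return a[0] <= b[2] and b[0] <= a[2]


if __name__ == "__main__":
    for name, (orig, patched) in RESULTS.items():
        diff = relative_diff(orig[1], patched[1])
        note = "overlap" if ranges_overlap(orig, patched) else "disjoint"
        print(f"{name:16s} {diff:+.2f}%  [min,max] ranges {note}")
```

For these numbers every patched average comes out lower than the unpatched one, by roughly 0.4% to 1.1%, while every pair of [min, max] ranges overlaps.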
[9 Jun 2020 12:06]
MySQL Verification Team
Hi Mr. Philimonov, It seems that we are not able to repeat your results on the performance improvement.