MySQL Bugs: #99695: Remove bloat caused by InnoDB logger class

Bug #99695	Remove bloat caused by InnoDB logger class
Submitted:	26 May 2020 13:38	Modified:	9 Jun 2020 12:06
Reporter:	Dmitriy Philimonov
Status:	Can't repeat
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:	Jakub Lopuszanski	CPU Architecture:	Any
Tags:	compiler

Description:
The default implementation of the InnoDB logger class uses std::stringstream internally.
STL streams are notorious for generating a lot of bloated code when compiled.

Using the logger class in hot functions, which is typical for error handling,
generates a lot of code that is inlined right at the invocation point, and
99.99% of the time this code isn't executed. This leads to a huge waste of
instruction cache.

We can tell the compiler that logging is an extremely rare
operation for a database, so that it reduces unnecessary inlining and
pessimizes the 'cold' logging branches, while optimizing the really hot branches
and reducing instruction cache misses.

Profit: in our tests, this optimization results in up to 2% performance
increase in sysbench OLTP PS/RO/RW tests.

How to repeat:
Code analysis

Before (look at format_cpp function): https://godbolt.org/z/xBL663
After (look at format_cpp function): https://godbolt.org/z/sDkc-5

Suggested fix:
Mark all public logger methods and all derived class constructors with the 'cold'
and 'noinline' attributes.
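
To illustrate the suggested fix, here is a minimal sketch of how such attributes might be applied. `SketchLogger`, `LOGGER_COLD` and `checked_divide` are hypothetical names made up for this illustration; this is not the InnoDB logger or the contributed patch, and the attribute syntax assumes GCC or Clang.

```cpp
#include <sstream>
#include <string>

// Hypothetical macro name; the attributes are GCC/Clang extensions.
#if defined(__GNUC__) || defined(__clang__)
#define LOGGER_COLD __attribute__((cold, noinline))
#else
#define LOGGER_COLD
#endif

// Hypothetical stand-in for a stream-based logger (not the InnoDB class).
class SketchLogger {
 public:
  LOGGER_COLD SketchLogger() {}

  // The message is "emitted" on destruction; here it is only stored.
  LOGGER_COLD ~SketchLogger() { last_message() = m_oss.str(); }

  // Non-template overloads carry the attributes directly, so the stream
  // formatting code is requested to be emitted once, out of line.
  LOGGER_COLD SketchLogger &operator<<(const char *s) {
    m_oss << s;
    return *this;
  }
  LOGGER_COLD SketchLogger &operator<<(long v) {
    m_oss << v;
    return *this;
  }

  // Keeps the most recent message so the sketch can be inspected.
  static std::string &last_message() {
    static std::string msg;
    return msg;
  }

 private:
  std::ostringstream m_oss;
};

// A "hot" function whose error branch is rarely taken.
long checked_divide(long a, long b) {
  if (b == 0) {
    SketchLogger() << "division by zero, numerator=" << a;
    return 0;
  }
  return a / b;
}
```

With the attributes in place, the compiler is asked to keep the formatting code out of line and to treat calls to it as unlikely, so the hot path of `checked_divide` should stay compact. Whether that translates into a measurable gain depends on the workload and the machine.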

Fix: innobase storage engine -> InnoDB storage engine

Hello Mr. Philimonov,

We have made a patch and have run dozens of OLTP benchmarks on both patched and unpatched sources. The variance between the tests was so large that more than half of the runs with the unpatched source were faster than those with the source patched according to your ideas.

In short, these benchmarks cannot be relied upon.

Hence, we need some additional feedback from you.

1. Could you let us know how many benchmark runs you made on the original and the changed sources? Which benchmarks did you run? What was the average gain for the patched source?

2. Both links that you have sent us on godbolt.org show that you used only ARM64. We do need benchmarks for x64, which is the CPU architecture present in 99% of the machines running our server.

3. Your testing environment could be more powerful and thus allow for more queries per second, which might help to smooth out the variance thanks to the law of large numbers. To put it the other way round: testing a smaller number of queries on a slower machine could cause more noise.

4. Could you please also rerun the tests and provide all the numbers from both experiments, at the minimum: the architecture, the duration, and the average TPS that you got.

We cannot proceed further without this information. Thanks in advance.

Dear MySQL Verification Team,

 1. We use sysbench (~40min per patch):
    * test data: 10 tables with 1M records each
    * workload: OLTP_PS/OLTP_RO/OLTP_RW
    * duration: 60s for each of 1/4/16.../1024 threads for each workload
    We had 4 runs in total, all gave positive results.
 2. Unfortunately, I'm not authorized to share absolute numbers with you, so the table shows only relative performance improvements from the patch.
 3. The benchmarks were indeed performed on an ARM machine, which is an officially supported architecture for MySQL 8.0. On X86 machines there might be a similar or lower effect from these optimizations, but I did not have an opportunity to verify that. However, reduction in code size for infrequently executed code branches is a reasonable optimization that might lead to better cache locality and better performance even in new code that involves checks for error conditions and error logging.
 4. As to the hardware configuration, it was a 128-core Kunpeng 920 machine. mysqld uses 64 cores from this configuration: https://e.huawei.com/en/products/servers/taishan-server/taishan-2280-v2

| test    | threads | cold_noinline_diff |
|:-------:|--------:|-------------------:|
| OLTP_PS |       1 |              2.59% |
| OLTP_PS |       4 |              3.98% |
| OLTP_PS |      16 |              4.38% |
| OLTP_PS |      24 |              3.48% |
| OLTP_PS |      32 |              4.25% |
| OLTP_PS |      48 |              3.28% |
| OLTP_PS |      64 |              2.82% |
| OLTP_PS |      96 |              3.37% |
| OLTP_PS |     128 |              5.41% |
| OLTP_PS |     256 |              5.00% |
| OLTP_PS |     512 |              4.83% |
| OLTP_PS |    1024 |              4.13% |
| OLTP_RO |       1 |              0.09% |
| OLTP_RO |       4 |              2.00% |
| OLTP_RO |      16 |              1.49% |
| OLTP_RO |      24 |              1.79% |
| OLTP_RO |      32 |              1.61% |
| OLTP_RO |      48 |              1.35% |
| OLTP_RO |      64 |              1.74% |
| OLTP_RO |      96 |              1.41% |
| OLTP_RO |     128 |              1.98% |
| OLTP_RO |     256 |              1.79% |
| OLTP_RO |     512 |              2.13% |
| OLTP_RO |    1024 |              1.64% |
| OLTP_RW |       1 |              7.14% |
| OLTP_RW |       4 |              0.91% |
| OLTP_RW |      16 |              2.23% |
| OLTP_RW |      24 |              1.68% |
| OLTP_RW |      32 |              1.92% |
| OLTP_RW |      48 |              1.74% |
| OLTP_RW |      64 |              0.42% |
| OLTP_RW |      96 |              0.73% |
| OLTP_RW |     128 |              0.62% |
| OLTP_RW |     256 |              1.17% |
| OLTP_RW |     512 |              1.11% |
| OLTP_RW |    1024 |              0.67% |

P.S. diff is calculated from TPS: (patched.tps-original.tps)/original.tps*100%

Sincerely yours,
Dmitriy Philimonov

P.P.S. test data: 10 tables with 1M records each, fully cached in the buffer pool.
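
For readers who want to recompute such a table from their own measurements, the diff formula from the P.S. above can be written as a small helper. This is only a sketch; the function name `relative_diff_percent` is an assumption made for illustration, and no absolute TPS figures from the report are implied.

```cpp
// Relative TPS change in percent:
// (patched.tps - original.tps) / original.tps * 100
double relative_diff_percent(double original_tps, double patched_tps) {
  return (patched_tps - original_tps) / original_tps * 100.0;
}
```

A positive value means the patched build was faster in that particular pair of runs. A single pair says nothing about run-to-run noise.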

Hi,

I would like to inform you that we have run tests 99% identical to yours, only on an Intel platform.

Results are totally inconclusive.

Hence, can you provide the other data and run the additional tests that we asked for in our previous comment?

Hi,
First of all, I'd like to say that I really appreciate the patch and the testing effort. It's just that we want to avoid pushing changes to code which *WE* can not prove to have a performance impact. And this is why we are trying to replicate your results.

Yesterday, I took an ARM machine (ellex04, ARM64, 2S, 64cores, 4TH per core = 256 vcpu in total) and conducted {pareto,uniform}x{128,1024}users sysbench oltp-rw tests on it, running each version of the code 9 times for each of these 2x2=4 scenarios for 300 seconds with 60 seconds warmup ( --warmup-time=60 --time=300). 
Here are my results:

[mysql@ellex04 q-test-root]$ for u in 128 1024;do for d in uniform pareto;do echo $d $u; cat links/logs/$u-$d-univ_colder.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10-;done;done
uniform 128
9 20844 < 21041.56 < 21141 "mysql-trunk@43a86444"
9 20870 < 21006.78 < 21156 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
pareto 128
9 20775 < 21038.33 < 21183 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 20753 < 21012.22 < 21144 "mysql-trunk@43a86444"
uniform 1024
9 23324 < 23648.67 < 23888 "mysql-trunk@43a86444"
9 23247 < 23623.56 < 23973 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
pareto 1024
9 19607 < 19724.56 < 19830 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 19551 < 19667.89 < 19779 "mysql-trunk@43a86444"

[number of repetitions] [minimum for 9 runs of run's avg TPS] < [avg for 9 runs of run's avg TPS] < [maximum for 9 runs of run's avg TPS] [version name]

As you can see, overall there is not much difference in TPS observed.
And the binary indeed is smaller (by 1MB) so I can rule out the possibility that I was mistakenly comparing the same version to itself, etc.

Please note how large the [min,max] spread is, which reflects quite large noise and variance from run to run, which in turn might lead us to wrong conclusions if the sample is too small. For example, if instead of looking at all 9 runs I had focused only on the first 3, it would look like the trunk is faster by 0.6% for "uniform 1024". For the second three runs the patched version is faster by 1%. For the last triple the trunk is faster by 0.7%. Only by aggregating all 9 runs does it become obvious that there is no relevant difference.

This is why I think it would be great if you could repeat the experiments and confirm that the results replicate.

Also, it is just a coincidence that I usually test oltp-rw for 128 and 1024 users, which unfortunately are the cases that seem to have among the smallest differences in your report. Sorry, this wasn't intentional - I'll run oltp-ro for 4 and 512 users today.
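
The point about the [min,max] spread can be sketched in code. The snippet below computes the same "min < avg < max" summary for a set of per-run average TPS values and applies a deliberately crude check: a difference between two means is treated as noise if it is smaller than the larger of the two spreads. The names `RunSummary`, `summarize` and `within_noise` are hypothetical, the check is an assumption made for illustration rather than a proper statistical test, and it is not what summarize.sh does internally.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Summary of the per-run average TPS values of one build.
struct RunSummary {
  double min_tps;
  double mean_tps;
  double max_tps;
};

// Computes min, mean and max; the input must be non-empty.
RunSummary summarize(const std::vector<double> &tps) {
  const auto mm = std::minmax_element(tps.begin(), tps.end());
  const double sum = std::accumulate(tps.begin(), tps.end(), 0.0);
  return {*mm.first, sum / tps.size(), *mm.second};
}

// Crude heuristic: true when the gap between the two means is smaller
// than the larger of the two [min,max] spreads.
bool within_noise(const RunSummary &a, const RunSummary &b) {
  const double spread =
      std::max(a.max_tps - a.min_tps, b.max_tps - b.min_tps);
  return std::fabs(a.mean_tps - b.mean_tps) < spread;
}
```

Applied to the "uniform 1024" rows above (23324 < 23648.67 < 23888 versus 23247 < 23623.56 < 23973), the means differ by about 25 TPS while the spreads are several hundred TPS, so this check classifies the difference as noise.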

Dear Jakub and MySQL verification team,

  First of all, thank you for your efforts in testing our patch. We appreciate it and fully understand your concern.

  Secondly, I've managed to obtain an x86_64 machine (CPU: https://ark.intel.com/content/www/us/en/ark/products/120490/intel-xeon-gold-6150-processor..., 2-socket configuration, running EulerOS 2.5SP (kernel 3.10), compiler GCC-10.1) and ran 4 tests (2 without the patch, 2 with the patch, then compared each with one of the original results).

  Please note that I used our internal modified fork of MySQL 8.0.17 and, due to security restrictions, I can't share real TPS numbers, sorry about that.

  I have to admit that the profit on the x86_64 architecture isn't as obvious as on Kunpeng 920; however, it's still noticeable (especially for OLTP_PS):

|   test  | threads |  COLD  |  COLD2 |  ORIG2 |
|:-------:|--------:|-------:|-------:|-------:|
| OLTP_PS |       1 |  0.72% |  0.02% | -0.01% |
| OLTP_PS |       4 |  1.55% |  1.55% |  0.76% |
| OLTP_PS |      16 |  0.96% | -0.61% | -1.12% |
| OLTP_PS |      24 |  1.33% |  0.82% | -0.39% |
| OLTP_PS |      32 |  1.70% |  0.92% | -0.05% |
| OLTP_PS |      48 |  2.09% |  1.01% | -0.63% |
| OLTP_PS |      64 |  1.63% |  1.38% | -0.25% |
| OLTP_PS |      96 |  1.84% |  1.62% | -0.22% |
| OLTP_PS |     128 |  2.01% |  1.84% | -0.10% |
| OLTP_PS |     256 |  2.28% |  2.06% |  0.41% |
| OLTP_PS |     512 |  3.68% |  1.74% |  2.03% |
| OLTP_PS |    1024 |  6.84% |  1.21% |  3.68% |
| OLTP_RO |       1 |  2.85% |  2.84% |  0.32% |
| OLTP_RO |       4 |  1.37% |  1.53% | -0.36% |
| OLTP_RO |      16 |  1.66% |  1.91% | -0.15% |
| OLTP_RO |      24 |  1.14% |  1.14% | -0.26% |
| OLTP_RO |      32 |  0.62% |  0.63% | -0.14% |
| OLTP_RO |      48 |  0.44% |  0.36% | -0.23% |
| OLTP_RO |      64 |  0.33% |  0.50% |  0.16% |
| OLTP_RO |      96 |  0.26% |  0.31% |  0.05% |
| OLTP_RO |     128 |  0.33% |  0.34% |  0.04% |
| OLTP_RO |     256 |  0.46% | -0.00% | -0.20% |
| OLTP_RO |     512 |  0.90% |  0.17% | -0.21% |
| OLTP_RO |    1024 | -0.38% | -0.76% |  2.40% |
| OLTP_RW |       1 |  0.04% |  0.13% | -0.32% |
| OLTP_RW |       4 |  1.50% |  1.02% | -0.04% |
| OLTP_RW |      16 |  1.25% |  1.15% | -0.20% |
| OLTP_RW |      24 |  0.70% |  0.74% | -0.01% |
| OLTP_RW |      32 |  0.38% |  0.32% |  0.89% |
| OLTP_RW |      48 |  1.15% |  0.68% |  0.32% |
| OLTP_RW |      64 |  0.89% |  0.60% | -0.03% |
| OLTP_RW |      96 |  0.67% |  0.23% |  0.24% |
| OLTP_RW |     128 |  0.90% |  0.19% |  0.16% |
| OLTP_RW |     256 |  1.18% |  0.01% |  0.87% |
| OLTP_RW |     512 |  1.11% | -0.08% |  0.39% |
| OLTP_RW |    1024 |  1.54% |  0.44% |  1.09% |

We hope that you will reproduce our results and prove the profit from our patch.

P.S. Legend of the table published above:
  * COLD  - first run with patch
  * COLD2 - second run with patch
  * ORIG2 - second run without patch
All data was compared with ORIG (first run without patch, not presented in the table).

Hi Mr. Philimonov,

Thank you for sharing your findings.

I do have to inform you that we can verify only those performance improvement patches that we can fully reproduce on our original version of the server.

We believe that your version of MySQL server has that info, but this is a forum ONLY for the unchanged version of our current GA server.

We simply can not accept patches that bring benefit only to some clone of our server.

Hope that you can understand this.

Here are the results for the same ARM machine as before, but this time for oltp-ro {uniform,pareto}x{4,512}users:
```
[mysql@ellex04 q-test-root]$ for u in 4 512;do for d in uniform pareto;do echo $d $u; cat links/logs/$u-$d-RO-univ_colder.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10-;done;done
uniform 4
9 1707 < 1718.56 < 1727 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 1706 < 1717.22 < 1726 "mysql-trunk@43a86444"
pareto 4
9 1681 < 1691.67 < 1703 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 1679 < 1690.22 < 1701 "mysql-trunk@43a86444"
uniform 512
9 30279 < 30686.67 < 31145 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 30431 < 30686.44 < 30943 "mysql-trunk@43a86444"
pareto 512
9 30992 < 31228.67 < 31644 "mysql-trunk@43a86444 + contrib.patch UNIV_COLD"
9 30106 < 30710.33 < 31030 "mysql-trunk@43a86444"

```
Looks like there is no difference for scenarios other than pareto 512. Actually, the numbers look so close that it seems as if there were some bug in the procedure, but I've checked manually that everything makes sense.

Note that in my tests I applied the patch to the latest trunk, and if I understand correctly, you've applied it to 8.0.17.
There were some bug-fixes targeting ARM after 8.0.17 was released, among them:
    Bug #30401416 RWLOCK:REFINE LOCK->RECURSIVE WITH C11 ATOMICS
    Bug #30694177 RW_LOCK_X_LOCK_LOW: CONDITIONAL JUMP OR MOVE...
    Bug #30837136 RW_LOCK_X_LOCK_LOW: CONDITIONAL JUMP OR MOVE
    Bug #30819167 INNORWLOCKTEST DEADLOCKS ON ARM BECAUSE OF BARRIERS MISSING IN SYNC0DEBUG.CC
The last three of them are AFAIU debug-only, but the first one might affect release build, too.
Also, there were obviously many other speed improvements and bug-fixes.

Therefore, let me try to compare mysql-8.0.17 with mysql-8.0.17+contrib.patch.
If I see gain from the patch, that would mean that something else, probably more important, was fixed since 8.0.17, and the patch does not provide big value for the current trunk, but at least we could verify that all this makes sense. Maybe even we could git bisect to get to the root cause of the problem and what fixed it?
If I see no gain, that would mean I am unable to reproduce the problem on this machine. At which point I'd say I've run out of sane ideas.

As you see, "gain" and "no-gain" both lead to the same result for the patch: I see no compelling reason to include the patch in mysql-trunk, other than the sunk cost fallacy, mysqld's size reduction and appreciating a contribution.
(I'm really torn about this, as these are not very bad reasons)

However, the next experiment would at least lead to some knowledge about what is going on, which might be helpful at least for my future work, and perhaps could inspire someone to upgrade.
And there are other possible outcomes beyond "gain" and "no-gain", such as "patch seems to make 8.0.17 run slower" or "results are completely chaotic" which would also provide some info.

So, let's see what happens...

OK, so the results I got for 8.0.17 look like there is no difference (or maybe some small degradation), have a look:

[mysql@ellex04 q-test-root]$ for u in 4 512;do for d in pareto uniform;do echo $d $u;cat links/logs/$u-$d-RO-univ_colder.8017.* | ./summarize.sh | cut -d' ' -f 2,4,5-8,10-  ;done;done
pareto 4
9 1507 < 1510.56 < 1513 "mysql-8.0.17"
9 1498 < 1504.22 < 1508 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 4
9 1522 < 1527.44 < 1531 "mysql-8.0.17"
9 1514 < 1519.11 < 1523 "mysql-8.0.17 + contrib.patch UNIV_COLD"
pareto 512
9 27830 < 28185.00 < 28478 "mysql-8.0.17"
9 27067 < 27870.00 < 28223 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 512
9 27639 < 27962.00 < 28452 "mysql-8.0.17"
9 27167 < 27696.11 < 28040 "mysql-8.0.17 + contrib.patch UNIV_COLD"
[mysql@ellex04 q-test-root]$ for u in 128 1024;do for d in pareto uniform;do echo $d $u;cat links/logs/$u-$d-RW-univ_colder.8017.* | ./summarize.sh |cut -d' ' -f 2,4,5-8,10-   ;done;done
pareto 128
9 19267 < 19346.33 < 19398 "mysql-8.0.17"
9 19140 < 19252.22 < 19370 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 128
9 19292 < 19428.00 < 19573 "mysql-8.0.17"
9 19202 < 19302.44 < 19433 "mysql-8.0.17 + contrib.patch UNIV_COLD"
pareto 1024
9 10531 < 10636.89 < 10755 "mysql-8.0.17"
9 10248 < 10536.00 < 10712 "mysql-8.0.17 + contrib.patch UNIV_COLD"
uniform 1024
9 21185 < 21334.00 < 21553 "mysql-8.0.17"
9 21037 < 21205.56 < 21389 "mysql-8.0.17 + contrib.patch UNIV_COLD"

So, the only way to reconcile this with your results is that "it depends on machine/build environment/testing procedure/phase of moon" but can not be adequately described as "clear win!".

Hi Mr. Philimonov,

It seems that we are not able to repeat your results on the performance improvement.