Bug #19974 | Performance issue of bmove512 | ||
---|---|---|---|
Submitted: | 20 May 2006 21:22 | Modified: | 26 Mar 2011 1:52 |
Reporter: | Gunnar von Boehn | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server | Severity: | S1 (Critical) |
Version: | all | OS: | MacOS (MacOS) |
Assigned to: | | CPU Architecture: | Any |
Tags: | Contribution |
[20 May 2006 21:22]
Gunnar von Boehn
[31 May 2006 22:49]
Jorge del Conde
Hi! Is there any way you could send me a simple test case that I can compile and run on a PPC machine to show this behaviour? Thanks a lot!
[5 Jun 2006 17:04]
Gunnar von Boehn
Sure.

Here is a little benchmark compiled for Apple Mac OS X (PPC):
http://www.greyhound-data.com/gunnar/glibc/stream_OSX

Here is the benchmark (with all tests enabled) compiled for Linux PPC:
http://www.greyhound-data.com/gunnar/glibc/stream02.gz

Here is the source for the Linux version:
http://www.greyhound-data.com/gunnar/glibc/stream_memspeed02.tgz

You can find many results on this page:
http://www.greyhound-data.com/gunnar/glibc/index.htm?page=benchmarks

The MySQL bmove512 performs about as well (or as badly) as an unoptimized Linux glibc memcpy function. The Apple Mac OS X memcpy is nearly twice as fast. These bar charts may show this better:
http://www.greyhound-data.com/gunnar/glibc/membench_memcpy.gif
http://www.greyhound-data.com/gunnar/glibc/memspeed_linux_vs_osx.gif

On Mac OS X you should always use Apple's functions. On a G3 and a G4 the Apple memcpy will give you about 180% of the performance of the MySQL function. On the G5 the gain from using Apple's function will be smaller, as the G5 spends a lot of silicon on optimizing suboptimal code.

BTW, this is not limited to PPC. The MySQL bmove512 runs only about half as fast as it could on an AMD K7.

Cheers
Gunnar
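For readers who cannot fetch the binaries above, a comparison of this kind can be sketched in a few lines of C. This is only a minimal sketch and not the linked benchmark: `copy_words`, `run_copy_words`, `run_memcpy` and `time_copy` are hypothetical names, and `copy_words` is merely a stand-in for a simple word-per-word copy, not the MySQL source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a simple word-per-word copy loop. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}

/* Common signature so both routines can be timed by the same harness. */
typedef void (*copy_fn)(void *dst, const void *src, size_t nbytes);

static void run_copy_words(void *dst, const void *src, size_t nbytes)
{
    /* The word loop only handles whole 32-bit words. */
    assert(nbytes % sizeof(uint32_t) == 0);
    copy_words(dst, src, nbytes / sizeof(uint32_t));
}

static void run_memcpy(void *dst, const void *src, size_t nbytes)
{
    memcpy(dst, src, nbytes);
}

/* Returns the CPU time in seconds used by `reps` copies of `nbytes` bytes. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t nbytes, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src, nbytes);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copy(run_copy_words, ...)` and `time_copy(run_memcpy, ...)` on the same buffers gives two times to compare. Note that an optimizing compiler may turn the simple loop into a call to memcpy, so the generated code should be checked before trusting the numbers.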
[12 Jul 2006 15:44]
Gunnar von Boehn
Hi Mark, I haven't got a response from you. Was the information enough for you or do you need more? Can I help you with anything? Cheers Gunnar
[14 Sep 2006 14:43]
Gunnar von Boehn
Today I have changed the severity of this bug to Critical. I felt that I needed to do this to draw your attention to it. I apologize for doing this.

I fear that you are accidentally overlooking performance-related bug reports and patches. I have sent numerous performance-related patches, just like this one, that usually received no response at all or no reply for many months.

We know that one of MySQL's three main attributes is "Performance". I believe that performance-related patches and reports are important to MySQL. This bug report provides you with a solution to speed up important parts of MySQL by 100%. The solution was already benchmarked for you and should involve close to zero work for you to implement.

If the reason you did not react to the report was that it was not helpful or lacked some information that you needed, then please tell me, so that I can provide you with the information and improve any future reports. If you, in general, cannot or do not want to implement performance-related patches, then please tell me that too.

Spending time on benchmarking and improving MySQL is very unsatisfying when these reports are not looked at, or only looked at many months or years later. It is certainly more difficult for you to apply patches that were written for a MySQL version that was current many months earlier, and it will be very difficult for the reporter to answer any questions or provide you with more information after some time.

Many thanks for your attention.

Kind regards
Gunnar von Boehn
[15 Sep 2006 8:43]
Mark Leith
Sorry about the delay with this. I have re-assigned the bug to somebody with a more appropriate test machine (I have an Intel Mac). Best regards, Mark
[15 Sep 2006 15:34]
Elliot Murphy
Updating Severity to S5 to reflect that this is a performance issue.
[6 Feb 2007 16:38]
Gunnar von Boehn
I think it is important to change the priority of this report to CRITICAL again now. You seem to have completely forgotten about this again. Another 5 months have passed and this "one line change" that increases key copy performance on Mac OS by 100% has not been looked at. BTW, there is another patch that increases key copy performance on Linux which you have not looked at for about 9 months either.
[12 Feb 2007 13:29]
Lenz Grimmer
Hi Gunnar,

First of all I'd like to apologize for the delay. I can understand your frustration and would like to help move this issue forward.

I've taken a look at http://www.greyhound-data.com/gunnar/glibc/bmove512.c but could not find any difference from the implementation used inside the MySQL source tree (I looked at the 5.0 source). Note that I am not a hardcore C developer; I am just trying to figure out what we need to do here in order to finally get some traction on this issue.

What exactly do you propose? I could not find an explicit patch on your site that would have given me a hint about what needs to be changed. Could you elaborate in layman's terms (for people like me) what you mean by "one line change"? Do we need to expand our autoconf/automake scripts to check for the system's memcpy routine? Do we need to patch bmove512.c somehow? Anything else that requires modification?

And can you point me to that "another patch that increases key copy performance"? Is there a separate bug ID? I'd like to take a look at this one, too.

Again, I am sorry for the delay. I hope we can get this issue resolved quickly.
[12 Feb 2007 14:20]
Gunnar von Boehn
Hi Lenz,

Many thanks for getting back to me. Okay, I'll explain again why the current MySQL routine performs badly.

The current bmove512 copies the memory word by word in a rather simple loop. This routine was probably good on a CPU like the 80486. But since the introduction of 2nd level cache, and especially since that cache moved into the CPU core, such a routine performs very badly on many CPUs.

CPUs that have a 2nd level cache have to consider a number of things when they copy memory. Memory alignment is one criterion; others are data prefetching and streaming memory without flushing the caches. CPUs with a 2nd level cache generally provide ASM instructions for prefetching and for streaming memory copies. If you do not use these CPU-specific ASM instructions but use such a simple copy loop, then your performance will be totally disappointing on many CPUs. This is true for popular CPUs like the PPC G3, PPC G4, Athlon and Pentium.

Using a simple copy like bmove512 has two big disadvantages:

a) It is slow, as most CPUs will suffer read stalls with it. You need to use cache prefetching instructions to prevent this. Without prefetching, your copy will run at half speed on many CPUs.

b) It will flush out your 2nd level cache. So if you copy some MB of memory with it, you will have flushed out all cached code of your application. Depending on the circumstances this will create another serious performance penalty.

Luckily, on Mac OS the solution is rather simple. Mac OS is aware of the features of your CPU and will install a memcpy function that uses them. Bingo! So on Mac OS just use the memcpy function from Mac OS X instead of your own. Using the Mac OS X function will increase the performance a lot. Depending on the computer, it will up to double the throughput. So the solution on Mac is simple.

On Linux the problem is actually the same, only that most glibc memcpy functions are not much better than the copy function used by MySQL. At least this was the case one year ago when I reported this. Maybe this has changed by now. One year ago, the best approach on Linux was to use a self-written memcpy. If you look into Mac OS you will see a very good implementation of it. Some open-source software like mplayer includes optimized memcpy routines that are about 100% faster than the MySQL routine. And AMD and Intel have published some recommended routines too.

Off the top of my head, I would recommend a routine like the one below for use on x86. It uses prefetching and should be about twice as fast as the MySQL routine because of this. And it does NOT flush out the 2nd level cache like the MySQL routine does, so your server code will stay in cache even after copying some memory.

I very much hope this helps. I'll double-check the status on Linux and my copy routines again, and I'll get back to you later this week about Linux. But on Mac the solution is simple, I think.

Cheers
Gunnar

    void memcpy_x86(void *dst, void *src, int nbytes)
    {
        asm (
            "  mov esi, src            \n"
            "  mov edi, dst            \n"
            "  mov ecx, nbytes         \n"
            "  shr ecx, 6              \n"
            "                          \n"
            "loop1:                    \n"
            "  prefetchnta 64[ESI]     \n"
            "  prefetchnta 96[ESI]     \n"
            "                          \n"
            "  movq mm1,  0[ESI]       \n"
            "  movq mm2,  8[ESI]       \n"
            "  movq mm3, 16[ESI]       \n"
            "  movq mm4, 24[ESI]       \n"
            "  movq mm5, 32[ESI]       \n"
            "  movq mm6, 40[ESI]       \n"
            "  movq mm7, 48[ESI]       \n"
            "  movq mm0, 56[ESI]       \n"
            "                          \n"
            "  movntq  0[EDI], mm1     \n"
            "  movntq  8[EDI], mm2     \n"
            "  movntq 16[EDI], mm3     \n"
            "  movntq 24[EDI], mm4     \n"
            "  movntq 32[EDI], mm5     \n"
            "  movntq 40[EDI], mm6     \n"
            "  movntq 48[EDI], mm7     \n"
            "  movntq 56[EDI], mm0     \n"
            "                          \n"
            "  add esi, 64             \n"
            "  add edi, 64             \n"
            "  dec ecx                 \n"
            "  jnz loop1               \n"
            "                          \n"
            "  emms                    \n"
        );
    }
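The Mac OS suggestion above amounts to routing the copy through the system memcpy when building for that platform. A minimal sketch of the idea follows, under stated assumptions: `my_bmove512` is a hypothetical name rather than the MySQL symbol, the 512-byte length contract and the aligned buffers are assumed, and the non-Apple branch is only a stand-in for a simple word-per-word copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper; my_bmove512 is an illustrative name only. */
static void my_bmove512(void *to, const void *from, size_t len)
{
    /* Assumed contract: the length is a multiple of 512 bytes. */
    assert(len % 512 == 0);
#if defined(__APPLE__)
    /* On Mac OS X, defer to the system memcpy, which is tuned per CPU. */
    memcpy(to, from, len);
#else
    /* Elsewhere, fall back to a plain word-per-word loop.
       Assumes both buffers are aligned for unsigned long. */
    unsigned long *d = to;
    const unsigned long *s = from;
    for (size_t i = 0; i < len / sizeof(unsigned long); i++)
        d[i] = s[i];
#endif
}
```

Whether the platform test belongs in the source like this or in the configure scripts is a build-system choice; the sketch only shows the switch itself.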
[12 Feb 2007 14:38]
Gunnar von Boehn
Lenz, again thanks for getting back to me.

Yes, you are right: when compiling for OS X you should use the OS X function for copying memory. It is much faster than MySQL's own.

The bug report for Linux is http://bugs.mysql.com/bug.php?id=19975

For Linux it is probably still advisable to write the copy function yourself. You can expect a 100% performance increase with the example posted above. I'll recheck the details and get back to you again this week.

One thing is worth keeping in mind. The bmove512 is only one example of where MySQL works with memory. There are of course many more functions in MySQL that scan or process bigger amounts of memory. As far as I know, MySQL does not use any of the streaming techniques that became important with CPUs having 2nd level cache in ANY function at all. Using streaming starts to pay off if you copy or scan over 100 or 200 bytes. If you process only a few dozen KB, then streaming will already be twice as fast on many CPUs.

I think that many functions in MySQL have the potential to gain a nice speed increase by using these techniques. If you are interested in speeding up some more functions, then I will be happy to exchange some ideas about this. But this should be done after this simple bmove.

Cheers
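The size threshold mentioned above can be illustrated with a small dispatcher that uses an ordinary copy for short blocks and non-temporal stores for longer ones. This is a rough sketch only: it swaps in SSE2 intrinsics (`_mm_stream_si128`) for the MMX `movntq` used in the earlier example, the 256-byte cutoff is an assumed value rather than a measured one, and `copy_streaming` and `copy_dispatch` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define STREAM_THRESHOLD 256 /* assumed cutoff; tune by benchmarking */

/* Copies n bytes using non-temporal stores where SSE2 is available. */
static void *copy_streaming(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy leading bytes until the destination is 16-byte aligned,
       as required by _mm_stream_si128. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Stream 16 bytes at a time, bypassing the cache on the store side. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16; s += 16; n -= 16;
    }
    _mm_sfence();    /* make the streamed stores globally visible */
    memcpy(d, s, n); /* remaining tail of fewer than 16 bytes */
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}

/* Picks the plain or the streaming copy based on the block size. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n < STREAM_THRESHOLD)
        return memcpy(dst, src, n);
    return copy_streaming(dst, src, n);
}
```

Streaming stores help most when the copied data will not be read again soon; for data that is reused immediately, keeping it in cache with a normal copy may be the better choice.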
[4 Sep 2007 14:48]
Gunnar von Boehn
Replacing bmove512 with the optimized function given above will increase throughput on AMD and Intel CPUs by 50% for bigger copies.

In case you had problems reading the ASM example above, here is a drop-in replacement for your memory copy using GCC syntax. I'm looking forward to seeing it used to improve MySQL performance!

    #include <stddef.h>
    #include <string.h>

    void *memcpy_mmx(void *dst, const void *src, size_t len)
    {
        char *d = dst;
        const char *s = src;
        size_t i;

        for (i = 0; i < len / 64; i++) {
            __asm__ __volatile__ (
                "prefetchnta 64(%0)\n"        /* prefetch source 64 bytes ahead */
                "prefetchnta 96(%0)\n"        /* prefetch source 96 bytes ahead */
                "\tmovq   (%0), %%mm0\n"      /* load 64 bytes into MMX registers */
                "\tmovq  8(%0), %%mm1\n"
                "\tmovq 16(%0), %%mm2\n"
                "\tmovq 24(%0), %%mm3\n"
                "\tmovq 32(%0), %%mm4\n"
                "\tmovq 40(%0), %%mm5\n"
                "\tmovq 48(%0), %%mm6\n"
                "\tmovq 56(%0), %%mm7\n"
                "\tmovntq %%mm0,   (%1)\n"    /* store 64 bytes */
                "\tmovntq %%mm1,  8(%1)\n"    /* non-temporal stores do not thrash the cache, */
                "\tmovntq %%mm2, 16(%1)\n"    /* so the data cache content is preserved */
                "\tmovntq %%mm3, 24(%1)\n"
                "\tmovntq %%mm4, 32(%1)\n"
                "\tmovntq %%mm5, 40(%1)\n"
                "\tmovntq %%mm6, 48(%1)\n"
                "\tmovntq %%mm7, 56(%1)\n"
                :
                : "r" (s), "r" (d)
                : "memory", "%mm0", "%mm1", "%mm2", "%mm3",
                  "%mm4", "%mm5", "%mm6", "%mm7");
            s += 64;
            d += 64;
        }
        /* order the non-temporal stores, then switch MMX back to FPU mode */
        __asm__ __volatile__ ("sfence\n\temms" : : : "memory");
        memcpy(d, s, len % 64);               /* copy the remaining tail bytes */
        return dst;
    }
[26 Mar 2011 1:52]
Sveta Smirnova
bmove512 does not exist in the 5.6.3 and 5.5 series, so closing as "Can't repeat".