MySQL Bugs: #19975: Performance of memory functions on PowerPC and x86

Bug #19975	Performance of memory functions on PowerPC and x86
Submitted:	20 May 2006 22:12	Modified:	29 Mar 2011 20:37
Reporter:	Gunnar von Boehn
Status:	Not a Bug
Category:	MySQL Server: General	Severity:	S1 (Critical)
Version:	all	OS:	Linux (Linux)
Assigned to:	Chad MILLER	CPU Architecture:	Any
Tags:	Contribution

Description:
The performance of the Linux glibc memory functions
and the performance of MySQL's internal functions like bmove512
is unfortunately very disappointing on PowerPC.

The Linux glibc functions, for example memcmp, memcpy, memset and strcmp,
often reach less than 60% of what Apple's or Freescale's implementations achieve.

This is also true for MySQL internal functions like bmove512.
On a G4 the MySQL bmove512 function reaches only about 60% of
the memory throughput that Freescale's reference memcpy achieves.

For benchmarks on the memcpy/bmove512 on PowerPC please go here:
http://www.greyhound-data.com/gunnar/glibc/

There are a few ways to easily improve the memory throughput of MySQL on PowerPC:
a) Look at the functions of Mac OS X. Apple has excellent optimized functions that will run optimally on each PowerPC generation (G3/G4/G5) 
b) Use Freescale glibc reference implementations for Altivec enabled PowerPC (G4+)
http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=02VS0l81285Nf2d9nb
c) Use the freevec glibc replacements for G4+ CPUs

I hope you will find this proposal useful.

Cheers
Gunnar

How to repeat:
For benchmarks on the memcpy/bmove512 on PowerPC please go here:
http://www.greyhound-data.com/gunnar/glibc/

Suggested fix:
To greatly improve the memory throughput of MySQL on PowerPC,
please consider linking the MySQL server against the optimized glibc functions provided by Freescale.

To improve the performance of the MySQL function bmove512,
please consider using a PPC-optimized memcpy instead.

Another simple but very effective replacement for bmove512 would be the copy routine below.
It will run on all PowerPCs and achieve up to 75% higher memory throughput on G4+ CPUs.

--
// FC64
// very simple memcpy, using 64-bit float registers and data cache prefetching
// loop unrolled to copy 64 bytes (2 cache lines) per iteration
// the copy gains a lot of speed by prefetching the next 64 bytes
// while the current 64 bytes are copied.
// note: only whole 64-byte blocks are copied; a remainder (num % 64) is not handled here

#define dcba dcbz
#define dcbi dcbf

#define r0 0
#define r1 1
#define r2 2
#define r3 3
#define r4 4
#define r5 5

#define c96 9
#define c128 10

#define f1 1
#define f2 2
#define f3 3
#define f4 4
#define f5 5
#define f6 6
#define f7 7
#define f8 8

#define dst r3
#define src r4
#define num r5

.global memcpy_asmFC64
memcpy_asmFC64:
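  srawi. 0,r5,6                 // r0 = num / 64 (number of 64-byte blocks), sets CR0
  mtctr 0                       // loop counter = number of blocks
  bclr 4,1                      // return if the block count is <= 0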

   li   c96,96
   li   c128,128
  .align  5
loop_top:                       // copy 64 bytes per iteration
  dcbt  c96,src                 // prefetch the next two cache lines
  dcbt  c128,src                //
  lfd   f1, 0(src)                 // read 64 bytes
  lfd   f2, 8(src)
  lfd   f3,16(src)
  lfd   f4,24(src)
  lfd   f5,32(src)
  lfd   f6,40(src)
  lfd   f7,48(src)
  lfd   f8,56(src)
  addi  src,src,64
 
  stfd  f1, 0(dst)             // store 64 bytes
  stfd  f2, 8(dst)
  stfd  f3,16(dst)
  stfd  f4,24(dst)
  stfd  f5,32(dst)
  stfd  f6,40(dst)
  stfd  f7,48(dst)
  stfd  f8,56(dst)
  addi  dst,dst,64

  bdnz  loop_top
  blr
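
For readers without a PowerPC assembler at hand, the loop above can be approximated in C. This is only a sketch and not the reporter's code: the name copy64_prefetch is made up for illustration, __builtin_prefetch is a GCC/Clang builtin standing in for dcbt, and the buffers are assumed to be 8-byte aligned and non-overlapping. Unlike the assembly routine, it also copies the bytes left over after the last full 64-byte block.

```c
#include <stddef.h>

/* Hypothetical C counterpart of memcpy_asmFC64; not part of MySQL or glibc.
   Assumes 8-byte aligned, non-overlapping buffers. */
void copy64_prefetch(double *dst, const double *src, size_t bytes)
{
    size_t blocks = bytes / 64;   /* whole 64-byte blocks */
    size_t tail = bytes % 64;     /* leftover bytes after the last block */
    size_t i;

    while (blocks > 0) {
#if defined(__GNUC__)
        /* Prefetch ahead only while the addresses stay inside the buffer. */
        if (blocks > 2) {
            __builtin_prefetch((const char *)src + 96);
            __builtin_prefetch((const char *)src + 128);
        }
#endif
        /* Read 64 bytes into eight 64-bit values, then store them. */
        double f1 = src[0], f2 = src[1], f3 = src[2], f4 = src[3];
        double f5 = src[4], f6 = src[5], f7 = src[6], f8 = src[7];
        dst[0] = f1; dst[1] = f2; dst[2] = f3; dst[3] = f4;
        dst[4] = f5; dst[5] = f6; dst[6] = f7; dst[7] = f8;
        src += 8;
        dst += 8;
        blocks--;
    }

    /* Copy the remaining bytes one at a time. */
    for (i = 0; i < tail; i++)
        ((unsigned char *)dst)[i] = ((const unsigned char *)src)[i];
}
```

Whether this reaches the throughput of the hand-written assembly depends on the compiler and CPU; it is meant to show the structure of the loop, not to reproduce the benchmark numbers.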

Update: I have verified the behavior on x86 Linux as well.

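On an AMD K7, bmove512 is about 20% slower than this simple STREAM_Copy:

void STREAM_Copy(double *source, double *destination, int size){
    int j;
    size = size / 8;   /* size is given in bytes; convert to a count of 8-byte doubles */
    for (j = 0; j < size; j++)   destination[j] = source[j];   /* copy source to destination */
}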

Cheers
Gunnar

The membench sources http://www.greyhound-data.com/gunnar/glibc/stream_memspeed02.tgz
include routines both for PowerPC and Intel/AMD.

These routines achieve twice the memory throughput of bmove512.
That is true for both scenarios: warm AND cold cache!

It's probably easier for you if I send you the routines as single files again.
Please get back to me if you want me to do this and if I can help to improve the performance on Linux somehow.

Cheers
Gunnar

Thank you for the problem report and the suggested solutions.

I wanted to make clear how high the performance impact is:

On an IBM POWER 5 server, the MySQL memcpy function is 8 times (!!) slower than the glibc memcpy function.

It's clear that MySQL runs measurably slower because of this.
Why does MySQL use its own memcpy function when the official memcpy function is MUCH faster?

- This bug was reported over 18 months ago.
- The issue was verified but nothing has happened.

- Benchmark results were provided to outline the performance hit of the issue.
- The issue can be fixed quickly and *very easily*!

I see that you are pushing the issue around in the bug tracker from one component to the other and you are increasing or decreasing the severity.

Wouldn't it be less work to fix it, than moving it around in the bug tracker?

Would it be too much to ask to make up your mind and to either fix it
or, if you don't care about performance, then just delete it?

Regards

This would be a nice fix.

Unfortunate that it's being lost under the clutter.

Is there an appropriate method to try and raise the critical nature of such bugs, beyond the bug tracker?

Maybe a community leader that someone can fire an e-mail off to so the right people are notified.

Hi Gunnar.  I've adopted Bug#19975 .

I've looked at lots of source code, but I still don't have a plan for fixing MySQL source code.  Speaking of bmove512 on powerpc, you refer to three alternatives.  I've been testing for a few hours, and I don't see the speedups you claimed.  It is an old bug, and we won't discount your claims, but (at least) I can't reproduce them.  (I have not yet tested on x86.)

From my simple tests on machine "powermacg5":

uname -a
Darwin powermacg5.mysql.com 7.9.0 Darwin Kernel Version 7.9.0: Wed Mar 30 20:11:17 PST 2005; root:xnu/xnu-517.12.7.obj~1/RELEASE_PPC  Power Macintosh powerpc

For data: 1,000,000 times over 100,000x512B block

- bmove512 median speed: 11s
- bcopy median speed: 11s
- "stream copy" median speed: 48s
- ppc assembly code would not assemble.  ("Unknown pseudo-op: .global" and many "Parameter syntax error")

Attached is my test file.  Please tell me if you see anything wrong, and whether you get different results with it.  Define preprocessor symbol "MAIN", and write directly to an executable, to include my test.
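
The attached test file is not reproduced in this thread. A timing test of this general kind might be sketched as follows. Everything here is an assumption for illustration: the helper name time_copy, the use of the standard clock() function for CPU time, and the function-pointer type copy_fn, which matches memcpy so that any routine with the same signature can be compared.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any copy routine with the memcpy signature can be timed. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Hypothetical helper: returns the CPU seconds spent copying len bytes
   reps times with fn. len must be greater than zero. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t len, long reps)
{
    volatile unsigned char sink = 0;   /* keeps the copies from being optimized away */
    clock_t start, end;
    long i;

    start = clock();
    for (i = 0; i < reps; i++) {
        fn(dst, src, len);
        sink ^= ((unsigned char *)dst)[(size_t)i % len];
    }
    end = clock();
    (void)sink;

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as time_copy(memcpy, dst, src, 512, 1000000) would time one million 512-byte copies. Running each candidate several times and comparing medians, as in the figures above, reduces the influence of outliers.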

You've linked and pasted several things, and I thank you.  However, the most useful thing you could contribute is a patch for our source.

Verified that on x86/glibc, bmove512() is half the speed of bcopy().  Changing that first.  Perhaps also on ppc, if I can reproduce it and make sure I'm not making it worse in any case.

Ironic, this was singled out as a sign of progress in the higher level of involvement from the community. Alas, there doesn't appear to be much progress yet. But I will try this out myself and if it is useful on x86 it will be in a Google patch. Maybe Percona wants to look at it too.

We shouldn't be in the business of re-implementing memory copying functions. If the libc-provided ones are to blame, please go and report this to the libc developers. I also find the lack of any mention of tradeoffs quite alarming; it's widely known that the libc-provided ones are targeted for somewhat generic usage. Also, MySQL's bmove512 implementation looks broken; we should probably get rid of it and use a system-provided one.
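
Using a system-provided routine in place of bmove512 could look roughly like the sketch below. It assumes that bmove512 is called with (dst, src, len) arguments on non-overlapping buffers; the actual change made in the MySQL source is not shown in this thread and may differ.

```c
#include <string.h>

/* Sketch only: route the old name to the C library's memcpy.
   Assumes a (dst, src, len) calling convention and non-overlapping buffers. */
#define bmove512(dst, src, len) ((void) memcpy((dst), (src), (len)))
```

With a definition like this, existing callers keep compiling unchanged while the copy itself is done by whatever memcpy the platform's libc provides.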

We don't have custom memory functions anymore.