MySQL Bugs: #68476: Suboptimal code in my_strnxfrm

Bug #68476	Suboptimal code in my_strnxfrm_simple()
Submitted:	24 Feb 2013 5:41	Modified:	3 Dec 2014 15:44
Reporter:	Alexey Kopytov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1, 5.5, 5.6	OS:	Any
Assigned to:	Tor Didriksen	CPU Architecture:	Any

Description:
This ancient code in my_strnxfrm_simple() is suboptimal:

  if (dest != src)
  {
    const uchar *end;
    for ( end=src+len; src < end ;  )
      *dest++= map[*src++];
  }
  else
  {
    const uchar *end;
    for ( end=dest+len; dest < end ; dest++)
      *dest= (char) map[(uchar) *dest];
  }

I can't see any difference between these branches. It's probably
remnants from _bin collations, where we don't have to do table
translation and thus the "dest == src" case is degenerate, e.g.:

static size_t my_strnxfrm_bin(CHARSET_INFO *cs __attribute__((unused)),
                              uchar *dest, size_t dstlen,
                              const uchar *src, size_t srclen)
{
  if (dest != src)
    memcpy(dest, src, min(dstlen,srclen));
  if (dstlen > srclen)
    bfill(dest + srclen, dstlen - srclen, 0);
  return dstlen;
}

But the "if()" statement in my_strnxfrm_simple() is redundant. It is
executed for each row processed by filesort and shows high branch
misprediction numbers in sysbench RO runs.

How to repeat:
Look at my_strnxfrm_simple().

Thank you for the bug report.

It depends on the input data. For long strings (100 bytes, like sysbench?)
the special handling of dst == src is actually much faster
than the other branch. For short strings the extra if/else overhead
becomes noticeable.

Tor,

Let's rewrite that code to an equivalent form for clarity:

  if (dest != src)
  {
    const uchar *end;
    for ( end=src+len; src < end ; dest++, src++ )
      *dest= map[*src];
  }
  else
  {
    const uchar *end;
    for ( end=src+len; src < end ; src++)
      *src= map[*src];
    dest= src;
  }

With this in mind, is there really any difference between the branches?

I agree that the code can be re-written, I'm just pointing out that this is much faster for long strings:
    const uchar *end;
    for (end= dst + frmlen; dst < end; ++dst)
      *dst= map[(uchar) *dst];
(and I'm assuming that this is actually the code you are executing running sysbench)

For short strings, something like this is better on my machine
    uint ix;
    for (ix= 0; ix < frmlen; ++ix)
      dst[ix] = map[src[ix]];

Bug 68476 patch for 5.7 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug68476.patch (application/octet-stream, text), 1.14 KiB.

Patch author Alexey Kopytov

Posted by developer:
 
Benchmark results (see unit test)

[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcDst/0 (147 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcDst/1 (239 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcDst/2 (1173 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcDst/3 (9310 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcDst/0 (150 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcDst/1 (231 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcDst/2 (753 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcDst/3 (5908 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcDst/0 (145 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcDst/1 (233 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcDst/2 (1165 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcDst/3 (9420 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcDst/0 (151 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcDst/1 (220 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcDst/2 (767 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcDst/3 (5948 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcSrc/0 (146 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcSrc/1 (225 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcSrc/2 (992 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalSrcSrc/3 (7384 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcSrc/0 (146 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcSrc/1 (221 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcSrc/2 (732 ms)
[       OK ] Strnxfrm/StrnxfrmTest.OriginalUnrolledSrcSrc/3 (6043 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcSrc/0 (145 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcSrc/1 (247 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcSrc/2 (1170 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedSrcSrc/3 (9286 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcSrc/0 (151 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcSrc/1 (235 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcSrc/2 (757 ms)
[       OK ] Strnxfrm/StrnxfrmTest.ModifiedUnrolledSrcSrc/3 (5900 ms)

Bug#16403708 SUBOPTIMAL CODE IN MY_STRNXFRM_SIMPLE()
    
    Based on a contribution from Alexey Kopytov
    
    Add benchmarking tests showing that my_strnxfrm_simple is not suboptimal
    compared to the proposed patch.
    
    However: if we take the proposed patch, and do some loop unrolling,
    we get improved results for both (src == dst) and (src != dst) transformations.

Noted in 5.7.6 changelog.

The code in my_strnxfrm_simple() was suboptimal and was improved.
Thanks to Alexey Kopytov for the patch.