Bug #49491 Much overhead for MD5() and SHA1() on short strings
Submitted: 6 Dec 2009 23:14 Modified: 12 Mar 2010 16:35
Reporter: Jan Steemann (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: General Severity:S5 (Performance)
Version:5.1.38 and above OS:Any
Assigned to: Davi Arnaut CPU Architecture:Any
Tags: Contribution, md5, performance, sha1

[6 Dec 2009 23:14] Jan Steemann
Description:
The calculation of MD5 and SHA1 hash values using the built-in MySQL functions does not seem to be as efficient as it could be.

There seem to be two factors that determine the performance of the hash generation:
- computation of the actual hash value (binary value)
- conversion of the binary value into a string field

The run time of the hash computation depends on the length of the input string whereas the overhead of the binary-to-string conversion can be considered as a fixed constant as it will always operate on hash values of 16 (MD5) or 20 (SHA1) bytes length.
The impact of the binary-to-string conversion will become more visible with shorter input strings than with long input strings. For short input strings it seems that more time is spent in the binary-to-string conversion than in the actual hash computation part.

The reason seems to be that the code that transforms the calculated binary hash value back into a string field result is very generic and therefore slower than necessary.

To convert the result into the string field, the code in sql/item_strfunc.cc currently does the following (for MD5, the implementation for SHA1 is similar):

    sprintf((char *) str->ptr(),
            "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
            digest[0], digest[1], digest[2], digest[3],
            digest[4], digest[5], digest[6], digest[7],
            digest[8], digest[9], digest[10], digest[11],
            digest[12], digest[13], digest[14], digest[15]);
    str->length((uint) 32);
    return str;

The conversion can probably be sped up by not using the very generic sprintf but a more specialized algorithm.

I have done some tests on Linux 2.6.26-1-amd64 with the following replacement: 
    static char hexmap[]= { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };
    int i;
    char *p;

    p=(char *) str->ptr();
    for (i=0;i<=15;i++)
    {
      (*p++)=hexmap[digest[i] >> 4];
      (*p++)=hexmap[digest[i] & 15];
    }
    str->length((uint) 32);
    return str;

I am attached this patch to this bug report.

The savings that can be achieved by moving from sprintf to something more specialized may vary from compiler to compiler and from compile option to option. I can't really tell what works best in all environments.
However, the fixed overhead of the binary-to-string conversion should be minimized as much as possible as at least on some systems massive speedups can be achieved by this.

Example:
Original MySQL 5.1.38:
mysql> select count(*) from (select md5(firstname) from users) sub limit 1\G
*************************** 1. row ***************************
count(*): 840245
1 row in set (3.13 sec)

Patched MySQL 5.1.38:
mysql> select count(*) from (select md5(firstname) from users) sub limit 1\G
*************************** 1. row ***************************
count(*): 840245
1 row in set (1.36 sec)

Original MySQL 5.1.38:
mysql> select count(*) from (select sha1(firstname) from users) sub limit 1\G
*************************** 1. row ***************************
count(*): 840245
1 row in set (3.74 sec)

Patched MySQL 5.1.38:
mysql> select count(*) from (select sha1(firstname) from users) sub limit 1\G
*************************** 1. row ***************************
count(*): 840245
1 row in set (1.70 sec)

How to repeat:
Just generate a lot of MD5 or SHA1 hashes from some of the system table values and note the query times:

MD5:

mysql> select * from (select md5(t.name),md5(t.description),md5(t.example),md5(t.url) from mysql.help_topic t,mysql.help_keyword k) sub limit 1\G
*************************** 1. row ***************************
       md5(t.name): ce31e2a082d17e038fcc6e3006166653
md5(t.description): bdcda0595dc1c938038b13d51df0a38d
    md5(t.example): 3279bf9c13baf8b749e29aa71cb2ccd2
        md5(t.url): ccb705dbce9c9827d93fb0d4b6c0b378
1 row in set (2.99 sec)

SHA1:

mysql> select * from (select sha1(t.name),sha1(t.description),sha1(t.example),sha1(t.url) from mysql.help_topic t,mysql.help_keyword k) sub limit 1\G
*************************** 1. row ***************************
       sha1(t.name): 04e66352aa8f9c4c5f26b71bf380973ada994760
sha1(t.description): c040e24b56fe6023617ab884d8715da7efb24de3
    sha1(t.example): a52b8b0b11e1d1d82839d0de1805192279ca0b1b
        sha1(t.url): ef7507b561798b9b877232fcb64719cee9eef057
1 row in set (4.84 sec)

Then apply the attached patch and run the same queries again:

mysql> select * from (select md5(t.name),md5(t.description),md5(t.example),md5(t.url) from mysql.help_topic t,mysql.help_keyword k) sub limit 1\G
*************************** 1. row ***************************
       md5(t.name): ce31e2a082d17e038fcc6e3006166653
md5(t.description): bdcda0595dc1c938038b13d51df0a38d
    md5(t.example): 3279bf9c13baf8b749e29aa71cb2ccd2
        md5(t.url): ccb705dbce9c9827d93fb0d4b6c0b378
1 row in set (1.27 sec)

mysql> select * from (select sha1(t.name),sha1(t.description),sha1(t.example),sha1(t.url) from mysql.help_topic t,mysql.help_keyword k) sub limit 1\G
*************************** 1. row ***************************
       sha1(t.name): 04e66352aa8f9c4c5f26b71bf380973ada994760
sha1(t.description): c040e24b56fe6023617ab884d8715da7efb24de3
    sha1(t.example): a52b8b0b11e1d1d82839d0de1805192279ca0b1b
        sha1(t.url): ef7507b561798b9b877232fcb64719cee9eef057
1 row in set (2.77 sec)

The query times for the patched version are a lot better than for the original version. 

Suggested fix:
Will attach a patch for MySQL 5.1.38 to this bug report with a suggestion how to modify the binary-to-string conversion in 5.1.38.
[6 Dec 2009 23:15] Jan Steemann
Patch for sql/item_strfunc.cc for MySQL 5.1.38

Attachment: md5_sha1_speedup.diff (application/octet-stream, text), 2.08 KiB.

[7 Dec 2009 20:19] Lenz Grimmer
Hi Jan, thank you very much for your contribution! We appreciate it.

In order for us to be able to accept and incorporate your contribution, I have to ask you for one additional favor, which you only have to perform once for all future contributions (including any other Sun-governed Open Source project).

Can you please fill out and send us a copy of the Sun Contributor Agreement (SCA), as outlined on the following Wiki page?

  http://forge.mysql.com/wiki/Contributing_Code

Thanks in advance! Please let us know, if you have any questions or concerns about this.
[10 Dec 2009 18:09] Jan Steemann
I have filled out the SCA and sent it to Sun via fax on Dec 8, 2009.
Now waiting to be listed in the SCA signatories list.
[25 Dec 2009 7:57] Sveta Smirnova
Thank you for the report.

Verified with following results.

Linux 64 bit 5.1 without patch:

md5 - 4 sec
sha1 - 8 sec

Linux 64 bit 5.1 with patch:

md5 - 3 sec
sha1 - 7 sec

MacOS 32-bit without patch:

md5 - 10 sec
sha1 - 20 sec

MacOS 32-bit with patch:

md5 - 7 sec
sha1 - 13 sec
[26 Jan 2010 14:54] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98208

3334 Davi Arnaut	2010-01-26
      Bug#49491: Much overhead for MD5() and SHA1() on short strings
      
      MySQL's hash functions MD5 and SHA relied on the somewhat slow 
      sprintf function to convert the digests to hex representations.
      This patch replaces the sprintf with a specific and inline hex
      conversion function.
      
      Patch contributed by Jan Steemann.
     @ sql/item_strfunc.cc
        Add a hex conversion function.
[26 Jan 2010 17:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98225

3334 Davi Arnaut	2010-01-26
      Bug#49491: Much overhead for MD5() and SHA1() on short strings
      
      MySQL's hash functions MD5 and SHA relied on the somewhat slow 
      sprintf function to convert the digests to hex representations.
      This patch replaces the sprintf with a specific and inline hex
      conversion function.
      
      Patch contributed by Jan Steemann.
     @ sql/item_strfunc.cc
        Add a hex conversion function.
[26 Jan 2010 17:08] Davi Arnaut
Queued to 5.1-bugteam
[4 Feb 2010 10:19] Bugs System
Pushed into 5.1.44 (revid:joro@sun.com-20100204101444-2j32mhqroo0iiio6) (version source revid:davi.arnaut@sun.com-20100126170519-bgtcp6eg8rp3tgst) (merge vers: 5.1.43) (pib:16)
[5 Feb 2010 11:48] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100204063540-9czpdmpixi3iw2yb) (version source revid:alik@sun.com-20100130212638-ai8t5v3u6647p6d2) (pib:16)
[5 Feb 2010 11:54] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100205113942-oqovjy0eoqbarn7i) (version source revid:alik@sun.com-20100204064210-ljwanqvrjs83s1gq) (merge vers: 6.0.14-alpha) (pib:16)
[5 Feb 2010 12:00] Bugs System
Pushed into 5.5.2-m2 (revid:alik@sun.com-20100203172258-1n5dsotny40yufxw) (version source revid:alik@sun.com-20100130191336-i53i9wx67n81ridm) (merge vers: 5.5.2-m2) (pib:16)
[10 Feb 2010 19:23] Paul DuBois
Noted in 5.1.44, 5.5.2, 6.0.14 changelogs.

The MD5() and SHA1() functions had excessive overhead for short
strings. 

Setting report to Need Merge pending push to Celosia.
[12 Mar 2010 14:14] Bugs System
Pushed into 5.1.44-ndb-7.0.14 (revid:jonas@mysql.com-20100312135944-t0z8s1da2orvl66x) (version source revid:jonas@mysql.com-20100312115609-woou0te4a6s4ae9y) (merge vers: 5.1.44-ndb-7.0.14) (pib:16)
[12 Mar 2010 14:30] Bugs System
Pushed into 5.1.44-ndb-6.2.19 (revid:jonas@mysql.com-20100312134846-tuqhd9w3tv4xgl3d) (version source revid:jonas@mysql.com-20100312060623-mx6407w2vx76h3by) (merge vers: 5.1.44-ndb-6.2.19) (pib:16)
[12 Mar 2010 14:46] Bugs System
Pushed into 5.1.44-ndb-6.3.33 (revid:jonas@mysql.com-20100312135724-xcw8vw2lu3mijrhn) (version source revid:jonas@mysql.com-20100312103652-snkltsd197l7q2yg) (merge vers: 5.1.44-ndb-6.3.33) (pib:16)
[12 Mar 2010 16:35] Paul DuBois
Fixed in earlier 5.1.x, 5.5.x.
[29 Jan 2011 6:46] Linhai Song
Description:

   There is a similar code fragment in sql/table.cc

   /*
  calculate md5 of query

  SYNOPSIS
    TABLE_LIST::calc_md5()
    buffer	buffer for md5 writing
*/

void  TABLE_LIST::calc_md5(char *buffer)
{
  uchar digest[16];
  MY_MD5_HASH(digest, (uchar *) select_stmt.str, select_stmt.length);
  sprintf((char *) buffer,
	    "%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
	    digest[0], digest[1], digest[2], digest[3],
	    digest[4], digest[5], digest[6], digest[7],
	    digest[8], digest[9], digest[10], digest[11],
	    digest[12], digest[13], digest[14], digest[15]);
}

I think this code fragment has similar problem. Should it also be patched?

thanks a lot
[11 Jun 2012 16:34] Paul DuBois
Noted in 5.6.6 changelog.