Bug #7414 mysqldumpslow improvements: inverse grep, rows examined, abstractions IN, VALUES
Submitted: 18 Dec 2004 20:22 Modified: 8 Mar 2009 8:57
Reporter: Nils Goroll Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: mysqldump Command-line Client Severity:S4 (Feature request)
Version:4.0, still present in 4.1.7, 5.0.45,5.1.21 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: Contribution

[18 Dec 2004 20:22] Nils Goroll
Description:
mysqldumpslow:

It would be very handy to get reports for queries which dont match
a pattern rather than mathing another, for instance to not report about
slow queries from mysqldump.

I added a inverse grep option -G to allow for easy filtering like in
this example:

mysqldumpslow -G=SQL_NO_CACHE

How to repeat:
n/a

Suggested fix:
--- /opt/mysql/bin/mysqldumpslow        2004-05-14 04:04:49.000000000 +0200
+++ /opt/mysql-standard-4.0.13-sun-solaris2.9-sparc-64bit/bin/mysqldumpslow     2004-07-20 09:43:27.958199000 +0200
@@ -25,6 +25,7 @@
     'a!',      # don't abstract all numbers to N and strings to 'S'
     'n=i',     # abstract numbers with at least n digits within names
     'g=s',     # grep: only consider stmts that include this string
+    'G=s',     # inverse grep: only consider stmts that DO NOT include this string
     'h=s',     # hostname of db server for *-slow.log filename (can be wildcard)
     'i=s',     # name of server instance (if using mysql.server startup script)
     'l!',      # don't subtract lock time from total time
@@ -92,6 +93,7 @@
     s/\s*;\s*(#\s*)?$//;       # remove trailing semicolon(+newline-hash)
 
     next if $opt{g} and !m/$opt{g}/io;
+    next if $opt{G} and m/$opt{G}/io;
 
     unless ($opt{a}) {
        s/\b\d+\b/N/g;
[29 Sep 2007 19:24] Valeriy Kravchuk
Thank you for a reasonable feature request.
[17 Oct 2007 15:37] Nils Goroll
I have added more improvements to mysqldumpslow:

- By default, abstract IN (something) to "IN (...)"
  and VALUES (something) to "VALUES (...)"
  to group more queries into the same mysqldumpslow line

  old behaviour available via command line switches

      'ain!',    # don't abstract IN (...)
      'avalues!',        # don't abstract VALUES (...)

- Also parse Rows_examined, print rows_examined in total and
  average and enable sorting for rows_examined
  (useful for getting a hold of queries doing full table scans)

  - sort order 're' sorts by rows examined (total)
  - sort order 'are' sorts by average rows examined per slow query

- Amend usage help

Here's a diff against the 5.0.45 code:

--- mysqldumpslow.dist  Wed Jul  4 17:55:23 2007
+++ mysqldumpslow       Wed Oct 17 17:33:23 2007
@@ -4,6 +4,9 @@
 # Original version by Tim Bunce, sometime in 2000.
 # Further changes by Tim Bunce, 8th March 2001.
 # Handling of strings with \ and double '' by Monty 11 Aug 2001.
+# inverse grep (G), rows_examined processing, IN and
+#  VALUES abstraction by Nils Goroll, 2006-2007
+#  see http://bugs.mysql.com/bug.php?id=7414

 use strict;
 use Getopt::Long;
@@ -24,8 +27,11 @@
     'r!',      # reverse the sort order (largest last instead of first)
     't=i',     # just show the top n queries
     'a!',      # don't abstract all numbers to N and strings to 'S'
+    'ain!',    # don't abstract IN (...)
+    'avalues!',        # don't abstract VALUES (...)
     'n=i',     # abstract numbers with at least n digits within names
     'g=s',     # grep: only consider stmts that include this string
+    'G=s',      # inverse grep: only consider stmts that DO NOT include this string
     'h=s',     # hostname of db server for *-slow.log filename (can be wildcard)
     'i=s',     # name of server instance (if using mysql.server startup script)
     'l!',      # don't subtract lock time from total time
@@ -83,8 +89,9 @@
     s/^#? Time: \d{6}\s+\d+:\d+:\d+.*\n//;
     my ($user,$host) = s/^#? User\@Host:\s+(\S+)\s+\@\s+(\S+).*\n// ? ($1,$2) : ('','');

-    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
-    my ($t, $l, $r) = ($1, $2, $3);
+    # Query_time: 3  Lock_time: 0  Rows_sent: 3  Rows_examined: 532233
+    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+)  Rows_examined: (\d+).*\n//;
+    my ($t, $l, $r, $re) = ($1, $2, $3, $4);
     $t -= $l unless $opt{l};

     # remove fluff that mysqld writes to log when it (re)starts:
@@ -100,7 +107,19 @@
     s/\s*;\s*(#\s*)?$//;       # remove trailing semicolon(+newline-hash)

     next if $opt{g} and !m/$opt{g}/io;
+    next if $opt{G} and m/$opt{G}/io;

+    unless ($opt{ain}) {
+       s/IN\s*\([^\)]*\)/IN (...)/gi;
+    }
+
+    unless ($opt{avalues}) {
+       # this RE is not optimal (too greedy - eats everything after VALUES (),()... which ends with )
+       # but the correct version of it triggers a perl bug (MCS: see CTS#58124)
+       ## correct code triggering bug: s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
+       s/VALUES\s*\(.+\)/VALUES (...)/gi;
+    }
+
     unless ($opt{a}) {
        s/\b\d+\b/N/g;
        s/\b0x[0-9A-Fa-f]+\b/N/g;
@@ -112,6 +131,9 @@
         s/"[^"]+"/"S"/g;
        # -n=8: turn log_20001231 into log_NNNNNNNN
        s/([a-z_]+)(\d{$opt{n},})/$1.('N' x length($2))/ieg if $opt{n};
+       ## slink@mcs.de: we can also abstract any duplicate whitespace - there is no
+       ## string or anywhing we would destroy by doing so
+       s/\s+/ /g;
        # abbreviate massive "in (...)" statements and similar
        s!(([NS],){100,})!sprintf("$2,{repeated %d times}",length($1)/2)!eg;
     }
@@ -121,6 +143,7 @@
     $s->{t} += $t;
     $s->{l} += $l;
     $s->{r} += $r;
+    $s->{re} += $re;
     $s->{users}->{$user}++ if $user;
     $s->{hosts}->{$host}++ if $host;

@@ -129,10 +152,11 @@

 foreach (keys %stmt) {
     my $v = $stmt{$_} || die;
-    my ($c, $t, $l, $r) = @{ $v }{qw(c t l r)};
+    my ($c, $t, $l, $r, $re) = @{ $v }{qw(c t l r re)};
     $v->{at} = $t / $c;
     $v->{al} = $l / $c;
     $v->{ar} = $r / $c;
+    $v->{are} = $re / $c;
 }

 my @sorted = sort { $stmt{$b}->{$opt{s}} <=> $stmt{$a}->{$opt{s}} } keys %stmt;
@@ -141,13 +165,13 @@

 foreach (@sorted) {
     my $v = $stmt{$_} || die;
-    my ($c, $t,$at, $l,$al, $r,$ar) = @{ $v }{qw(c t at l al r ar)};
+    my ($c, $t,$at, $l,$al, $r,$ar,$re,$are) = @{ $v }{qw(c t at l al r ar re are)};
     my @users = keys %{$v->{users}};
     my $user  = (@users==1) ? $users[0] : sprintf "%dusers",scalar @users;
     my @hosts = keys %{$v->{hosts}};
     my $host  = (@hosts==1) ? $hosts[0] : sprintf "%dhosts",scalar @hosts;
-    printf "Count: %d  Time=%.2fs (%ds)  Lock=%.2fs (%ds)  Rows=%.1f (%d), $user\@$host\n%s\n\n",
-           $c, $at,$t, $al,$l, $ar,$r, $_;
+    printf "Count: %d  Time=%.2fs (%ds)  Lock=%.2fs (%ds)  Rows=%.1f (%d)  Rows examined=%.1f (%d), $user\@$host\n%s\n\n",
+           $c, $at,$t, $al,$l, $ar,$r, $are, $re, $_;
 }

 sub usage {
@@ -163,12 +187,25 @@

   -v           verbose
   -d           debug
-  -s ORDER     what to sort by (t, at, l, al, r, ar etc), 'at' is default
+  -s ORDER     what to sort by, 'at' is default
+
+               absolute values:
+              t  : Query time
+              l  : Lock time
+              r  : Rows sent
+              re : Rows examined
+
+              for average values (average per query), prepend an a:
+              at, al, ar, are
+
   -r           reverse the sort order (largest last instead of first)
   -t NUM       just show the top n queries
   -a           don't abstract all numbers to N and strings to 'S'
+  -ain        don't abstract IN (something) to 'IN (...)'
+  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
   -n NUM       abstract numbers with at least n digits within names
   -g PATTERN   grep: only consider stmts that include this string
+  -G PATTERN   inverse grep: only consider stmts that do not include this string
   -h HOSTNAME  hostname of db server for *-slow.log filename (can be wildcard),
                default is '*', i.e. match all
   -i NAME      name of server instance (if using mysql.server startup script)
[17 Oct 2007 15:38] Nils Goroll
my current version of mysqldumpslow

Attachment: mysqldumpslow (application/octet-stream, text), 7.67 KiB.

[17 Oct 2007 16:10] Nils Goroll
Being at it, I have looked though mysqldump open bugs and
either left comments that I did fix things or integrated
fixes.

fixed:
Bug #20454 Usage bug in mysqldumpslow
Bug #25855 mysqldumpslow: needs better --help output
           (still, more could be done)
Bug #18260  mysqldumpslow doesn't properly handle multi-row inserts/updates

I also integrated the patch for this bug:

Bug #21359      mysqldumpslow: show the database name in the output
                (slightly modified parsing of use/connect)

See the latest attachment for mysqldump code
[17 Oct 2007 16:14] Nils Goroll
latest version with more fixes

Attachment: mysqldumpslow (application/octet-stream, text), 8.04 KiB.

[17 Oct 2007 16:21] Nils Goroll
Patch of the latest version against 5.0.45

Attachment: mysqldumpslow_patch_bug7414 (application/octet-stream, text), 5.89 KiB.

[18 Oct 2007 15:20] Chad MILLER
Hi Nils.

Thank you for your patches.  For our docs team and for justification to apply the patch, do you mind enumerating the new features, with a short example of its usage?
[22 Oct 2007 8:05] Nils Goroll
While working on the documentation, I spotted a glitch in my previous
version of mysqdumpslow:

    if(/(?:use|connect)\s+(\w+)/i) {
        $db=$1;
    }

... should not contain a next statement, as it did
[22 Oct 2007 8:05] Nils Goroll
Fixed glitch: no "next" when parsing use/connect

Attachment: mysqldumpslow (application/octet-stream, text), 7.98 KiB.

[22 Oct 2007 11:12] Nils Goroll
Hi Chad,
hi all,

sorry for the frequent updates, I really should have worked silently
for longer rather than bothering you with updates that often.

I have added a couple of more fixes and yet another command line
switch.

Minor changes:

- fix: debug and verbose options were broken
- fix: removal of MySQL startup message was not working properly
- cleanup: should not use literal tabs in perl regular expressions

I will attach the documentation of major changes and the current
version
[22 Oct 2007 11:14] Nils Goroll
Dokumentation of changes of latest version

Attachment: mysqldumpslow_changes_documentation.txt (x-application/octet-stream, text), 10.80 KiB.

[22 Oct 2007 11:14] Nils Goroll
latest version with more fixes

Attachment: mysqldumpslow (application/octet-stream, text), 8.80 KiB.

[9 Jan 2008 16:49] Chad MILLER
Nils' contrib in patch form (estimated)

Attachment: m.diff (text/x-diff), 9.57 KiB.

[9 Jan 2008 16:52] Nils Goroll
Chad, I noticed

@@ -1,9 +1,13 @@
-#!/usr/bin/perl
+#!/usr/local/bin/perl

in the patch. Sorry, I am using /usr/local/bin/perl, but we probably should not change the path for mysql distributions...
[9 Jan 2008 17:04] Chad MILLER
A diff would have been better.  I can't tell what we removed and what you added.  :(

Review:

-    my @chunks = split /^\/.*Version.*started with[\000-\377]*?Time.*Id.*Command.*Argument.*\n/m;
+    my @chunks = split /\S+\s+Version.*started\s+with.*\n.*\nTime\s+Id\s+Command\s+Argument\s*\n/m;

"with" + any character.  Does "." match a newline?  Carriage return?   Is "." a subset of all characters?  I think so.  Is this safe to change? 

-    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
+    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+)  Rows_examined: (\d+).*\n//;

I wonder if this is backwards compatible.  When was "Rows_examined" added?

+    unless ($opt{ain}) {
+   s/IN\s*\([^\)]*\)/IN (...)/gi;
+    }

I don't think a regex can handle that.  What if the text was  IN('x', 'y (not z)', 'z') ?

+    unless ($opt{avalues}) {
+   # this RE is not optimal (too greedy - eats everything after VALUES (),()... which ends with )
+   # but the correct version of it triggers a perl bug (MCS: see CTS#58124)
+   ## correct code triggering bug: s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
+   s/VALUES\s*\(.+\)/VALUES (...)/gi;
+    }

Why not minimally-matching?  s/VALUES\s*\(.+?\)/VALUES (...)/gi;  I think the Perl regex engine supports that.

+  -ain         don't abstract IN (something) to 'IN (...)'
+  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
+  -awhitespace don't normalise arbitrary whitespace (preserve original formatting)
+               note: leading whitespace is normalised always

I dislike changing behavior unless some new option is specified.  Please make that "behave as usual unless you know of these new flags".
[9 Jan 2008 18:18] Nils Goroll
Chad,

thanks for reviewing my suggestions.

> -    my @chunks = split /^\/.*Version.*started with[\000-\377]*?Time.*Id.*Command.*Argument.*\n/m;
> +    my @chunks = split /\S+\s+Version.*started\s+with.*\n.*\nTime\s+Id\s+Command\s+Argument\s*\n/m;
>
> "with" + any character.  Does "." match a newline?  Carriage return?
> Is "." a subset of all characters?  I think so.  Is this safe to
> change?

First of all, the reason why I did change this is that the original
code did not work on current mysqld outputs any more. Unfortunetely I
have not exactly documented the cases for which the old version was
broken.

I changed

  with[\000-\377]*

into

  with.*\n.*\n

because I thought it was clearer to explicitly read one line of
something that cathing anything plus newlines with that octal
character class.

My change would stop working if the startup message was changed to be
more then three lines long, but almost any other change to the startup
message would make it necessary to adjust mysqldumpslow anyway.

> -    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
> +    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+) Rows_examined: (\d+).*\n//;
>
> I wonder if this is backwards compatible.  When was "Rows_examined"
> added?

I guess it will be easier for you than for me to find out, but if we
included the changes to mysqldumpslow only for MySQL5 and above, we
should not need to care about being backwards compatible (and as 4.1
is EOL, we should be safe as this already worked with 4.1)

> +    unless ($opt{ain}) {
> +   s/IN\s*\([^\)]*\)/IN (...)/gi;
> +    }
>
>
> I don't think a regex can handle that.  What if the text was  IN('x',
> 'y (not z)', 'z') ?

You are right, other than parsing the SQL code to a large extent, we
cant handle these things.

The point is, IMHO, that noone using mysqldumpslow will care, because
it assists you in analysing the slow.log, even if its query
abstraction is not perfect. We are trying to find queries in the slow
log which are "the same", which can become a very difficult task. But
if we can happily live with something which is "good enough", why
should we make code more complex then necessary?

> +    unless ($opt{avalues}) {
> +   # this RE is not optimal (too greedy - eats everything after VALUES
> (),()... which ends with )
> +   # but the correct version of it triggers a perl bug (MCS: see
> CTS#58124)
> +   ## correct code triggering bug:
> s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
> +   s/VALUES\s*\(.+\)/VALUES (...)/gi;
> +    }
>
>
> Why not minimally-matching?  s/VALUES\s*\(.+?\)/VALUES (...)/gi;  I
> think the Perl regex engine supports that.

Because this code would not abtract VALUES (a,b,c),(d,e,f),(g,h,i)...

Originally, I used the code which is commented out:

        s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;

which did work but would trigger a bug in some cases.

You are suggesting to match minimally, so the statement would look
like this:

        s/VALUES([,\s]*\(.*?\)){1,}/VALUES (...)/gi;

but I think chances are that it'll hit the same perl bug (which I have
only trapped upon after having successfully used the code stated above
for a month or so on daily slow.logs, so I would rather be
conservative about this).

> +  -ain         don't abstract IN (something) to 'IN (...)'
> +  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
> +  -awhitespace don't normalise arbitrary whitespace (preserve original
>                 formatting)
> +               note: leading whitespace is normalised always
>
> I dislike changing behavior unless some new option is specified.
> Please make that "behave as usual unless you know of these new flags".

I understand your point, and if this was any other mysql component, I
would be extremely conservative about changing defaults (I know, for
instance, how much trouble the change of the precedence of the comma
operator for JOINs has caused).

But, again, this is not a mysql core component - mysqldumpslow is to
help people analyse the slow.log. I have realised that the old code
would not abstract enough to be helpful for real life problems, and I
have tried to find some mechanisms to abstract more without
introducing the complexity of real parsing.

The new defaults I suggest are a recommendation which I believe will
be what most users will find most helpful, but I might be wrong.

So if you think that we should better not change the defaults, that's
fine for me.
[7 Mar 2009 20:57] Masood Mortazavi
Nils -

Since you're interested in contribution to MySQL, I thought I should let you know that MySQL is now using a new contributor agreement (i.e. the "Sun Contributor Agreement"). 

Under the Sun Contributor Agreement (SCA), the contributor retains copyrights while also granting those same rights to Sun as the project sponsor. It supersedes the previously used MySQL Contributor License Agreement (CLA).

You can get a copy of the SCA document here: http://www.sun.com/software/opensource/sca.pdf 

Once you sign and send your copy of the SCA document as described here: 
  http://forge.mysql.com/wiki/Sun_Contributor_Agreement , 
your signatory status will be reflected, along with your interest in contributing to MySQL, here: 
   https://sca.dev.java.net/CA_signatories.htm 

Having a clear SCA status will enable and facilitate the review and interaction regarding your code contributions to MySQL (as well as to any other Sun-managed Open-Source project).
[8 Mar 2009 8:57] Nils Goroll
Hi Masood,

as I am also an OpenSolaris Contributor, I Habe already signed the SCA.
It's filed as SCA# OS0070. 

Cheers, Nils
[14 Apr 2009 6:51] Masood Mortazavi
Nils --  Have you checked to see if your patch works on 6.0?