MySQL Bugs: #7414: mysqldumpslow improvements: inverse grep, rows examined, abstractions IN, VALUES

Bug #7414	mysqldumpslow improvements: inverse grep, rows examined, abstractions IN, VALUES
Submitted:	18 Dec 2004 20:22	Modified:	8 Mar 2009 8:57
Reporter:	Nils Goroll	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: mysqldump Command-line Client	Severity:	S4 (Feature request)
Version:	4.0, still present in 4.1.7, 5.0.45,5.1.21	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	Contribution

Description:
mysqldumpslow:

It would be very handy to get reports for queries which dont match
a pattern rather than mathing another, for instance to not report about
slow queries from mysqldump.

I added a inverse grep option -G to allow for easy filtering like in
this example:

mysqldumpslow -G=SQL_NO_CACHE

How to repeat:
n/a

Suggested fix:
--- /opt/mysql/bin/mysqldumpslow        2004-05-14 04:04:49.000000000 +0200
+++ /opt/mysql-standard-4.0.13-sun-solaris2.9-sparc-64bit/bin/mysqldumpslow     2004-07-20 09:43:27.958199000 +0200
@@ -25,6 +25,7 @@
     'a!',      # don't abstract all numbers to N and strings to 'S'
     'n=i',     # abstract numbers with at least n digits within names
     'g=s',     # grep: only consider stmts that include this string
+    'G=s',     # inverse grep: only consider stmts that DO NOT include this string
     'h=s',     # hostname of db server for *-slow.log filename (can be wildcard)
     'i=s',     # name of server instance (if using mysql.server startup script)
     'l!',      # don't subtract lock time from total time
@@ -92,6 +93,7 @@
     s/\s*;\s*(#\s*)?$//;       # remove trailing semicolon(+newline-hash)
 
     next if $opt{g} and !m/$opt{g}/io;
+    next if $opt{G} and m/$opt{G}/io;
 
     unless ($opt{a}) {
        s/\b\d+\b/N/g;

Thank you for a reasonable feature request.

I have added more improvements to mysqldumpslow:

- By default, abstract IN (something) to "IN (...)"
  and VALUES (something) to "VALUES (...)"
  to group more queries into the same mysqldumpslow line

  old behaviour available via command line switches

      'ain!',    # don't abstract IN (...)
      'avalues!',        # don't abstract VALUES (...)

- Also parse Rows_examined, print rows_examined in total and
  average and enable sorting for rows_examined
  (useful for getting a hold of queries doing full table scans)

  - sort order 're' sorts by rows examined (total)
  - sort order 'are' sorts by average rows examined per slow query

- Amend usage help

Here's a diff against the 5.0.45 code:

--- mysqldumpslow.dist  Wed Jul  4 17:55:23 2007
+++ mysqldumpslow       Wed Oct 17 17:33:23 2007
@@ -4,6 +4,9 @@
 # Original version by Tim Bunce, sometime in 2000.
 # Further changes by Tim Bunce, 8th March 2001.
 # Handling of strings with \ and double '' by Monty 11 Aug 2001.
+# inverse grep (G), rows_examined processing, IN and
+#  VALUES abstraction by Nils Goroll, 2006-2007
+#  see http://bugs.mysql.com/bug.php?id=7414

 use strict;
 use Getopt::Long;
@@ -24,8 +27,11 @@
     'r!',      # reverse the sort order (largest last instead of first)
     't=i',     # just show the top n queries
     'a!',      # don't abstract all numbers to N and strings to 'S'
+    'ain!',    # don't abstract IN (...)
+    'avalues!',        # don't abstract VALUES (...)
     'n=i',     # abstract numbers with at least n digits within names
     'g=s',     # grep: only consider stmts that include this string
+    'G=s',      # inverse grep: only consider stmts that DO NOT include this string
     'h=s',     # hostname of db server for *-slow.log filename (can be wildcard)
     'i=s',     # name of server instance (if using mysql.server startup script)
     'l!',      # don't subtract lock time from total time
@@ -83,8 +89,9 @@
     s/^#? Time: \d{6}\s+\d+:\d+:\d+.*\n//;
     my ($user,$host) = s/^#? User\@Host:\s+(\S+)\s+\@\s+(\S+).*\n// ? ($1,$2) : ('','');

-    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
-    my ($t, $l, $r) = ($1, $2, $3);
+    # Query_time: 3  Lock_time: 0  Rows_sent: 3  Rows_examined: 532233
+    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+)  Rows_examined: (\d+).*\n//;
+    my ($t, $l, $r, $re) = ($1, $2, $3, $4);
     $t -= $l unless $opt{l};

     # remove fluff that mysqld writes to log when it (re)starts:
@@ -100,7 +107,19 @@
     s/\s*;\s*(#\s*)?$//;       # remove trailing semicolon(+newline-hash)

     next if $opt{g} and !m/$opt{g}/io;
+    next if $opt{G} and m/$opt{G}/io;

+    unless ($opt{ain}) {
+       s/IN\s*\([^\)]*\)/IN (...)/gi;
+    }
+
+    unless ($opt{avalues}) {
+       # this RE is not optimal (too greedy - eats everything after VALUES (),()... which ends with )
+       # but the correct version of it triggers a perl bug (MCS: see CTS#58124)
+       ## correct code triggering bug: s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
+       s/VALUES\s*\(.+\)/VALUES (...)/gi;
+    }
+
     unless ($opt{a}) {
        s/\b\d+\b/N/g;
        s/\b0x[0-9A-Fa-f]+\b/N/g;
@@ -112,6 +131,9 @@
         s/"[^"]+"/"S"/g;
        # -n=8: turn log_20001231 into log_NNNNNNNN
        s/([a-z_]+)(\d{$opt{n},})/$1.('N' x length($2))/ieg if $opt{n};
+       ## slink@mcs.de: we can also abstract any duplicate whitespace - there is no
+       ## string or anywhing we would destroy by doing so
+       s/\s+/ /g;
        # abbreviate massive "in (...)" statements and similar
        s!(([NS],){100,})!sprintf("$2,{repeated %d times}",length($1)/2)!eg;
     }
@@ -121,6 +143,7 @@
     $s->{t} += $t;
     $s->{l} += $l;
     $s->{r} += $r;
+    $s->{re} += $re;
     $s->{users}->{$user}++ if $user;
     $s->{hosts}->{$host}++ if $host;

@@ -129,10 +152,11 @@

 foreach (keys %stmt) {
     my $v = $stmt{$_} || die;
-    my ($c, $t, $l, $r) = @{ $v }{qw(c t l r)};
+    my ($c, $t, $l, $r, $re) = @{ $v }{qw(c t l r re)};
     $v->{at} = $t / $c;
     $v->{al} = $l / $c;
     $v->{ar} = $r / $c;
+    $v->{are} = $re / $c;
 }

 my @sorted = sort { $stmt{$b}->{$opt{s}} <=> $stmt{$a}->{$opt{s}} } keys %stmt;
@@ -141,13 +165,13 @@

 foreach (@sorted) {
     my $v = $stmt{$_} || die;
-    my ($c, $t,$at, $l,$al, $r,$ar) = @{ $v }{qw(c t at l al r ar)};
+    my ($c, $t,$at, $l,$al, $r,$ar,$re,$are) = @{ $v }{qw(c t at l al r ar re are)};
     my @users = keys %{$v->{users}};
     my $user  = (@users==1) ? $users[0] : sprintf "%dusers",scalar @users;
     my @hosts = keys %{$v->{hosts}};
     my $host  = (@hosts==1) ? $hosts[0] : sprintf "%dhosts",scalar @hosts;
-    printf "Count: %d  Time=%.2fs (%ds)  Lock=%.2fs (%ds)  Rows=%.1f (%d), $user\@$host\n%s\n\n",
-           $c, $at,$t, $al,$l, $ar,$r, $_;
+    printf "Count: %d  Time=%.2fs (%ds)  Lock=%.2fs (%ds)  Rows=%.1f (%d)  Rows examined=%.1f (%d), $user\@$host\n%s\n\n",
+           $c, $at,$t, $al,$l, $ar,$r, $are, $re, $_;
 }

 sub usage {
@@ -163,12 +187,25 @@

   -v           verbose
   -d           debug
-  -s ORDER     what to sort by (t, at, l, al, r, ar etc), 'at' is default
+  -s ORDER     what to sort by, 'at' is default
+
+               absolute values:
+              t  : Query time
+              l  : Lock time
+              r  : Rows sent
+              re : Rows examined
+
+              for average values (average per query), prepend an a:
+              at, al, ar, are
+
   -r           reverse the sort order (largest last instead of first)
   -t NUM       just show the top n queries
   -a           don't abstract all numbers to N and strings to 'S'
+  -ain        don't abstract IN (something) to 'IN (...)'
+  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
   -n NUM       abstract numbers with at least n digits within names
   -g PATTERN   grep: only consider stmts that include this string
+  -G PATTERN   inverse grep: only consider stmts that do not include this string
   -h HOSTNAME  hostname of db server for *-slow.log filename (can be wildcard),
                default is '*', i.e. match all
   -i NAME      name of server instance (if using mysql.server startup script)

my current version of mysqldumpslow

Attachment: mysqldumpslow (application/octet-stream, text), 7.67 KiB.

Being at it, I have looked though mysqldump open bugs and
either left comments that I did fix things or integrated
fixes.

fixed:
Bug #20454 Usage bug in mysqldumpslow
Bug #25855 mysqldumpslow: needs better --help output
           (still, more could be done)
Bug #18260  mysqldumpslow doesn't properly handle multi-row inserts/updates

I also integrated the patch for this bug:

Bug #21359      mysqldumpslow: show the database name in the output
                (slightly modified parsing of use/connect)

See the latest attachment for mysqldump code

latest version with more fixes

Attachment: mysqldumpslow (application/octet-stream, text), 8.04 KiB.

Patch of the latest version against 5.0.45

Attachment: mysqldumpslow_patch_bug7414 (application/octet-stream, text), 5.89 KiB.

Hi Nils.

Thank you for your patches.  For our docs team and for justification to apply the patch, do you mind enumerating the new features, with a short example of its usage?

While working on the documentation, I spotted a glitch in my previous
version of mysqdumpslow:

    if(/(?:use|connect)\s+(\w+)/i) {
        $db=$1;
    }

... should not contain a next statement, as it did

Fixed glitch: no "next" when parsing use/connect

Attachment: mysqldumpslow (application/octet-stream, text), 7.98 KiB.

Hi Chad,
hi all,

sorry for the frequent updates, I really should have worked silently
for longer rather than bothering you with updates that often.

I have added a couple of more fixes and yet another command line
switch.

Minor changes:

- fix: debug and verbose options were broken
- fix: removal of MySQL startup message was not working properly
- cleanup: should not use literal tabs in perl regular expressions

I will attach the documentation of major changes and the current
version

Dokumentation of changes of latest version

Attachment: mysqldumpslow_changes_documentation.txt (x-application/octet-stream, text), 10.80 KiB.

latest version with more fixes

Attachment: mysqldumpslow (application/octet-stream, text), 8.80 KiB.

Nils' contrib in patch form (estimated)

Attachment: m.diff (text/x-diff), 9.57 KiB.

Chad, I noticed

@@ -1,9 +1,13 @@
-#!/usr/bin/perl
+#!/usr/local/bin/perl

in the patch. Sorry, I am using /usr/local/bin/perl, but we probably should not change the path for mysql distributions...

A diff would have been better.  I can't tell what we removed and what you added.  :(

Review:

-    my @chunks = split /^\/.*Version.*started with[\000-\377]*?Time.*Id.*Command.*Argument.*\n/m;
+    my @chunks = split /\S+\s+Version.*started\s+with.*\n.*\nTime\s+Id\s+Command\s+Argument\s*\n/m;

"with" + any character.  Does "." match a newline?  Carriage return?   Is "." a subset of all characters?  I think so.  Is this safe to change? 

-    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
+    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+)  Rows_examined: (\d+).*\n//;

I wonder if this is backwards compatible.  When was "Rows_examined" added?

+    unless ($opt{ain}) {
+   s/IN\s*\([^\)]*\)/IN (...)/gi;
+    }

I don't think a regex can handle that.  What if the text was  IN('x', 'y (not z)', 'z') ?

+    unless ($opt{avalues}) {
+   # this RE is not optimal (too greedy - eats everything after VALUES (),()... which ends with )
+   # but the correct version of it triggers a perl bug (MCS: see CTS#58124)
+   ## correct code triggering bug: s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
+   s/VALUES\s*\(.+\)/VALUES (...)/gi;
+    }

Why not minimally-matching?  s/VALUES\s*\(.+?\)/VALUES (...)/gi;  I think the Perl regex engine supports that.

+  -ain         don't abstract IN (something) to 'IN (...)'
+  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
+  -awhitespace don't normalise arbitrary whitespace (preserve original formatting)
+               note: leading whitespace is normalised always

I dislike changing behavior unless some new option is specified.  Please make that "behave as usual unless you know of these new flags".

Chad,

thanks for reviewing my suggestions.

> -    my @chunks = split /^\/.*Version.*started with[\000-\377]*?Time.*Id.*Command.*Argument.*\n/m;
> +    my @chunks = split /\S+\s+Version.*started\s+with.*\n.*\nTime\s+Id\s+Command\s+Argument\s*\n/m;
>
> "with" + any character.  Does "." match a newline?  Carriage return?
> Is "." a subset of all characters?  I think so.  Is this safe to
> change?

First of all, the reason why I did change this is that the original
code did not work on current mysqld outputs any more. Unfortunetely I
have not exactly documented the cases for which the old version was
broken.

I changed

  with[\000-\377]*

into

  with.*\n.*\n

because I thought it was clearer to explicitly read one line of
something that cathing anything plus newlines with that octal
character class.

My change would stop working if the startup message was changed to be
more then three lines long, but almost any other change to the startup
message would make it necessary to adjust mysqldumpslow anyway.

> -    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+).*\n//;
> +    s/^# Query_time: (\d+)  Lock_time: (\d+)  Rows_sent: (\d+) Rows_examined: (\d+).*\n//;
>
> I wonder if this is backwards compatible.  When was "Rows_examined"
> added?

I guess it will be easier for you than for me to find out, but if we
included the changes to mysqldumpslow only for MySQL5 and above, we
should not need to care about being backwards compatible (and as 4.1
is EOL, we should be safe as this already worked with 4.1)

> +    unless ($opt{ain}) {
> +   s/IN\s*\([^\)]*\)/IN (...)/gi;
> +    }
>
>
> I don't think a regex can handle that.  What if the text was  IN('x',
> 'y (not z)', 'z') ?

You are right, other than parsing the SQL code to a large extent, we
cant handle these things.

The point is, IMHO, that noone using mysqldumpslow will care, because
it assists you in analysing the slow.log, even if its query
abstraction is not perfect. We are trying to find queries in the slow
log which are "the same", which can become a very difficult task. But
if we can happily live with something which is "good enough", why
should we make code more complex then necessary?

> +    unless ($opt{avalues}) {
> +   # this RE is not optimal (too greedy - eats everything after VALUES
> (),()... which ends with )
> +   # but the correct version of it triggers a perl bug (MCS: see
> CTS#58124)
> +   ## correct code triggering bug:
> s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;
> +   s/VALUES\s*\(.+\)/VALUES (...)/gi;
> +    }
>
>
> Why not minimally-matching?  s/VALUES\s*\(.+?\)/VALUES (...)/gi;  I
> think the Perl regex engine supports that.

Because this code would not abtract VALUES (a,b,c),(d,e,f),(g,h,i)...

Originally, I used the code which is commented out:

        s/VALUES([,\s]*\([^\)]+\)){1,}/VALUES (...)/gi;

which did work but would trigger a bug in some cases.

You are suggesting to match minimally, so the statement would look
like this:

        s/VALUES([,\s]*\(.*?\)){1,}/VALUES (...)/gi;

but I think chances are that it'll hit the same perl bug (which I have
only trapped upon after having successfully used the code stated above
for a month or so on daily slow.logs, so I would rather be
conservative about this).

> +  -ain         don't abstract IN (something) to 'IN (...)'
> +  -avalues     don't abstract VALUES (Something) to 'VALUES (...)'
> +  -awhitespace don't normalise arbitrary whitespace (preserve original
>                 formatting)
> +               note: leading whitespace is normalised always
>
> I dislike changing behavior unless some new option is specified.
> Please make that "behave as usual unless you know of these new flags".

I understand your point, and if this was any other mysql component, I
would be extremely conservative about changing defaults (I know, for
instance, how much trouble the change of the precedence of the comma
operator for JOINs has caused).

But, again, this is not a mysql core component - mysqldumpslow is to
help people analyse the slow.log. I have realised that the old code
would not abstract enough to be helpful for real life problems, and I
have tried to find some mechanisms to abstract more without
introducing the complexity of real parsing.

The new defaults I suggest are a recommendation which I believe will
be what most users will find most helpful, but I might be wrong.

So if you think that we should better not change the defaults, that's
fine for me.

Nils -

Since you're interested in contribution to MySQL, I thought I should let you know that MySQL is now using a new contributor agreement (i.e. the "Sun Contributor Agreement"). 

Under the Sun Contributor Agreement (SCA), the contributor retains copyrights while also granting those same rights to Sun as the project sponsor. It supersedes the previously used MySQL Contributor License Agreement (CLA).

You can get a copy of the SCA document here: http://www.sun.com/software/opensource/sca.pdf 

Once you sign and send your copy of the SCA document as described here: 
  http://forge.mysql.com/wiki/Sun_Contributor_Agreement , 
your signatory status will be reflected, along with your interest in contributing to MySQL, here: 
   https://sca.dev.java.net/CA_signatories.htm 

Having a clear SCA status will enable and facilitate the review and interaction regarding your code contributions to MySQL (as well as to any other Sun-managed Open-Source project).

Hi Masood,

as I am also an OpenSolaris Contributor, I Habe already signed the SCA.
It's filed as SCA# OS0070. 

Cheers, Nils

Nils --  Have you checked to see if your patch works on 6.0?