Bug #43418 MTR2: does not notice a memory leak occuring at shutdown of mysqld w/ --valgrind
Submitted: 5 Mar 2009 14:58 Modified: 20 Dec 2010 3:10
Reporter: Guilhem Bichot Email Updates:
Status: Closed Impact on me:
None 
Category:Tools: MTR / mysql-test-run Severity:S3 (Non-critical)
Version:6.0, next-mr-bugfixing OS:Linux
Assigned to: Bjørn Munch CPU Architecture:Any

[5 Mar 2009 14:58] Guilhem Bichot
Description:
I'm using 6.0-maria. Put this in a .test file:

mkdir $MYSQLTEST_VARDIR/tmp/backup;
SET @@global.backupdir = '../../tmp/backup';
SET @@global.backupdir = @@global.datadir;
#source include/mysqladmin_shutdown.inc;
exit;

Then run this test with
./mtr --mem --valgrind <testname> "--valgrind-option=--show-reachable=yes"

output is:

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

SET @@global.backupdir = '../../tmp/backup';
SET @@global.backupdir = @@global.datadir;
backup.bug                               [ pass ]    168
------------------------------------------------------------
The servers were restarted 0 times
Spent 0.168 of 22 seconds executing testcases

All 1 tests were successful.

and if you look into var/mysqld.1/mysqld.err, its last line is:
Version: '6.0.10-alpha-valgrind-max-debug-log'  socket: '/home/mysql_src/bzrrepos/mysql-6.0-maria/mysql-test/var/tmp/mysqld.1.sock'  port: 10670  Source distribution

so the portion describing the shutdown is missing (mtr shuts down mysqld after the test, so "Shutdown complete" should be in the .err).
That's an issue, because in fact, there is a memory leak at shutdown; to see it, remove the # in the test (the effect is to do the shutdown before the test ends), and re-run:

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

SET @@global.backupdir = '../../tmp/backup';
SET @@global.backupdir = @@global.datadir;
backup.bug                               [ fail ]
        Test ended at 2009-03-05 15:50:24

CURRENT_TEST: backup.bug
==26341== Memcheck, a memory error detector.
==26341== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==26341== Using LibVEX rev 1804, a library for dynamic binary translation.
==26341== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==26341== Using valgrind-3.3.0, a dynamic binary instrumentation framework.
==26341== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==26341== For more details, rerun with: -v
==26341==
==26341==
==26341== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 1)
==26341== malloc/free: in use at exit: 28 bytes in 1 blocks.
==26341== malloc/free: 282 allocs, 281 frees, 322,847 bytes allocated.
==26341== For counts of detected errors, rerun with: -v
==26341== searching for pointers to 1 not-freed blocks.
==26341== checked 2,113,648 bytes.
==26341==
==26341== LEAK SUMMARY:
==26341==    definitely lost: 0 bytes in 0 blocks.
==26341==      possibly lost: 0 bytes in 0 blocks.
==26341==    still reachable: 0 bytes in 0 blocks.
==26341==         suppressed: 28 bytes in 1 blocks.

The server [mysqld.1 - pid: 26243, winpid: 26243, exit: 0] crashed while running 'check warnings'

 - saving '/home/mysql_src/bzrrepos/mysql-6.0-maria/mysql-test/var/log/backup.bug/' to '/home/mysql_src/bzrrepos/mysql-6.0-maria/mysql-test/var/log/backup.bug/'

and in the saved .err file:

090305 17:53:33 [Note] /home/mysql_src/bzrrepos/mysql-6.0-maria/sql/mysqld: Shutdown complete

==26705==
==26705== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 1)
==26705== malloc/free: in use at exit: 980 bytes in 9 blocks.
==26705== malloc/free: 7,019 allocs, 7,010 frees, 330,594,831 bytes allocated.
==26705== For counts of detected errors, rerun with: -v
==26705== searching for pointers to 9 not-freed blocks.
==26705== checked 61,577,292 bytes.
==26705==
==26705== 88 bytes in 2 blocks are definitely lost in loss record 2 of 3
==26705==    at 0x4023858: malloc (vg_replace_malloc.c:207)
==26705==    by 0x8911B58: my_malloc (my_malloc.c:34)
==26705==    by 0x8911D4C: my_strdup (my_malloc.c:79)
==26705==    by 0x83179E1: fix_paths() (mysqld.cc:8969)
==26705==    by 0x831B445: get_options(int*, char**) (mysqld.cc:8836)
==26705==    by 0x831B82B: init_common_variables(char const*, int, char**, char const**) (mysqld.cc:3450)
==26705==    by 0x831D9E1: main (mysqld.cc:4561)

It is an annoying problem, because if a test causes a leak, this leak will be noticed only when a much later test shuts down mysqld. For example, running:
./mtr backup.backup_backupdir backup.backup_maria_other_instance --no-reorder "--valgrind-option=--show-reachable=yes" --valgrind --mem,
the warning is issued for backup.backup_maria_other_instance (which has a server shutdown/restart inside), making this test look guilty (leading it to be disabled as in BUG#42924). Whereas the real guilty is backup.backup_backupdir.

How to repeat:
see above mtr command lines.

Suggested fix:
find out why the shutdown messages don't end up in the .err; they should.
[11 Mar 2009 9:44] Kristian Nielsen
There are several problems related to this in the new mtr code with seperate worker process talking to test case server.

 - No graceful shutdown of mysqld after running last test (kill -9 is used instead), preventing safemalloc and Valgrind from detecting leaks.

 - No parsing of .err logs for warnings after graceful shutdown during the tests, though as Guilhem writes that parsing may or may not be done at a later time as part of an unrelated test.
[11 Mar 2009 13:44] Kristian Nielsen
Found a further problem that the parsing of the server logs is actually done by an SQL stored procedure running within mysqld!

That makes it somewhat difficult to check for warnings generated after shutdown of the mysqld... unless we were to start up mysqld *again*, which would slow down things, especially for running a single test case.

Seems to me that parsing warnings in SQL is a mistake, and the code should go back to doing everything from outside in the mtr framework.
[17 Mar 2009 8:38] Kristian Nielsen
For a possible fix, see

    https://lists.launchpad.net/maria-developers/msg00041.html
[25 May 2009 12:32] Davi Arnaut
One problem is that relying on the shutdown to detect a memory leak is problematic because MTR2 tries to avoid restarting the server between test cases. One solution is to use valgrind's memcheck API to kick a leak check after each test case run.

Before any test case is run, we would call a leak check and count the number of leaks (VALGRIND_COUNT_LEAKS) so that this can be compared with the count after each test case run. If the numbers differ, there is probably a memory leak somewhere..
[22 Jun 2009 12:08] Bjørn Munch
I've been testing the referenced patch in 5.1 (minus the Test.pm changes which is a separate issue), will need to make some changes and make sure I understand what it does.

But as tomorrow is my last day before vacation, I find it best to wait.
[11 Aug 2009 16:04] Guilhem Bichot
still got bitten by this...
[29 Sep 2009 8:20] Tor Didriksen
I'm running with this ad-hoc patch, to avoid the 'kill -9' problem.

=== modified file 'mysql-test/lib/My/SafeProcess/safe_process.cc'
--- mysql-test/lib/My/SafeProcess/safe_process.cc	2009-07-08 12:31:22 +0000
+++ mysql-test/lib/My/SafeProcess/safe_process.cc	2009-09-28 08:00:28 +0000
@@ -95,7 +95,7 @@
 
   message("Killing child: %d", child_pid);
   // Terminate whole process group
-  kill(-child_pid, SIGKILL);
+  kill(-child_pid, SIGTERM);
 
   pid_t ret_pid= waitpid(child_pid, &status, 0);
   if (ret_pid == child_pid)
[30 Sep 2009 14:04] Bjørn Munch
My patch just submitted for for Bug #46327 should also fix the production of memory leak reports, but it still won't be detected by MTR.

I will pick up this again now, I understand a bit more about what the suggested patch was doing :-)
[1 Oct 2009 9:14] Bjørn Munch
I'm considering a total rewrite of the warning/error reporting, doing it in MTR itself. So patching up this extensively now may not be worth it.

When memory leaks are reported, we can't link it to any particular tests when many have been run since the server was started. It's also not obvious to me that a test should be flagged as failed even if we can.

So instead, we could add a step of post processing of the server logs after the test execution when running with valgrind, and add a separate report to the output; this could also have special logic and patterns to look for memory leak errors etc.
[5 Oct 2009 7:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85701

2839 Bjorn Munch	2009-10-05
      Bug #43418 MTR2: does not notice a memory leak occuring at shutdown of mysqld w/ --valgrind
      Checking is done before server is terminated
      Adds post processing of server logs if --valgrind
      NB this fix depends on (and the diff is relative to) unpushed fix no 46327
[9 Nov 2009 9:48] Guilhem Bichot
comments sent by mail
[11 Nov 2009 15:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/90107

2853 Bjorn Munch	2009-11-11
      Bug #43418 MTR2: does not notice a memory leak occuring at shutdown of mysqld w/
      --valgrind
      Checking is done before server is terminated
      Adds post processing of server logs if --valgrind
      Also had to remove --quiet option to valgrind
[12 Nov 2009 8:46] Bjørn Munch
Quick note on what was changed in the second patch vs. the first:

1. In the while loop, stored $1 in local variable to protect against change until I use it. Thanks to Guilhem for pointing this out in email.

2. Had to remove --quiet option to valgrind which had recently been added by other fix. This would remove the ERROR SUMMARY.

3. Because of (2), added some global suppressions to cover benign messages from valgrind

4. Slight fix to the general pattern for valgring messages.
[12 Nov 2009 9:34] Bjørn Munch
I need to find a second reviewer too.
[12 Nov 2009 10:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/90195

2853 Bjorn Munch	2009-11-12
      Bug #43418 MTR2: does not notice a memory leak occuring at shutdown of mysqld w/ --valgrind
      Checking is done before server is terminated
      Adds post processing of server logs if --valgrind
      Also had to remove --quiet option to valgrind
[12 Nov 2009 11:57] Bjørn Munch
Pushed to 5.1-mtr, trunk-mtr, next-mr-mtr, 6.0-codebase-mtr
[16 Nov 2009 10:21] Bjørn Munch
A followup problem observed in a few tests, see Bug #48795 which alrerady has a patch committed.
[20 Feb 2010 9:32] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100220092844-jh07ayojyxn8uh9p) (version source revid:bjorn.munch@sun.com-20091112115053-tg3016hrewy7ngbf) (merge vers: 6.0.14-alpha) (pib:16)
[20 Feb 2010 9:34] Bugs System
Pushed into 5.5.3-m2 (revid:alik@sun.com-20100220092622-wvhh1vfy5tjq4mhu) (version source revid:bjorn.munch@sun.com-20091112101456-2piurunq3z0x2u5r) (merge vers: 5.5.0-beta) (pib:16)
[20 Feb 2010 9:37] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100220092730-paoo5t9gcajs5dq8) (version source revid:bjorn.munch@sun.com-20091112102024-f386rvyu4rbimemb) (pib:16)
[20 Feb 2010 18:38] Paul DuBois
Change to test suite. No changelog entry needed.

Setting report to Need Merge pending push to 5.1.x.
[1 Mar 2010 8:47] Bugs System
Pushed into 5.1.45 (revid:joro@sun.com-20100301083827-xnimmrjg6bh33o1o) (version source revid:azundris@mysql.com-20100222175719-viuh0f3gdsrkgv0r) (merge vers: 5.1.45) (pib:16)
[1 Mar 2010 15:55] Paul DuBois
No changelog entry needed.
[17 Jun 2010 12:20] Bugs System
Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 13:08] Bugs System
Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:martin.skold@mysql.com-20100609140708-52rvuyq4q500sxkq) (merge vers: 5.1.45-ndb-6.2.19) (pib:16)
[17 Jun 2010 13:48] Bugs System
Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[13 Sep 2010 9:07] Guilhem Bichot
Hello Bjorn. I think that this has not been fixed after all. 'mtr --valgrind' does not notice that my server code is leaking memory; it seems to still be killing mysqld with SIGKILL, which doesn't give any chance to Valgrind to do its at-shutdown checks to find memory which has been leaked. To give a repeatable test case:

1) force a leak in mysqld, with this patch:
=== modified file 'sql/mysqld.cc'
--- sql/mysqld.cc	2010-09-08 07:10:51 +0000
+++ sql/mysqld.cc	2010-09-13 09:00:08 +0000
@@ -4350,6 +4350,8 @@
 int mysqld_main(int argc, char **argv)
 #endif
 {
+  char *leak= (char*)malloc(1000);
+  leak= NULL;
   /*
     Perform basic thread library and malloc initialization,
     to be able to read defaults files and parse options.

2) run this
./mtr --mem alias --valgrind
3) look at var/log/warnings or var/log/mysqld.1.err: there may be a warning about InnoDB ("Uninitialised byte(s) found during client check request"), but nothing about my leak.
The last line of mysqld.1.err is:
"Version: '5.6.99-m5-valgrind-max-debug-log'  socket: '/home/mysql_src/bzrrepos_new/mysql-next-mr-bugfixing2/mysql-test/var/tmp/mysqld.1.sock'  port: 10670  Source distribution"
i.e. the message printed when server starts. If the server had been shut down properly (which is needed for valgrind leak error reporting), there would the message which the server prints when it shuts down (message saying "normal shutdown"). Looks like it's kill -9 here :-(

Tor's quick fix (SIGKILL->SIGTERM as in his post above) makes the valgrind leak warning appear in var/log/mysqld.1.err as desired:
==20022== 1,000 bytes in 1 blocks are definitely lost in loss record 1 of 1
==20022==    at 0x4C25153: malloc (vg_replace_malloc.c:195)
==20022==    by 0x550F49: mysqld_main(int, char**) (mysqld.cc:4353)
==20022==    by 0x549FB3: main (main.cc:24)
Note: even with that quick fix, I don't see the error in var/log/warnings , which is a second bug.
I am using next-mr-bugfixing revision-id:jorgen.loland@oracle.com-20100909105840-i80sa0637fsimlrc compile-pentium64-valgrind-max.
Thanks for looking into this.
[13 Sep 2010 14:37] Bjørn Munch
It has been fixed for the normal case. What happens here is that the test *fails* due to those valgrind warnings at startup, and when the test fails, we do not do normal shutdown of the server.

If you run the same test with --nowarnings, mtr will not fail the test and you should see it reporting the leak.
[14 Sep 2010 8:43] Guilhem Bichot
Hello.
First, when I run a test with --valgrind I expect to get all Valgrind errors. With this test I don't get them all. The logic as you explain it seems to be that as there has been an InnoDB valgrind warning even before shutdown, mtr doesn't need to show me the shutdown valgrind warnings, and thus it can freely kill-9? Given that InnoDB warnings are not something that I can fix, what should I do? Use --nowarnings all the time with --valgrind? Won't I miss some important valgrind warning this way? Don't we have a usability problem here?
Second, even if run with --nowarnings:
main.alias                               [ pass ]   2334
worker[1] Valgrind report from /home/mysql_src/bzrrepos_new/mysql-next-mr-bugfixing2/mysql-test/var/log/mysqld.1.err after tests:
 main.alias
------------------------------------------------------------
ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)

------------------------------------------------------------
The servers were restarted 0 times
Spent 2.334 of 26 seconds executing testcases

Completed: All 1 tests were successful.

and the exit code is 0. The 1000-byte leak, which this time is present in mysqld.1.err, isn't shown above and doesn't make the test fail. Looks like a bug.
[14 Sep 2010 8:58] Bjørn Munch
I think it's intentional that we don't attempt normal shutdown after a failure, but it would make sense to make an exception for valgrind, at least if the test failed due to server log warnings. You have point there.

As for the missing report with --nowarnings, that looks strange. I definitely got one when I tested it. After the SUMMARY line it should print all following lines that start with ==<pid>==. Can you verify they that are in the server log?

We cannot fail the test due to these warnings, as mtr finds them in the log *after* all tests have been run. This cannot be changed without a complete rewrite of the server log analysis.
[16 Sep 2010 14:24] Guilhem Bichot
>I think it's intentional that we don't attempt normal shutdown after a
>failure, but it would make sense to make an exception for valgrind,
>at least if the test failed due to server log warnings. You have point there.

Thanks. Especially as I cannot fix InnoDB warnings, in the current situation they hide the shutdown warnings of my code.

>As for the missing report with --nowarnings, that looks strange. I definitely
>got one when I tested it. After the SUMMARY line it should print all
>following lines that start with ==<pid>==. Can you verify they that are in
>the server log?

The output is really:
worker[1] Using MTR_BUILD_THREAD 67, with reserved ports 10670..10679
main.alias                               [ pass ]   3053
worker[1] Valgrind report from /home/mysql_src/bzrrepos_new/mysql-next-mr-bugfixing2/mysql-test/var/log/mysqld.1.err after tests:
 main.alias
------------------------------------------------------------
ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)

------------------------------------------------------------
The servers were restarted 0 times
Spent 3.053 of 42 seconds executing testcases

Completed: All 1 tests were successful.

and in var/log/mysqld.1.err there is:

100916 16:58:54 [Note] /home/mysql_src/bzrrepos_new/mysql-next-mr-bugfixing2/sql/mysqld: Shutdown complete

==9391== 
==9391== HEAP SUMMARY:
==9391==     in use at exit: 1,000 bytes in 1 blocks
==9391==   total heap usage: 46,334 allocs, 46,333 frees, 416,176,710 bytes allocated
==9391== 
==9391== 1,000 bytes in 1 blocks are definitely lost in loss record 1 of 1
==9391==    at 0x4C25153: malloc (vg_replace_malloc.c:195)
==9391==    by 0x550F49: mysqld_main(int, char**) (mysqld.cc:4353)
==9391==    by 0x549FB3: main (main.cc:24)
==9391== 
==9391== LEAK SUMMARY:
==9391==    definitely lost: 1,000 bytes in 1 blocks
==9391==    indirectly lost: 0 bytes in 0 blocks
==9391==      possibly lost: 0 bytes in 0 blocks
==9391==    still reachable: 0 bytes in 0 blocks
==9391==         suppressed: 0 bytes in 0 blocks
==9391== 
==9391== For counts of detected and suppressed errors, rerun with: -v
==9391== Use --track-origins=yes to see where uninitialised values come from
==9391== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)

I think I know why we don't see it on stdout: valgrind_exit_reports() has those lines:
      # This line marks the start of a valgrind report
      $found_report= 1 if $line =~ /ERROR SUMMARY:/;
but we see in the above .err that the info about the 1000-byte leak is _before_ ERROR SUMMARY. Maybe mtr should also look for "LEAK SUMMARY" (this section is not present when there are no leaks, from what I have tested). Even then, it seems that valgrind_exit_reports() only prints, does not influence mtr's exit code.

>We cannot fail the test due to these warnings, as mtr finds them in the log
>*after* all tests have been run. This cannot be changed without a complete >rewrite of the server log analysis.
I understand that we cannot mark one test as failed, as we don't know what test
caused this. But we can certainly make mtr return a non-zero error code
to indicate that something is wrong.
[17 Sep 2010 8:42] Bjørn Munch
When I tested this (and as I've seen it run elsewhere), the line ERROR SUMMARY has always been printed first by valgrind. Maybe you're running a version that behaves differently.

It might be enough to change the trigger pattern from ERROR SUMMARY: to just SUMMARY:
[23 Nov 2010 13:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/124738

2952 Bjorn Munch	2010-11-23
      Bug #43418 MTR2: does not notice a memory leak occuring at shutdown of mysqld w/ --valgrind
      Follow-up discussed with Reporter:
      Avoid hard shutdown after test failure, if caused by server log warning
        AND we are running valgrind
      More general pick-up of valgrind summaries, order may apparently vary
      Do exit(1) if we did find valgrind summary warnings
[5 Dec 2010 12:44] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (version source revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (merge vers: 5.6.1) (pib:23)
[7 Dec 2010 12:49] Paul DuBois
Changes to test suite. No changelog entry needed.
[17 Dec 2010 12:49] Bugs System
Pushed into mysql-5.1 5.1.55 (revid:georgi.kodinov@oracle.com-20101217124435-9imm43geck5u55qw) (version source revid:mats.kindahl@oracle.com-20101201193331-1c07sjno2g7m46ix) (merge vers: 5.1.55) (pib:24)
[17 Dec 2010 12:52] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:georgi.kodinov@oracle.com-20101217124733-p1ivu6higouawv8l) (version source revid:bjorn.munch@oracle.com-20101127105217-ngg48wiwnfra2ik3) (merge vers: 5.5.8) (pib:24)