Bug #35319 Almost certain crash(es) of mysqld with the "rpl*" tests from mysql-test-run.pl
Submitted: 16 Mar 2008 12:34 Modified: 30 Mar 2010 6:16
Reporter: Axel Luttgens Email Updates:
Status: Can't repeat Impact on me:
None 
Category:Tests: Replication Severity:S7 (Test Cases)
Version:5.0.51a + 5.1 OS:MacOS (10.4 + 10.5, PPC)
Assigned to: Daogang Qu CPU Architecture:Any

[16 Mar 2008 12:34] Axel Luttgens
Description:
Hello,

Sorry if this is a duplicate, but I couldn't find a similar report.

While running mysql-test-run.pl with the mysql-5.0.51a-osx10.5-x86.tar.gz binary distribution, one almost always get at least one error message similar to these ones:

    Warning;  Aborted waiting on pid file: ...
    mysqltest: At line 56: query 'stop slave' failed: 2013: Lost connection ...

Such messages in fact are systematically related to crashes of mysqld.

After many runs of the test suite, it seems very likely that those crashes never happen outside the replication tests (those starting with "rpl").

As a consequence, running:

    ./mysql-test-run.pl --force --do-test=rpl

will almost certainly trigger at least one crash of the mysqld daemon.

However, those crashes are random in the sense that they aren't related to a particular test: they happen _somewhere_ in the course of the "rpl*" tests.

The same happen with the binaries from the 64-bit distribution mysql-5.0.51a-osx10.5-x86_64.tar.gz, as well as with self-compiled 32/64 binaries on both Intel and PPC.

On the other hand, configuring/building with the "--with-debug" option does not improve nor change the above behavior.

Of course, please let me know should I overlook something: one never may exclude such an hypothesis...

How to repeat:
Download distribution mysql-5.0.51a-osx10.5-x86.tar.gz, unpack it, cd to its mysql-test directory and run:

    ./mysql-test-run.pl --force --do-test=rpl

several times.

Do the same with distribution mysql-5.0.51a-osx10.5-x86.tar.gz.

Similar behaviors may be observed with binaries compiled for (and run on) PPC.
[16 Mar 2008 12:36] Axel Luttgens
Reports for the 32-bit distribution

Attachment: Distribution32.txt (text/plain), 105.68 KiB.

[16 Mar 2008 12:36] Axel Luttgens
Reports for the 64-bit distribution

Attachment: Distribution64.txt (text/plain), 16.61 KiB.

[16 Mar 2008 12:38] Axel Luttgens
Reports for a 32-bit "debug" compiled binary

Attachment: DebugBinaries.txt (text/plain), 29.61 KiB.

[24 Mar 2008 23:49] Sveta Smirnova
Thank you for the report.

I can not repeat described behavior. My guess you can catch some system port allocation conflict. Please check system logs located at /var/log (system.log, netinfo.log) in case if they contain related information.
[25 Apr 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[28 Apr 2008 13:00] Axel Luttgens
[25 Mar 0:49] Sveta Smirnova wrote:
> Thank you for the report.
> 
> I can not repeat described behavior.
>
> My guess you can catch some system port allocation conflict. Please check
> system logs located at /var/log (system.log, netinfo.log) in case if they
> contain related information.

Please agree my apologies: I didn't figure how quickly time had flown.

Here, I still can reproduce the problem rather consistently on various boxes.
But there's a point I didn't immediately notice: it looks like one needs some "luck" to observe the crashes.

So, I may sometimes need to execute a command like this one:
	./mysql-test-run.pl --force --skip-ndb --do-test=rpl
several times (10-15 times?) before getting a crash; thereafter, crashes will occur almost surely for several subsequent runs.
But those crashes may also occur immediately, even on a freshly rebooted machine...

As far as the system logs are concerned, I never could find any relevant info.
I even tried today to increase syslogd's verbosity (in both system.log and asl.log), in case something could be emitted at a DEBUG level only: no more info to be found.
That is, other parts of the system don't seem to complain because of unsuspectedly unavailable resources, or the like.

In fact, the problem seems to be strictly internal to mysqld, possibly at cleanup/quit time. For example, with the i386 code, the crashed thread seems to always show following pattern:
  Thread 2 Crashed:
  0   libSystem.B.dylib     0x92f255ba __pthread_kill + 10
  1   mysqld                0x00170229 write_core + 43
  2   mysqld                0x00079714 handle_segfault + 656
  3   libSystem.B.dylib     0x92eab5eb _sigtramp + 43
  4   ???                   0xffffffff 0 + 4294967295
  5   libSystem.B.dylib     0x92e44aed free + 106
  6   mysqld                0x0016d9ba mysql_close_free_options + 136
  7   mysqld                0x0016f5b9 mysql_close + 133
  8   mysqld                0x00165618 handle_slave_io + 4664
  9   libSystem.B.dylib     0x92e6fc55 _pthread_start + 321
  10  libSystem.B.dylib     0x92e6fb12 thread_start + 34
But I definitely am not a debug wizard...

Now, I've reported the problem as "critical", because such seemingly self-inflicted crashes never are good news for a daemon...

Thanks for your follow-up,
Axel
[8 Jul 2008 18:39] Joerg Bruehe
I had this symptom in some tests of the 5.1.26-rc
on our OS X machines with PowerPC (both 32 and 64 bit).

Some examples:

=====
rpl.rpl_extraColmaster_innodb 'mix' [ fail ]

mysqltest: In included file "./extra/rpl_tests/rpl_extraMaster_Col.test": At line 728: query 'STOP SLAVE' fail
ed: 2013: Lost connection to MySQL server during query

The result from queries just before the failure was:
< snip >
Last_IO_Errno   #
Last_IO_Error   #
Last_SQL_Errno  1091
Last_SQL_Error  Error 'Can't DROP 'c7'; check that column/key exists' on query. Default database: 'test'. Quer
y: 'ALTER TABLE t14 DROP COLUMN c7'
STOP SLAVE;
RESET SLAVE;

*** Drop
DROP TABLE t14;
DROP TABLE t14;
RESET MASTER;
START SLAVE;

*************************************************
* - Alter Master adding columns at end of table *
*   Expect: Error 1054                          *
*************************************************

*** Create
STOP SLAVE;

More results from queries before failure can be found in /PATH/mysql-test/var/log/rpl_extraColmaster_innodb.log
=====
Similar for log mode 'stmt',
and for test "rpl.rpl_extraColmaster_myisam" in log modes 'mix', 'row', and 'stmt'.

=====
rpl.rpl_slave_status 'mix'     [ fail ]

mysqltest: At line NNN: query 'stop slave' failed: 2013: Lost connection to MySQL server during query

The result from queries just before the failure was:
stop slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
reset master;
reset slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
start slave;
grant replication slave on *.* to rpl@127.0.0.1 identified by 'rpl';
stop slave;

More results from queries before failure can be found in /PATH/mysql-test/var/log/rpl_slave_status.log
rpl.rpl_slave_status 'mix'     [ fail ]

mysqltest: At line NNN: query 'stop slave' failed: 2013: Lost connection to MySQL server during query

The result from queries just before the failure was:
stop slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
reset master;
reset slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
start slave;
grant replication slave on *.* to rpl@127.0.0.1 identified by 'rpl';
stop slave;

More results from queries before failure can be found in /PATH/mysql-test/var/log/rpl_slave_status.log
=====
Similar in log mode 'stmt'.

There are other tests that fail with this symptom.

Also, this does not start with 5.1.26-rc, it has a longer tradition.

Searching the bug DB led me to report #15671 which reports a similar issue for OS X using the x86 CPU, and claims it got fixed by July 2006.
Scanning all isolated test failures in the build archive for crashes reported with "stop slave", I found a spike in versions 5.0.20 - 5.0.22 on such platforms, and this then really dropped off (July 2006).

So it seems this problem got fixed for x86, but still exists for PPC.

I will attach a file with a list of all such occurrences after July 2006 which got isolated, the test logs are available in the archive.
[8 Jul 2008 18:49] Joerg Bruehe
List of isolated occurrences in the test suite runs

Attachment: list-stopslave-2013 (application/octet-stream, text), 7.43 KiB.

[18 Nov 2008 23:12] Trudy Pelzer
Per Vlad, bug#40796 is probably related:
"After searching a bit in the bugdb, heap corruption with rpl involved
appears to be reported on MacOSX http://bugs.mysql.com/bug.php?id=35319 
Stack looks similar to what I've seen in http://bugs.mysql.com/bug.php?id=40796"
[29 Mar 2010 2:50] Daogang Qu
The bug can't be repeated. 
I run the './mysql-test-run.pl --force --skip-ndb --do-test=rpl' command base on newest version of mysql-5.0-bugteam and mysql-5.1-bugeam for more than twenty times. But the bug was not been reproduced on both trees.
[29 Mar 2010 2:53] Daogang Qu
The above tests are done on slid07.Norway.Sun.COM with Mac OS X 10.5, PPC arch.