Bug #40065 Crash involving Falcon on Linux/PPC during the test suite
Submitted: 15 Oct 2008 18:54 Modified: 11 Aug 2009 9:09
Reporter: Joerg Bruehe
Status: Can't repeat
Category:MySQL Server: Compiling Severity:S1 (Critical)
Version:6.0.8 OS:Linux (PowerPC, gcc 4.3)
Assigned to: Joerg Bruehe CPU Architecture:Any
Tags: F_PLATFORM

[15 Oct 2008 18:54] Joerg Bruehe
Description:
For some days now, the test suite has been hanging when I run it locally on my Linux/PPC machine ("Pegasos", running Debian "testing") using the current 6.0-build team tree.

Externally, the symptom is:
- the suite hangs,
- the Perl script has two "defunct" child processes
  (by now, I found: "mysqltest" and "mysqladmin"),
- the timeout for the whole suite is running,
- no server process.

Today, I found the server leaves core dumps - this is a backtrace:
(gdb) where
#0  0x0ff70454 in pthread_kill () from /lib/libpthread.so.0
#1  0x108ddfb0 in my_write_core (sig=11) at stacktrace.c:307
#2  0x102d43cc in handle_segfault (sig=11) at mysqld.cc:2669
#3  <signal handler called>
#4  ~Dbb (this=0x48715758) at Dbb.cpp:139
#5  0x105a7d20 in ~Database (this=0x48514630) at Database.cpp:568
#6  0x1059d728 in Connection::shutdownDatabase (this=<value optimized out>) at Connection.cpp:1805
#7  0x1055929c in StorageDatabase::close (this=0x48715120) at StorageDatabase.cpp:925
#8  0x1055da88 in StorageHandler::shutdownHandler (this=0x48514020) at StorageHandler.cpp:196
#9  0x10547e84 in StorageInterface::panic (hton=<value optimized out>, flag=<value optimized out>)
    at ha_falcon.cpp:266
#10 0x103eb358 in ha_finalize_handlerton (plugin=0x115701b8) at handler.cc:385
#11 0x1049287c in reap_plugins () at sql_plugin.cc:819
#12 0x104936e0 in plugin_shutdown () at sql_plugin.cc:1512
#13 0x102d3500 in clean_up (print_message=true) at mysqld.cc:1353
#14 0x102d73f0 in kill_server (sig_ptr=<value optimized out>) at mysqld.cc:1269
#15 0x102d7560 in kill_server_thread (arg=<value optimized out>) at mysqld.cc:1232
#16 0x0ff69e34 in start_thread () from /lib/libpthread.so.0
#17 0x0fb9e9d0 in clone () from /lib/libc.so.6
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

How to repeat:
I had it three times, each in new builds, when running the test suite.
Last output lines were:

main.lowercase_utf8            [ pass ]              8
Warning;  Aborted waiting on pid file: '/MySQL/REPO/V60/work-6.0/mysql-test/var/run/master.pid' after 20 seconds
main.lowercase_view            [ pass ]            106
Warning;  Aborted waiting on pid file: '/MySQL/REPO/V60/work-6.0/mysql-test/var/run/master.pid' after 20 seconds

In all three cases, the output was the same.

I am now trying to reproduce it on a machine in Uppsala
and will add details here depending on the result.
[16 Oct 2008 11:28] Joerg Bruehe
The test I started in Uppsala yesterday did not show this problem.
(It had others which I will report separately.)

I just pulled a big change from the main 6.0 tree to my local machine and will repeat the test.

If it still fails, I have to check details about compiler, libraries etc.
[16 Oct 2008 19:57] Joerg Bruehe
The failure persists on my home Pegasos.

I will now try to mimic the way we do a release build, hoping this may have some effect.
[21 Oct 2008 10:48] Joerg Bruehe
Update:

- Mimicking a release build
  (rather than building within the BZR source tree)
  has no effect; the problem persists.

- The problem depends on including Falcon in the build:
  Using "--with-falcon", it happens; leaving that out, all is ok.

- I cannot run my locally generated binaries on our Uppsala host,
  because they are built against "libstdc++.so.6"
  whereas our Uppsala host is still on version 5.

- I can run the Uppsala binaries on my machine (I have versions 5 + 6),
  they do not show the problem.

- My local gcc is version 4.3.1, and the only option seems to be "-O3".

- The Uppsala host uses gcc 3.3.5 with these options:
  CC="ccache gcc -static-libgcc"
  CFLAGS="-g -O3 -mpowerpc -mcpu=powerpc"
  CXX="ccache gcc -static-libgcc"
  CXXFLAGS="-g -O3 -mpowerpc -mcpu=powerpc"

- In Uppsala, we do not have any build host with gcc 4.3.
  The newest we have is 4.2 on AIX (where we don't use it),
  the newest we use are 4.0 and 4.1 versions.

So it is quite likely that this is caused by an incompatibility between Falcon and gcc 4.3, possibly specific to certain platforms.

I will try a build + test in Trondheim using gcc 4.3, but that is not a PPC machine.
[23 Oct 2008 17:01] Joerg Bruehe
Update:

- gcc 4.3 on x86 creates binaries that pass the tests.

- gcc 4.3 on PPC with option "-O1" also creates working binaries.

I will try a tool that should isolate the module which fails when compiled using a higher optimization.
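
One way such an isolation tool could work is bisection over the source files. The following is only a sketch in Python, not the script actually used here: "suite_fails" is a hypothetical callback that would rebuild with the given files at "-O1" and everything else at "-O3", run the suite, and report whether the failure still occurs. It also assumes that exactly one file is responsible.

```python
from typing import Callable, Sequence, Set


def find_culprit(files: Sequence[str],
                 suite_fails: Callable[[Set[str]], bool]) -> str:
    """Return the one file that has to be built at low optimization.

    suite_fails(low_opt) is a hypothetical callback: rebuild with the files
    in low_opt at -O1 and all others at -O3, run the test suite, and return
    True if the failure still occurs. Assumes a single culprit.
    """
    candidates = list(files)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if suite_fails(set(half)):
            # Still failing with `half` de-optimized: look in the other half.
            candidates = candidates[len(half):]
        else:
            # Failure gone: the culprit is among the de-optimized files.
            candidates = half
    return candidates[0]


# Simulated run without any compiler: the "suite" fails unless the
# culprit is among the files built at -O1.
sources = ["Stream.cpp", "StreamSegment.cpp", "SymbolManager.cpp",
           "Sync.cpp", "Synchronize.cpp", "SyncHandler.cpp"]
culprit = find_culprit(
    sources, lambda low_opt: "SymbolManager.cpp" not in low_opt)
print(culprit)
```

With N source files this needs about log2(N) full build + test cycles, which matters when a single suite run takes hours.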
[13 Nov 2008 17:13] Joerg Bruehe
Update:
Using a script from Kent, I could isolate the critical module. It is
   storage/falcon/SymbolManager.cpp

Compiling just this one module using "-O1" while all others use "-O3" solved the issue. In particular, when running "make test-force", it got rid of 1,517 messages of the pattern
  "Aborted waiting on pid file ... after 20 seconds"
(for "master.pid", "master1.pid", or "slave.pid").
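
For comparing runs, these messages can be counted per pid file. A minimal Python sketch, assuming the log lines have the same shape as the warning lines quoted in the original report; the function name is made up for illustration:

```python
import re
from collections import Counter

# Matches the warning as it appears in the test output quoted in this report.
PID_WARNING = re.compile(
    r"Aborted waiting on pid file: '(?P<path>[^']+)' after \d+ seconds")


def count_pid_warnings(log_text: str) -> Counter:
    """Count warnings per pid file name (master.pid, slave.pid, ...)."""
    counts = Counter()
    for match in PID_WARNING.finditer(log_text):
        # Keep only the file name, not the directory part.
        counts[match.group("path").rsplit("/", 1)[-1]] += 1
    return counts


sample = """\
main.lowercase_utf8            [ pass ]              8
Warning;  Aborted waiting on pid file: '/MySQL/REPO/V60/work-6.0/mysql-test/var/run/master.pid' after 20 seconds
main.lowercase_view            [ pass ]            106
Warning;  Aborted waiting on pid file: '/MySQL/REPO/V60/work-6.0/mysql-test/var/run/master.pid' after 20 seconds
"""
counts = count_pid_warnings(sample)
print(counts)
```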

However, I have not yet found a way to automatically set "-O1" for just this one file using the autotools.
Still searching ...
[14 Nov 2008 11:36] Kevin Lewis
Hakan Kuecuekyilmaz wrote:
We had a similar issue with Apple's gcc this February. Back then I opened a bug report (Bug#5760443) at Apple:
 
22-Feb-2008 01:19 PM Hakan Kuecuekyilmaz:
When using gcc 4.0.1 to compile MySQL with the Falcon storage engine, we see regressions in our test suite. In the Falcon project we use exceptions and destructors. In a certain case, a thrown exception fails to call a destructor, which leads to an assertion in our code.

The bug shows up with optimized build only. When using a debug compile, the problem does not show up.

Using gcc 4.2.2 from the fink project does not show the problem.

Details of the bug can be found at
http://bugs.mysql.com/bug.php?id=33184

We are fairly sure that this is a compiler bug in Apple's gcc 4.0.1, as the regression in the test case does not show up on any other platform where gcc is used as the compiler.

==========================

I have Mac/PPC and gcc 4.3.2. Jörg, can you give me your exact build instructions to reproduce the failure?
[14 Nov 2008 11:39] Kevin Lewis
I got it with Linux (Debian) on PPC, using "BUILD/compile-ppc-max". But see the report: I also had it when mimicking a release build, where I exported a source tarball and ran just "configure ; make" on it.

  joerg@debian:/V60/try40065-6.0$ gcc --version
  gcc (Debian 4.3.1-2) 4.3.1
  Copyright (C) 2008 Free Software Foundation, Inc.
  This is free software; see the source for copying conditions. .....

I could not reproduce it on our Linux/PPC host in Uppsala (also Debian, but older) which still uses gcc 3.3.5

Originally, I got hangs during the test suite, as described in the bug report. The tests did "pass"; the issues were purely within the server, and from the stack trace they seemed to occur during shutdown.

With current sources, the hang is gone, but there is obviously still some issue when the tests finish - maybe during server shutdown.  On the outside, it just shows up as "Aborted waiting on pid file ..."

With the problem present, MTR spends an enormous amount of time waiting:
  Spent 3199.187 of 16655 seconds executing testcases     (normal)
  Spent 3549.562 of 18985 seconds executing testcases     (PS)
Using "-O1" for "SymbolManager.cpp", it becomes reasonable:
  Spent 3173.326 of 5931 seconds executing testcases      (normal)
  Spent 3498.277 of 6498 seconds executing testcases      (PS)
(Same source tree)
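
To make the difference explicit, the time spent outside the test cases can be computed from these four summaries. A small Python sketch; the numbers are copied from the lines above, while the labels and the "overhead" helper are only illustrative. Note that this difference covers everything MTR does outside test execution (server start and stop included), not just the pid-file waits.

```python
# (label, seconds executing test cases, total seconds), from the MTR summaries.
runs = [
    ("all -O3, normal", 3199.187, 16655),
    ("all -O3, PS", 3549.562, 18985),
    ("SymbolManager.cpp at -O1, normal", 3173.326, 5931),
    ("SymbolManager.cpp at -O1, PS", 3498.277, 6498),
]


def overhead(executing: float, total: float) -> float:
    """Seconds of the run that were not spent executing test cases."""
    return total - executing


for label, executing, total in runs:
    extra = overhead(executing, total)
    share = 100 * extra / total
    print(f"{label}: {extra:.0f} s outside test cases ({share:.0f}% of the run)")
```

For the normal run this is about 13,456 seconds with the problem present versus about 2,758 seconds with "-O1" for the one module, while the time spent executing test cases is nearly unchanged.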

The sources I used last are the current 6.0 main tree.

What surprises me is that I don't see a function of "SymbolManager.cpp" in the backtrace of the core dumps I got originally (see bug report).  But I am quite sure I did the module isolation with the same sources that had produced the core, so it seems the crash does not happen in the critical module but at some remote place. As I don't really know which part of the binary code acts wrong, anything more would be speculating.

Regards,
and good luck in reproducing!
Joerg Bruehe,  MySQL Build Team
[26 Nov 2008 19:24] Joerg Bruehe
To make it obvious:
This happens using a newer, "future" compiler, not an old one.
The "workaround" would be to go back to an older compiler (provided one exists for the platform), which might cause trouble with the C++ features used by Falcon.

I can avoid the issue by the following patch:

=== modified file 'storage/falcon/Makefile.am'
--- storage/falcon/Makefile.am  2008-10-20 09:16:47 +0000
+++ storage/falcon/Makefile.am  2008-11-21 14:42:01 +0000
@@ -361,7 +361,6 @@ falcon_sources= Agent.cpp Alias.cpp \
                StorageTableShare.cpp \
                Stream.cpp \
                StreamSegment.cpp \
-               SymbolManager.cpp \
                Sync.cpp \
                Synchronize.cpp \
                SyncHandler.cpp \
@@ -397,7 +396,7 @@ ha_falcon_la_LDFLAGS=       -module -rpath $(p
 ha_falcon_la_CXXFLAGS= $(AM_CXXFLAGS) -DMYSQL_DYNAMIC_PLUGIN
 ha_falcon_la_CFLAGS=   $(AM_CFLAGS) -DMYSQL_DYNAMIC_PLUGIN
 ha_falcon_la_LIBADD=   TransformLib/libtransform.la
-ha_falcon_la_SOURCES=  $(falcon_sources)
+ha_falcon_la_SOURCES=  $(falcon_sources) SymbolManager.cpp

 if HAVE_DTRACE
   ha_falcon_la_LIBADD+= falcon_probes.o
@@ -412,13 +411,19 @@ libhafalcon_a_CXXFLAGS=   $(AM_CXXFLAGS)
 libhafalcon_a_CFLAGS=  $(AM_CFLAGS)
 libhafalcon_a_SOURCES= $(falcon_sources)

+EXTRA_LIBRARIES+=      lib_no_opt.a
+lib_no_opt_a_CXXFLAGS= $(AM_CXXFLAGS)
+lib_no_opt_a_CFLAGS=   $(AM_CFLAGS)
+lib_no_opt_a_SOURCES=  SymbolManager.cpp
+
 libfalcon.a:           $(libfalcon_a_LIBADD)
                -rm -f $@
+               CXXFLAGS=`echo $(CXXFLAGS) | sed 's/-O[2-9]/-O1/'` $(MAKE) CXXFLAGS="$$CXXFLAGS" lib_no_opt.a
                if test "$(host_os)" = "netware" ; \
                then \
-                 $(libfalcon_a_AR) $@ $(libfalcon_a_LIBADD) ; \
+                 $(libfalcon_a_AR) $@ $(libfalcon_a_LIBADD) lib_no_opt.a ; \
                else \
-                 for arc in $(libfalcon_a_LIBADD); do \
+                 for arc in $(libfalcon_a_LIBADD) lib_no_opt.a ; do \
                     arpath=`echo $$arc|sed 's|[^/]*$$||'`; \
                     $(AR) t $$arc|sed "s|^|$$arpath|"; \
                  done | sort -u | xargs $(AR) cq $@ ; \

I know this is not nice, calling a sub-make, but it was the only way I found to override the "-O3" which is set in CXXFLAGS.
Setting specific flags for a non-optimized library does not solve it, because those flags appear too early on the command line; CXXFLAGS always comes last.

I even found entries in the automake manual documenting that:
http://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html
http://www.gnu.org/software/automake/manual/html_node/Flag-Variables-Ordering.html#Flag-Va...
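
The ordering problem and the effect of the sed expression in the patch can be illustrated with a short Python sketch. It relies on the documented gcc behaviour that the last "-O" option on the command line is the one that takes effect. The helper names and the sample command lines are made up for illustration; the flag string is the Uppsala CXXFLAGS quoted earlier in this report.

```python
import re
from typing import Optional


def downgrade_opt(cxxflags: str) -> str:
    """Mirror sed 's/-O[2-9]/-O1/': rewrite the first -O2..-O9 to -O1."""
    return re.sub(r"-O[2-9]", "-O1", cxxflags, count=1)


def effective_opt(command_line: str) -> Optional[str]:
    """Return the last -O option on a command line (the one gcc honours)."""
    levels = re.findall(r"(?<!\S)-O\S*", command_line)
    return levels[-1] if levels else None


user_flags = "-g -O3 -mpowerpc -mcpu=powerpc"

# Per-target flags come before CXXFLAGS, so a per-target -O1 is overridden.
per_target = f"g++ -O1 {user_flags} -c SymbolManager.cpp"
print(effective_opt(per_target))

# Rewriting CXXFLAGS itself, as the sub-make in the patch does, changes
# the option that actually wins.
rewritten = f"g++ {downgrade_opt(user_flags)} -c SymbolManager.cpp"
print(effective_opt(rewritten))
```

This is why the patch rewrites CXXFLAGS for a sub-make instead of adding "-O1" to the per-library flags.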

Of course, this patch would apply the hack unconditionally, so it should rather be made dependent on a conditional.

Please tell me whether you want me to proceed along this line.
[27 Nov 2008 7:24] Hakan Küçükyılmaz
Joerg,

I tried on my Mac/PPC with gcc 4.3.2, but I can't even compile:

BUILD/autorun.sh
CC="/sw/bin/ccache /sw/bin/gcc-4"
CXX="/sw/bin/ccache /sw/bin/gcc-4"
./configure --disable-shared --enable-assembler --enable-local-infile --enable-shared --enable-thread-safe-client --libexecdir=/usr/local/mysql/bin --localstatedir=/usr/local/mysql/data --with-big-tables --with-client-ldflags='-static' --with-comment='MySQL-Community-Server' --with-extra-charsets=all --with-plugins=max-no-ndb --prefix=/usr/local/mysql --with-readline --with-ssl --with-libevent --with-zlib-dir=bundled

make -j2

/sw/bin/ccache /sw/bin/gcc-4 -DDEFAULT_BASEDIR=\"/usr/local/mysql\" -DDATADIR="\"/usr/local/mysql/data\"" -DDEFAULT_CHARSET_HOME="\"/usr/local/mysql\"" -DSHAREDIR="\"/usr/local/mysql/share/mysql\"" -DDEFAULT_HOME_ENV=MYSQL_HOME -DDEFAULT_GROUP_SUFFIX_ENV=MYSQL_GROUP_SUFFIX -DDEFAULT_SYSCONFDIR="\"/usr/local/mysql/etc\"" -DHAVE_CONFIG_H -I. -I../include -I../zlib -I../include -I../include -I.    -g   -O3   -fno-omit-frame-pointer   -Wunused-function   -Wunused-label   -Wunused-value   -Wunused-variable   -Wimplicit   -Wreturn-type   -Wswitch   -Wtrigraphs   -Wcomment   -W   -Wchar-subscripts   -Wformat   -Wparentheses   -Wsign-compare   -Wwrite-strings   -Wuninitialized   -Wunused   -DFORCE_INIT_OF_VARS   -mno-fused-madd -D_P1003_1B_VISIBLE -DSIGNAL_WITH_VIO_CLOSE -DSIGNALS_DONT_BREAK_READ -DIGNORE_SIGHUP_SIGQUIT  -DDONT_DECLARE_CXA_PURE_VIRTUAL -MT waiting_threads.o -MD -MP -MF .deps/waiting_threads.Tpo -c -o waiting_threads.o waiting_threads.c
mv -f .deps/thr_lock.Tpo .deps/thr_lock.Po
waiting_threads.c: In function 'wt_resource_id_memcmp':
waiting_threads.c:400: error: size of array 'compile_time_assert' is negative
make[1]: *** [waiting_threads.o] Error 1
make: *** [all-recursive] Error 1

[esslingen:~/work/mysql/mysql-6.0-falcon] hakan$ gcc-4 --version
gcc-4 (GCC) 4.3.2

Regarding your workaround: I think it is quite a "hack". As long as we cannot reproduce this failure on a variety of platforms, I hesitate to vote for your workaround.

Best,

Hakan
[27 Nov 2008 7:35] Hakan Küçükyılmaz
Joerg,

gcc 4.3.1 has been superseded by 4.3.2.

Can you please verify with gcc 4.3.2 on your PPC machine?
[2 Dec 2008 18:09] Joerg Bruehe
I upgraded gcc on my box:
   joerg@debian:~$ gcc --version
   gcc (Debian 4.3.2-1) 4.3.2

I now started a new run.
[4 Dec 2008 16:15] Joerg Bruehe
Bug is still present with gcc 4.3.2

It does not show with gcc 4.1:
> gcc --version
gcc (GCC) 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)

I can switch back to gcc 4.1 for now, but I need a working installation on this box which I can use to verify merges, even in 6.0.

I agree that my proposed patch is a hack, but unless someone really does analyze the assembler code and finds the exact place (and can then fix it in the C++ source), it is the only thing I can propose.

We'll see what happens when more target platforms upgrade the compiler ...
[18 Jun 2009 10:31] Hakan Küçükyılmaz
Got it compiled now on Mac/PPC with gcc 4.3.3 with mysql-6.0 sources from bzr.

BUILD/autorun.sh
export CC="/sw/bin/ccache /sw/bin/gcc-4"
export CXX="/sw/bin/ccache /sw/bin/g++-4"
./configure --disable-shared \
  --enable-assembler \
  --enable-local-infile \
  --enable-shared \
  --enable-thread-safe-client \
  --libexecdir=/usr/local/mysql/bin \
  --localstatedir=/usr/local/mysql/data \
  --with-big-tables \
  --with-client-ldflags='-static' \
  --with-comment='MySQL-Community-Server' \
  --with-extra-charsets=all \
  --with-plugins=max-no-ndb \
  --prefix=/usr/local/mysql \
  --with-readline \
  --with-ssl \
  --with-libevent \
  --with-zlib-dir=bundled
[18 Jun 2009 10:33] Hakan Küçükyılmaz
Joerg,

I found one more thing:
   CXX has to be g++ to get Falcon compiled properly.

Can you give it another try on your Linux/PPC?

Thanks,

Hakan
[10 Aug 2009 16:21] Joerg Bruehe
I sure know "g++" is needed for Falcon's C++ code, always did that.

I am just running a check whether the currently available 6.0 sources still fail with gcc 4.3.2 (Debian 4.3.2-1.1) which is the up-to-date gcc (4.3) for Debian Lenny.
[11 Aug 2009 9:09] Joerg Bruehe
My test finished, and I couldn't reproduce the failure.

Given that I had used gcc 4.1 in the meantime and only now switched back to gcc 4.3, any intermediate change may have made the bug disappear.

There will be plenty of time till we have a RC or even GA of 6.0, and who knows which compiler updates we will get till then.

I am setting this bug to "Can't repeat"; there is no use in pursuing it any further unless it shows up again.