Bug #42789 ndb test programs crash on Solaris 10 compiled with Sun Studio 12
Submitted: 12 Feb 2009 13:20 Modified: 9 Sep 2009 13:20
Reporter: Guido Ostkamp
Status: Closed
Category:Tests: Cluster Severity:S7 (Test Cases)
Version:mysql-5.1-telco-7.0 OS:Any
Assigned to: Jørgen Austvik CPU Architecture:Any
Tags: 6.4 -> 6.4.3

[12 Feb 2009 13:20] Guido Ostkamp
Description:
The NDB test program flexAsynch crashes with SIGSEGV immediately after startup, on its first attempt to print something.

> dbx ./flexAsynch 
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading flexAsynch
Reading ld.so.1
Reading libmtmalloc.so.1
Reading libndbclient.so.4.0.0
Reading libpthread.so.1
Reading libthread.so.1
Reading librt.so.1
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libm.so.2
Reading libCstd.so.1
Reading libCrun.so.1
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -t 2
Running: flexAsynch -t 2 
(process id 2755)
Reading libc_psr.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in NdbOut::endline at line 70 in file "NdbOut.cpp"
   70     m_out->println("");
(dbx) where
current thread: t@1
=>[1] NdbOut::endline(this = 0x1001128b0), line 70 in "NdbOut.cpp"
  [2] endl(_NdbOut = CLASS), line 103 in "NdbOut.hpp"
  [3] NdbOut::operator<<(this = 0x1001128b0, _f = 0x10000b9d8 = &endl(NdbOut&)), line 98 in "NdbOut.hpp"
  [4] main(argc = 3, argv = 0xffffffff7ffff538), line 195 in "flexAsynch.cpp"
(dbx) print m_out
m_out = (nil)

It turns out that m_out, a public member variable of class NdbOut, is NIL. It appears that it should have been initialized in storage/ndb/src/common/util/NdbOut.cpp through

  static FileOutputStream ndbouts_fileoutputstream(stdout);
  NdbOut ndbout(ndbouts_fileoutputstream);

However, it seems that something is not properly set up at the point where this initialization takes place, assuming this code is executed at all, which we are not sure about.
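
For illustration, one common way to stop depending on when (or whether) file-scope initializers run is "construct on first use". The following is only a minimal sketch with stand-in types: DemoStream, DemoOut and demo_out() are hypothetical names, not the real FileOutputStream/NdbOut classes, and this is not what NdbOut.cpp does.

```cpp
#include <cassert>
#include <cstdio>

// Stand-in for FileOutputStream: wraps a FILE* and counts lines written.
class DemoStream {
public:
    explicit DemoStream(FILE* f) : m_file(f), m_lines(0) {}
    void println(const char* s) { std::fprintf(m_file, "%s\n", s); ++m_lines; }
    int lines() const { return m_lines; }
private:
    FILE* m_file;
    int m_lines;
};

// Stand-in for NdbOut: holds a pointer to the stream it writes to.
class DemoOut {
public:
    explicit DemoOut(DemoStream& s) : m_out(&s) {}
    void endline() {
        assert(m_out != nullptr);  // fail loudly instead of dereferencing null
        m_out->println("");
    }
    DemoStream* m_out;
};

// Construct on first use: both function-local statics are built the first
// time demo_out() is called, in the order written, so m_out is set before
// any caller can see the object.
DemoOut& demo_out() {
    static DemoStream stream(stdout);
    static DemoOut out(stream);
    return out;
}
```

A caller would write demo_out().endline() instead of using a global object directly. The trade-off is that every access goes through a function call.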

The compilation was done using
CC=cc CXX=CC CFLAGS="-g -mt -m64" CXXFLAGS="-g -mt -m64" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=all --without-docs --without-man --with-debug=full

We are using MySQL-Cluster revision martin.skold@mysql.com-20090211204523-03nx13fjekybwez2 dated Wed 2009-02-11 21:45:23 +0100 on branch mysql-5.1-telco-6.4, built with Sun Studio 12.

We managed to work around this problem with the ugly patch listed below, which sets m_out again at runtime in main(), but the problem needs a real solution. The problem might also affect other tools, as class NdbOut might be used elsewhere.

=== modified file 'storage/ndb/test/ndbapi/flexAsynch.cpp'
--- storage/ndb/test/ndbapi/flexAsynch.cpp      2008-11-04 17:15:38 +0000
+++ storage/ndb/test/ndbapi/flexAsynch.cpp      2009-02-12 12:35:43 +0000
@@ -15,6 +15,7 @@
 
 
 
+#include <stdio.h>
 #include <ndb_global.h>
 #include "NdbApi.hpp"
 #include <NdbSchemaCon.hpp>
@@ -30,6 +31,7 @@
 
 #include <NdbTest.hpp>
 #include <NDBT_Stats.hpp>
+#include <OutputStream.hpp>
 
 #define MAX_PARTS 4 
 #define MAX_SEEK 16 
@@ -182,6 +184,9 @@
   int                   tLoops=0;
   int                   returnValue = NDBT_OK;
 
+FileOutputStream ndbouts_fileoutputstream(stdout);
+ndbout.m_out = &ndbouts_fileoutputstream;
+
   flexAsynchErrorData = new ErrorData;
   flexAsynchErrorData->resetErrorCounters();

Best regards

Guido Ostkamp

How to repeat:
Compile using Sun Studio 12 on Solaris 10 and execute flexAsynch.
[9 Mar 2009 16:41] Maitrayi Sabaratnam
I could not reproduce the problem with the specified version and compiler (I executed the test 500 times in a loop and also ran it through the debugger).

We need to know which version the NDB API application was compiled against. Is it the same version as the binaries? Was the test program recompiled when the new binaries were taken into use? Any difference might have caused the problem.
[11 Mar 2009 15:11] Guido Ostkamp
Hello Maitrayi,

It was all compiled from the same version. I just verified, using the most current bazaar version frazer@mysql.com-20090309160754-r14u7v0om9ajnoii dated Mon 2009-03-09 16:07:54 +0000, that the bug still exists.

I compiled as outlined in my earlier message. Then, in .../storage/ndb/test/ndbapi, I ran 'make flexAsynch'. I took the binary 'flexAsynch' (not the shell script of the same name) from storage/ndb/test/ndbapi/.libs and ran it both directly and in dbx. It still crashes:

$ dbx ./flexAsynch
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading flexAsynch
Reading ld.so.1
Reading libmtmalloc.so.1
Reading libndbclient.so.4.0.0
Reading libpthread.so.1
Reading libthread.so.1
Reading librt.so.1
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libm.so.2
Reading libCstd.so.1
Reading libCrun.so.1
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -t 2
Running: flexAsynch -t 2 
(process id 14354)
Reading libc_psr.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in NdbOut::endline at line 72 in file "NdbOut.cpp"
   72     m_out->println("");
(dbx) where
current thread: t@1
=>[1] NdbOut::endline(this = ???) (optimized), at 0xffffffff7f2fb680 (line ~72) in "NdbOut.cpp"
  [2] NdbOut::operator<<(this = ???, _f = ???) (optimized), at 0x100009b7c (line ~98) in "NdbOut.hpp"
  [3] main(argc = ???, argv = ???) (optimized), at 0x100004e3c (line ~195) in "flexAsynch.cpp"
(dbx) quit

$ ldd -r ./flexAsynch
        libndbclient.so.4 =>     /export/home/wsch/6.4_2009_01_29/lib/mysql/libndbclient.so.4
        libpthread.so.1 =>       /lib/sparcv9/libpthread.so.1
        libthread.so.1 =>        /lib/sparcv9/libthread.so.1
        librt.so.1 =>    /lib/sparcv9/librt.so.1
        libgen.so.1 =>   /lib/sparcv9/libgen.so.1
        libsocket.so.1 =>        /lib/sparcv9/libsocket.so.1
        libmtmalloc.so.1 =>      /usr/lib/sparcv9/libmtmalloc.so.1
        libnsl.so.1 =>   /lib/sparcv9/libnsl.so.1
        libm.so.2 =>     /lib/sparcv9/libm.so.2
        libCstd.so.1 =>  /usr/lib/sparcv9/libCstd.so.1
        libCrun.so.1 =>  /usr/lib/sparcv9/libCrun.so.1
        libc.so.1 =>     /lib/sparcv9/libc.so.1
        libaio.so.1 =>   /lib/64/libaio.so.1
        libmd.so.1 =>    /lib/64/libmd.so.1
        libmp.so.2 =>    /lib/64/libmp.so.2
        libscf.so.1 =>   /lib/64/libscf.so.1
        libdoor.so.1 =>  /lib/64/libdoor.so.1
        libuutil.so.1 =>         /lib/64/libuutil.so.1
        /platform/SUNW,Netra-T2000/lib/sparcv9/libc_psr.so.1
        /platform/SUNW,Netra-T2000/lib/sparcv9/libmd_psr.so.1

Our running platform is installed in /export/home/wsch/6.4_2009_01_29/... so the path for libndbclient.so is ok. The other libraries are system libraries.

This is SunOS pelton1 5.10 Generic_137111-08 sun4v sparc SUNW,Netra-T2000.

Best regards

Guido
[16 Mar 2009 16:16] Maitrayi Sabaratnam
Hi

I still think that the ndbclient library being linked is outdated (the ABI might have changed afterwards).

There are 2 ways to verify this hypothesis:
1) run 'make install' to update the lib found in /export/home/wsch/6.4_2009_01_29/...
2) explicitly link the current library (from your 11th of March version):
  - run the shell script flexAsynch from ndbapi (it sets the correct LD_LIBRARY_PATH before calling the flexAsynch binary), or
  - set LD_LIBRARY_PATH to your current version's storage/ndb/src/.libs
[23 Mar 2009 14:37] Maitrayi Sabaratnam
Feedback is needed on my comments from the 17th of March.
[23 Apr 2009 7:51] Guido Ostkamp
I have repeated the test with the current version revision-id: pekka@mysql.com-20090417190212-yifsmutw0fef59qc dated Fri 2009-04-17 22:02:12 +0300 after switching to new branch mysql-cluster-7.0.

I used ~/mysql_3rd/storage/ndb/test/ndbapi/flexAsynch (the shell script) this time, which automatically sets LD_LIBRARY_PATH (as you suggested). The effect is still the same:

$ cd /export/home/ostkamp/mysql_3rd/storage/ndb/test/ndbapi
$ ./flexAsynch -t 2
Segmentation Fault (core dumped)
dbx .libs/flexAsynch /TspCore/core.flexAsynch.22988.1240472997 
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading flexAsynch
core file header read successfully
Reading ld.so.1
Reading libmtmalloc.so.1
Reading libndbclient.so.4.0.0
Reading libpthread.so.1
Reading libthread.so.1
Reading librt.so.1
Reading libgen.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading libm.so.2
Reading libCstd.so.1
Reading libCrun.so.1
Reading libc.so.1
Reading libaio.so.1
Reading libmd.so.1
Reading libc_psr.so.1
t@1 (l@1) program terminated by signal SEGV (no mapping at the fault address)
Current function is NdbOut::endline (optimized)
   72     m_out->println("");
(dbx) where
current thread: t@1
=>[1] NdbOut::endline(this = ???) (optimized), at 0xffffffff7f2fbe80 (line ~72) in "NdbOut.cpp"
  [2] NdbOut::operator<<(this = ???, _f = ???) (optimized), at 0x100009b7c (line ~98) in "NdbOut.hpp"
  [3] main(argc = ???, argv = ???) (optimized), at 0x100004e3c (line ~195) in "flexAsynch.cpp"
(dbx) quit

Please let me know if you need additional information.

Regards

Guido Ostkamp
[4 Sep 2009 12:41] Jørgen Austvik
Also seen elsewhere:

If you use NDB API to connect to a wrongly configured cluster, you can get a core dump instead of an error message.

Code that connects to the cluster:

---------8<------------------8<------------------8<------------------8<---------
    ndb_init();

    vector<Ndb_cluster_connection *> connections;
    for (long i = 0; i < threads; i++) {
        cout << "Connecting thread " << i << "..." << endl;
        Ndb_cluster_connection *conn = new Ndb_cluster_connection(connectString.c_str());
        if (conn->connect(4, 5, 1)) {
            cout << "Unable to connect to cluster within 30 secs." << endl;
            exit(-1);
        }

        cout << "We have a connection, wait until cluster ready..." << endl;
        // Optionally connect and wait for the storage nodes (ndbd's)
        if (conn->wait_until_ready(30, 0) < 0) {
            std::cout << "Cluster was not ready within 30 secs.\n";
            exit(-1);
        }
        connections.push_back(conn);
    }
---------8<------------------8<------------------8<------------------8<---------

With a correct configuration this works fine, but on configuration errors such as "Configuration error: Error : Could not alloc node id at localhost port 1186: No free node id found for mysqld(API)", my NDB API client application dumps core:

---------8<------------------8<------------------8<------------------8<---------
Current function is NdbOut::operator<<
   61   NdbOut::operator<<(const char* val){ m_out->print("%s", val ? val : "(null)"); return * this; }
(dbx) print val
val = 0x80a6aa0 "Configuration error: Error : Could not alloc node id at localhost port 1186: No free node id found for mysqld(API)."
(dbx) print m_out
m_out = (nil)
(dbx) where
current thread: t@1
=>[1] NdbOut::operator<<(this = 0xfef37ec4, val = 0x80a6aa0 "Configuration error: Error : Could not alloc node id at localhost port 1186: No free node id found for mysqld(API)."), line 61 in "NdbOut.cpp"
  [2] Ndb_cluster_connection_impl::connect(this = 0x80a6220, no_retries = 4, retry_delay_in_seconds = 5, verbose = 1), line 769 in "ndb_cluster_connection.cpp"
  [3] Ndb_cluster_connection::connect(this = 0x80b81f8, no_retries = 4, retry_delay_in_seconds = 5, verbose = 1), line 780 in "ndb_cluster_connection.cpp"
  [4] main(0xa, 0x8047494, 0x80474c0, 0x8053e48), at 0x8055403 
---------8<------------------8<------------------8<------------------8<---------
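
The crash site is an unchecked dereference of m_out. Purely as an illustration (this is not the fix that was committed for this bug), a guarded variant of such an operator could look like the sketch below. DemoStream, GuardedOut and check_guarded_out() are hypothetical stand-ins, not NDB API classes; the guard turns a missing stream into a dropped message instead of a SIGSEGV.

```cpp
#include <cassert>
#include <string>

// Stand-in output stream that collects text in memory so the effect
// of each call is easy to inspect.
class DemoStream {
public:
    void print(const char* s) { m_text += s; }
    const std::string& text() const { return m_text; }
private:
    std::string m_text;
};

// Stand-in for an NdbOut-like class whose stream pointer may be unset.
class GuardedOut {
public:
    GuardedOut() : m_out(nullptr) {}
    explicit GuardedOut(DemoStream& s) : m_out(&s) {}

    GuardedOut& operator<<(const char* val) {
        // Only touch the stream if it has been set.
        if (m_out != nullptr) {
            m_out->print(val ? val : "(null)");
        }
        return *this;
    }

    DemoStream* m_out;
};

// Small self-check that doubles as a usage example.
void check_guarded_out() {
    GuardedOut unset;                // m_out is null, as in the dump above
    unset << "Configuration error";  // dropped, no dereference

    DemoStream stream;
    GuardedOut ok(stream);
    ok << "hello" << static_cast<const char*>(nullptr);
    assert(stream.text() == "hello(null)");
}
```

Note that a guard like this hides the error message rather than printing it, so it would only mask the underlying initialization problem.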
[8 Sep 2009 13:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82685

2983 Jorgen Austvik	2009-09-08
      bug#42789: initialize ndbout
[8 Sep 2009 13:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82695

2987 Jorgen Austvik	2009-09-08
      bug#42789: initialize ndbout
[8 Sep 2009 13:54] Bugs System
Pushed into 5.1.37-ndb-6.3.27 (revid:jorgen.austvik@sun.com-20090908133858-dldnac6vc0fr6qiy) (version source revid:jorgen.austvik@sun.com-20090908133858-dldnac6vc0fr6qiy) (merge vers: 5.1.37-ndb-6.3.27) (pib:11)
[8 Sep 2009 13:55] Bugs System
Pushed into 5.1.37-ndb-7.0.8 (revid:jorgen.austvik@sun.com-20090908134812-7pnc0kkap433qbjy) (version source revid:jorgen.austvik@sun.com-20090908134812-7pnc0kkap433qbjy) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)
[8 Sep 2009 13:55] Bugs System
Pushed into 5.1.35-ndb-7.1.0 (revid:jorgen.austvik@sun.com-20090908135124-m74z86wuwaqsyzfi) (version source revid:jorgen.austvik@sun.com-20090908135124-m74z86wuwaqsyzfi) (merge vers: 5.1.35-ndb-7.1.0) (pib:11)
[8 Sep 2009 18:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82725
[8 Sep 2009 18:20] Bugs System
Pushed into 5.1.35-ndb-7.1.0 (revid:magnus.blaudd@sun.com-20090908181903-js6r7i1yzxyaqu9k) (version source revid:magnus.blaudd@sun.com-20090908181903-js6r7i1yzxyaqu9k) (merge vers: 5.1.35-ndb-7.1.0) (pib:11)
[9 Sep 2009 13:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82813

2992 Jonas Oreland	2009-09-09
      ndb - bug#42789
        reintroduce ndbouts_fileoutputstream allocated statically (but initialized by ndb_init)
        to avoid memory leak (as reported by valgrind)
      
        also, while i'm at it, create a NdbOut_Init() so that ndb_init() doesnt have to be so contaminated with NdbOut internals
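
The patch contents are not shown in this thread, so the following is only a sketch of the general idea the commit message describes: the stream lives in static storage (no heap allocation, hence nothing for valgrind to report as leaked), but it is only constructed when an explicit init function is called. All names here (DemoStream, DemoOut, DemoOut_Init, demo_init) are hypothetical stand-ins, and the use of placement new is an assumption, not necessarily how NdbOut_Init() is implemented.

```cpp
#include <cassert>
#include <cstdio>
#include <new>

// Stand-in for FileOutputStream.
class DemoStream {
public:
    explicit DemoStream(FILE* f) : m_file(f) {}
    void println(const char* s) { std::fprintf(m_file, "%s\n", s); }
private:
    FILE* m_file;
};

// Stand-in for NdbOut. It has no constructor, so the global below is
// simply zero-initialized at load time and m_out starts out null.
struct DemoOut {
    DemoStream* m_out;
    void endline() {
        assert(m_out != nullptr);  // catches use before initialization
        m_out->println("");
    }
};

DemoOut demo_out;  // static storage: m_out == nullptr until init runs

// Raw, suitably aligned static storage; no object is constructed here.
alignas(DemoStream) unsigned char stream_storage[sizeof(DemoStream)];

// Hypothetical counterpart to NdbOut_Init(): constructs the stream in the
// static buffer and points the global at it. Safe to call more than once.
void DemoOut_Init() {
    if (demo_out.m_out == nullptr) {
        demo_out.m_out = new (stream_storage) DemoStream(stdout);
    }
}

// Hypothetical counterpart to ndb_init(): delegates to the init function
// so it does not need to know about the stream internals.
void demo_init() {
    DemoOut_Init();
}
```

An application would call demo_init() once at the start of main(), before the first use of demo_out, much like the ndb_init() call in the client code earlier in this report.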
[9 Sep 2009 13:20] Jon Stephens
Test failure, no user-facing changes -> nothing to document; closed without taking further action.