Bug #30366 NDB fails to start on OS X, 64 bit
Submitted: 10 Aug 2007 19:33
Modified: 17 Jan 2008 22:34
Reporter: Joerg Bruehe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1
OS: MacOS (64bit)
Assigned to: Magnus Blåudd
CPU Architecture: Any
Tags: sr5_1

[10 Aug 2007 19:33] Joerg Bruehe
Description:
I do not know how long-standing this issue is,
definitely *not* new in 5.1.21 -
but sadly, I could not find a bug report for it.
(I have not searched for a working version in history,
AFAIR 5.1.19 was failing, 5.1.20 definitely was.)

The problem is specific to OS X ppc-64bit; it has not been observed on any other platform.

The symptom is a bit vague:
When starting the test suite ("mysql-test-run.pl") *without* "--skip-ndbcluster", it hangs.
The hang appears to be indefinite: there is no progress until someone notices it and intervenes.
The existing processes are "ndb_waiter" and "ndb_mgmd",
killing both of them lets the test run finish (no tests attempted),
and "make test-bt" (what we use to run the tests) proceeds to the next run.
If that again involves NDB, the problem occurs again.

The following is a log extract:

Logging: ./mysql-test-run.pl --comment=ps+rowrepl+NDB --force --timer --ps-protocol --mysqld=--binlog-format=row
070809 23:50:47 [Warning] Setting lower_case_table_names=2 because file system for /Users/mysqldev/tmp-200708081852-5.1.21-beta-26112/osx-tiger-ppc-64bit/test/mysql-5.1.21-beta-osx10.4-powerpc-64bit/share/mysql/english/ is case insensitive
MySQL Version 5.1.21

##############################################################################
# ps+rowrepl+NDB
##############################################################################

Using binlog format 'row'
Using ndbcluster when necessary, mysqld supports it
Setting mysqld to support SSL connections
Using MTR_BUILD_THREAD      = 201
Using MASTER_MYPORT         = 12010
Using MASTER_MYPORT1        = 12011
Using SLAVE_MYPORT          = 12012
Using SLAVE_MYPORT1         = 12013
Using SLAVE_MYPORT2         = 12014
Using NDBCLUSTER_PORT       = 12015
Using NDBCLUSTER_PORT_SLAVE = 12016
Using IM_PORT               = 12017
Using IM_MYSQLD1_PORT       = 12018
Using IM_MYSQLD2_PORT       = 12019
Killing Possible Leftover Processes
mysql-test-run: WARNING: Found non pid file master-slow.log in /Users/mysqldev/tmp-200708081852-5.1.21-beta-26112/osx-tiger-ppc-64bit/test/mysql-5.1.21-beta-osx10.4-powerpc-64bit/mysql-test/var/run
Removing Stale Files
Creating Directories
Installing Master Database
Installing Master Database
Installing Slave1 Database
Installing Master Cluster
mysql-test-run: *** ERROR: Failed to wait for start of ndb_mgmd
Autoreleasing /tmp/mysql-test-ports:201
make: [test-bt] Error 1 (ignored)

From current "make test-bt", these runs are affected:
./mysql-test-run.pl --comment=ps+rowrepl+NDB --force --timer --ps-protocol --mysqld=--binlog-format=row
./mysql-test-run.pl --comment=NDB --force --timer --with-ndbcluster-only
./mysql-test-run.pl --force --comment=funcs1_ps --ps-protocol --suite=funcs_1
./mysql-test-run.pl --force --comment=funcs2 --suite=funcs_2
./mysql-test-run.pl --force --comment=partitions --suite=parts

These runs are *not*:
./mysql-test-run.pl --comment=debug --force --timer --skip-ndbcluster --skip-rpl --report-features
   (that was a debug build)
./mysql-test-run.pl --comment=normal --force --timer --skip-ndbcluster --report-features
./mysql-test-run.pl --comment=ps --force --timer --skip-ndbcluster --ps-protocol
./mysql-test-run.pl --comment=normal+rowrepl --force --timer --skip-ndbcluster --mysqld=--binlog-format=row
./mysql-test-run.pl --comment=embedded --force --timer --embedded-server --skip-rpl --skip-ndbcluster
./mysql-test-run.pl --force --comment=rpl --suite=rpl
./mysql-test-run.pl --comment=NIST+normal --force --suite=nist
./mysql-test-run.pl --comment=NIST+ps --force --suite=nist --ps-protocol

(I have not checked why "suite=rpl" and "suite=nist" worked,
even without "--skip-ndbcluster".)

I do *not* think it is a load problem from the parallel 32 + 64 bit build+test runs,
because at least the second and following hangs happened when the 32 bit run had already finished.

How to repeat:
Run a build (including NDB) and test on that platform.

I will save the current build tree here:

  mysqldev@osx-tiger-ppc:tmp-bug#####-5.1.21-beta-build

(using the bug# when I have it).
[13 Aug 2007 7:36] Stewart Smith
Hi Joerg!

Could you please:
- check output of 'ndb_mgm -e "show"' when it "hangs" (pass -c for connectstring for test or set NDB_CONNECTSTRING env variable)
- check (and attach) the cluster log as well as logs for mgm server and data nodes
  (basically *.log in the ndbcluster directory)

This should help in tracking it down.

I gather we don't have a host like this in pb running this sort of build regularly.... :(
[13 Aug 2007 8:01] Joerg Bruehe
I will try to do as requested, but I have to repeat:

This happens while automated builds and tests are running,
so in general we have little chance for manual intervention and analysis.

Currently, the "classic" build is nearly done -
if "advanced" gets into this hang, I can try as requested;
if not, the saved tree must be used to reproduce the bug.
[13 Sep 2007 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[22 Sep 2007 14:13] Joerg Bruehe
Bug was reproduced in a 5.1.22-rc build,
and shown to Cluster support (Stewart).
[29 Oct 2007 16:41] Magnus Blåudd
The mgm client can't connect properly:

osx-tiger-ppc:~/magnus/mysql-5.1.23-beta-pb1577/mysql-test mysqldev$ ../storage/ndb/src/mgmclient/ndb_mgm --ndb-connectstring=host=localhost:10175 -e "show"
Connected to Management Server at: localhost:10175
[29 Oct 2007 16:54] Magnus Blåudd
Repeatable with 64-bit debug compile on osx-tiger-ppc

The ndb_mgmd starts and sets up the listening socket. It does not seem to respond when you connect to it with ndb_mgm, but telnet works. See below.

osx-tiger-ppc:~/magnus/mysql-5.1.23-beta-pb1577/mysql-test mysqldev$ telnet localhost 10175
Connected to localhost.
Escape character is '^]'.
get version

version
id: 327959
major: 5
minor: 1
string: Version 5.1.23 (beta)
[29 Oct 2007 16:55] Magnus Blåudd
But telnet + "get status" hangs half way through.

get status

node status
nodes: 11
node.1.type: NDB
<< hangs here
[29 Oct 2007 20:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/36601

ChangeSet@1.2569, 2007-10-29 21:33:30+01:00, msvensson@pilot.mysql.com +1 -0
  Bug#30366 NDB fails to start on OS X, PPC, 64 bit
   - The errno variable should only be used when the previous socket
     write failed, it should be regarded as undefined at other times
[29 Oct 2007 20:36] Magnus Blåudd
The client was hanging halfway through the response. It would probably be better if the server closed the connection when a timeout has occurred.
[29 Oct 2007 20:41] Magnus Blåudd
Something like this, but preferably for all our users of SocketServer.

msvensson@pilot:~/mysql/my51-ndb-bug30366/storage/ndb/src/common$ bk -r diffs -u
===== storage/ndb/src/mgmsrv/Services.cpp 1.95 vs edited =====
--- 1.95/storage/ndb/src/mgmsrv/Services.cpp    2007-07-11 14:36:40 +02:00
+++ edited/storage/ndb/src/mgmsrv/Services.cpp  2007-10-29 21:40:11 +01:00
@@ -349,6 +349,10 @@ MgmApiSession::runSession()
 
     m_parser->run(ctx, *this);
 
+    if (m_output->timedout() ||
+        m_input->timedout())
+      m_stop= true;
+
     if(ctx.m_currentToken == 0)
     {
       NdbMutex_Unlock(m_mutex);
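
The idea in the diff above (stop the session once either stream reports a timeout, rather than leaving the client waiting) can be illustrated in isolation. The following is only a minimal sketch: `Stream`, `Session`, `timed_out`, `handled` and `run` are hypothetical names invented for this illustration and are not the NDB classes; only the `timedout()` check mirrors the diff.

```cpp
#include <cassert>

// Hypothetical stand-in for an input or output stream that remembers
// whether its last operation timed out.
struct Stream {
  bool timed_out = false;
  bool timedout() const { return timed_out; }
};

// Hypothetical session: handles commands until a stream reports a timeout.
struct Session {
  Stream input;
  Stream output;
  bool stop = false;
  int handled = 0;

  // run_one processes a single command and may set a timeout flag.
  template <typename Fn>
  void run(Fn run_one) {
    while (!stop) {
      run_one(*this);
      ++handled;
      // Same check as in the diff: a timeout on either stream ends the
      // session so the peer sees a closed connection instead of a hang.
      if (output.timedout() || input.timedout())
        stop = true;
    }
  }
};
```

In this sketch, a command handler that sets `output.timed_out` causes `run` to return right after that command instead of looping again.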
[26 Nov 2007 17:54] Magnus Blåudd
Pushed to mysql-5.1-ndb
[4 Dec 2007 8:08] Mattias Jonsson
I can verify this on an intel macbook with Mac OS X 10.5.1 (uname -a: Darwin witty 9.1.0 Darwin Kernel Version 9.1.0: Wed Oct 31 17:46:22 PDT 2007; root:xnu-1228.0.2~1/RELEASE_I386 i386).

The patch works; now I can finally run the full test suite on my new MacBook!
[10 Dec 2007 23:24] Omer Barnir
Root Cause Analysis
-------------------
The problem was the result of a change made on March 22, 2007.
The resulting behavior differs between platforms, so the problem was observed only on OS X.
From a testing point of view, once packaged verification is in place, similar problems will be caught.
[15 Jan 2008 14:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41012

ChangeSet@1.2652, 2008-01-15 15:01:21+01:00, msvensson@pilot.mysql.com +1 -0
  Bug#30366 NDB fails to start on OS X, PPC, 64 bit
     - The errno variable should only be used when the previous socket
       write failed, it should be regarded as undefined at other times
  
  OutputStream.cpp:
    Only use "errno" after the attempt to write to the socket has failed
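
The rule stated in this changeset (look at "errno" only after the write has actually failed) can be shown with a small stand-alone example. This is a minimal sketch under assumptions, not the code from OutputStream.cpp: `WriteResult` and `write_all` are hypothetical names, and treating EAGAIN/EWOULDBLOCK as a timeout assumes a socket with a send timeout or non-blocking mode.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical result type for this sketch; not taken from the NDB sources.
enum class WriteResult { Ok, TimedOut, Failed };

// Writes len bytes to fd. errno is consulted only after write() has
// reported failure by returning -1; after a successful call its value
// is stale and says nothing about the write.
WriteResult write_all(int fd, const char* buf, std::size_t len) {
  assert(buf != nullptr);
  std::size_t done = 0;
  while (done < len) {
    ssize_t n = ::write(fd, buf + done, len - done);
    if (n > 0) {
      // Success: do not inspect errno here.
      done += static_cast<std::size_t>(n);
      continue;
    }
    if (n == 0)
      return WriteResult::Failed;  // no progress on a non-empty write
    // n == -1: only now is errno meaningful.
    if (errno == EINTR)
      continue;  // interrupted by a signal, retry
    if (errno == EAGAIN || errno == EWOULDBLOCK)
      return WriteResult::TimedOut;
    return WriteResult::Failed;
  }
  return WriteResult::Ok;
}
```

A leftover errno value from an earlier, unrelated call does not affect the result of a successful write in this sketch, which is the platform-dependent trap the changeset comment describes.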
[16 Jan 2008 16:03] Magnus Blåudd
Pushed to mysql-5.1-release
[17 Jan 2008 22:34] Jon Stephens
Documented bugfix in 5.1.23 changelog.
[24 Jan 2008 11:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41198

ChangeSet@1.2657, 2008-01-24 12:06:40+01:00, tomas@whalegate.ndb.mysql.com +1 -0
  Bug#30366 (recommit) NDB fails to start on OS X, PPC, 64 bit
  - The errno variable should only be used when the previous socket
    write failed, it should be regarded as undefined at other times
[7 Feb 2008 9:51] Magnus Blåudd
Pushed also to mysql-5.1-ndb, mysql-5.1-telco-6.2, mysql-5.1-telco-6.3 and mysql-5.1-telco-6.4
[20 Feb 2008 16:02] Bugs System
Pushed into 5.1.24-rc
[20 Feb 2008 16:02] Bugs System
Pushed into 6.0.5-alpha
[25 Feb 2008 15:58] Bugs System
Pushed into 5.1.24-rc
[25 Feb 2008 16:04] Bugs System
Pushed into 6.0.5-alpha
[30 Mar 2008 18:57] Jon Stephens
Fix also documented for 6.0.5.