Bug #44342 ndbmtd on startup: DBLQH:..during LogFileOperationRecord state 17. OS errno:5
Submitted: 17 Apr 2009 11:14 Modified: 25 May 2009 13:39
Reporter: Henrik Ingo Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:mysql-5.1-telco-6.4 OS:Linux
Assigned to: CPU Architecture:Any
Tags: mysql-5.1.32-ndb-6.4.3-innodb-linux-i686

[17 Apr 2009 11:14] Henrik Ingo
Description:
Using 6.4.3 from severalnines.com sandbox package, on a 2 core lenovo T61 laptop (uname -a  ... 2.6.24-23-generic #1 SMP ... i686 GNU/Linux), ndbmtd shuts down on initial start.

This has happened fairly consistently, but some weeks ago when I wanted to actually file a bug, I could not reproduce this (hence the somewhat late bug report).

Attached is ndb_error_report. I had to comment out RealTimeScheduler=1 and ODirect=1 in config.ini to make it read config.ini, they were uncommented during cluster startup.

How to repeat:
# ./start-cluster-initial.sh
Starting management server on localhost ...
2009-04-17 13:12:15 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.32 ndb-6.4.3-beta
2009-04-17 13:12:15 [MgmSrvr] INFO     -- Loaded config from '/usr/local/mysqlcluster-64/scripts/../config/ndb_47_config.bin.1'
Sleeping for three seconds
Starting data nodes on  ...
Waiting for cluster to start...

yadayada...

Data nodes are starting. This can take a few minutes...
waitNodeState(STARTED, -1) timeout after 856 attempts
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from 127.0.0.1)
id=3 (not connected, accepting connect from 127.0.0.1)

[ndb_mgmd(MGM)] 1 node(s)
id=47   @localhost  (mysql-5.1.32 ndb-6.4.3)

[mysqld(API)]   10 node(s)
id=4 (not connected, accepting connect from 127.0.0.1)
id=5 (not connected, accepting connect from 127.0.0.1)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
id=9 (not connected, accepting connect from any host)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)

yadayada...

2009-04-17 13:14:35 [NdbApi] INFO     -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 1006  Illegal reply from server line: 2803
2009-04-17 13:14:35 [NdbApi] INFO     -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 1006  Illegal reply from server line: 2803
2009-04-17 13:14:35 [NdbApi] INFO     -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 1006  Illegal reply from server line: 2803
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from 127.0.0.1)
id=3 (not connected, accepting connect from 127.0.0.1)

[ndb_mgmd(MGM)] 1 node(s)
id=47   @localhost  (mysql-5.1.32 ndb-6.4.3)

[mysqld(API)]   10 node(s)
id=4 (not connected, accepting connect from 127.0.0.1)
id=5 (not connected, accepting connect from 127.0.0.1)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
id=9 (not connected, accepting connect from any host)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)

...

Suggested fix:
Workaround: use ndbd instead of ndbmtd, it starts fine every time.

Real fix: ndbmtd should start without errors.
[17 Apr 2009 11:15] Henrik Ingo
ndb_error_report

Attachment: ndb_error_report_20090417132742.tar.bz2 (application/octet-stream, text), 121.84 KiB.

[17 Apr 2009 11:31] Jonas Oreland
logs you uploaded shows error occurring once, with ndbd NOT ndbmtd.
so statement that it's only ndbmtd is false.

generally "OS error code 5: Input/output error" is a OS error.
but we have made some work-around, and identified (for various) customers
some linux-kernel bugs/fixes addresses these.

a "dmesg" would be interesting.

also, customers experiencing this problem has gotten it on 6.3

---

conclusion: to make more progress on this we need more elements.
[17 Apr 2009 11:41] Henrik Ingo
Äsch... Sorry about this one guys.

What happens is actually a combination of [normal startup problem] + [old ndb_3_error.log]. I now realise the current problem does not create an error.log, I was just looking at an old file.

My current error is simply:

2009-04-17 14:34:50 [ndbd] ERROR    -- Unable to setup transporter service port: 192.168.0.100:0!
Please check if the port is already used,
(perhaps the node is already running)
globalTransporterRegistry.start_service() failed
2009-04-17 14:34:50 [ndbd] ALERT    -- Node 3: Forced node shutdown completed.

This would likely be because I'm on 3G today, so the interface that normally would have that IP address is not up.

Feel free to close the bug. (The ndb_3_error.log file might actually have something to do with the problem I used to have, but today is not a reproduction of that one, unfortunately.)
[17 May 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[18 May 2009 5:52] Henrik Ingo
This bug should be closed.

As explained above, the error was because I was using 3G, while in my config.ini I actually had specified some IP addresses from my internal LAN. (And then mistakenly looked at some old error file.)