Description:
Hi All -
On FreeBSD 7.1, with a custom-compiled MySQL (based on the source and makefiles from 'sandbox'), the data nodes fail to connect when they are on separate computers and run the multi-threaded data node (ndbmtd).
They do connect OK without the multithreaded data node.
There is no firewall on the hosts.
Communication seems OK during startup until start phase 5.
The error logs on the data nodes report the following, then all nodes shut down:
From the primary node:
Time: Monday 30 March 2009 - 18:53:53
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: DBTC (Line: 501) 0x0000000a
Program: /usr/mogreet_distro/mysql-5.1.32-ndb-6.4.3/libexec/ndbmtd
Pid: 72519
Trace: /data/cluster6.4.3/ndb/ndb_21_trace.log.2 /data/cluster6.4.3/ndb/ndb_21_trace.log.2_t1 /data/cluster6.4.3/ndb/ndb_21_t
From the second node:
2009-03-30 12:44:07 [ndbd] INFO -- Node 21 disconnected
2009-03-30 12:44:07 [ndbd] INFO -- QMGR (Line: 2908) 0x0000000e
2009-03-30 12:44:07 [ndbd] INFO -- Error handler startup shutting down system
2009-03-30 12:44:07 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-03-30 12:44:07 [ndbd] INFO -- Angel received ndbd startup failure count 1.
2009-03-30 12:44:07 [ndbd] ALERT -- Node 22: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
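For anyone triaging several of these logs, here is a minimal sketch of how the node id, start phase and error code could be pulled out of the "Forced node shutdown" ALERT line. The function name `parse_forced_shutdown` and the regular expression are my own assumptions, written against the one line pasted above, not a format guaranteed by NDB.

```python
import re

# Pattern written against the ALERT line shown above. "Occured" is spelled
# the way the log spells it. This is an assumption about the log format,
# based on a single sample line.
ALERT_RE = re.compile(
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Occured during startphase (?P<phase>\d+)\. "
    r"Caused by error (?P<error>\d+)"
)


def parse_forced_shutdown(line):
    """Return (node_id, start_phase, error_code), or None if the line does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return (
        int(match.group("node")),
        int(match.group("phase")),
        int(match.group("error")),
    )


# The ALERT line from the second node, copied from the log above.
sample = (
    "2009-03-30 12:44:07 [ndbd] ALERT -- Node 22: Forced node shutdown completed. "
    "Occured during startphase 5. Caused by error 2308: 'Another node failed during "
    "system restart, please investigate error(s) on other node(s)(Restart error). "
    "Temporary error, restart node'."
)

print(parse_forced_shutdown(sample))  # prints (22, 5, 2308)
```

Lines that are not a forced-shutdown ALERT (for example the "Node 21 disconnected" INFO line) simply return None.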
Here is the (test) config file:
#Note: Production needs two mgmd servers
# but while we get it figured out, only one.
[NDB_MGMD]
id=13
#chimay.mogreet.com
hostname=192.168.100.125
datadir=/data/cluster6.4.3/ndb_mgmd
ArbitrationRank=1
[NDBD DEFAULT]
NoOfReplicas=2
datadir=/data/cluster6.4.3/ndb
DataMemory = 10G #Default for production
IndexMemory = 200M
NoOfFragmentLogFiles = 24
FragmentLogFileSize = 16M
BackupDataDir=/mnt/nas/databases/backup
StartFailureTimeout=1000000
HeartbeatIntervalDbDb=1500 #ATR tuning
# Note: Production needs four ndb nodes.
[NDBD]
id = 21
hostname=192.168.100.125
[NDBD]
id = 22
#duvel
hostname=192.168.100.126
[MYSQLD]
hostname=192.168.100.125
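As a quick sanity check on the file above, here is a rough sketch that lists the data nodes and confirms their count is a multiple of NoOfReplicas. The parser and the helper names (`parse_cluster_config`, `data_node_summary`) are hypothetical and only handle the simple layout shown here. Python's standard configparser is not used because the [NDBD] section name repeats.

```python
def parse_cluster_config(text):
    """Parse config.ini-style text into a list of (section_name, options) pairs.

    Section names repeat ([NDBD] appears once per data node), so a list is
    returned instead of a dict keyed by section name.
    """
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].strip().upper(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def data_node_summary(sections):
    """Collect data node ids and hosts, and check them against NoOfReplicas."""
    defaults = {}
    nodes = []
    for name, options in sections:
        if name == "NDBD DEFAULT":
            defaults = options
        elif name == "NDBD":
            nodes.append(options)
    replicas = int(defaults["NoOfReplicas"])
    return {
        "replicas": replicas,
        "node_ids": [int(node["id"]) for node in nodes],
        "hosts": [node["hostname"] for node in nodes],
        "even_groups": len(nodes) % replicas == 0,
    }


# Excerpt of the test config from this report.
CONFIG = """
[NDBD DEFAULT]
NoOfReplicas=2
DataMemory = 10G #Default for production
[NDBD]
id = 21
hostname=192.168.100.125
[NDBD]
id = 22
#duvel
hostname=192.168.100.126
"""

print(data_node_summary(parse_cluster_config(CONFIG)))
```

With this config the check passes (two data nodes, two replicas), so the node layout itself does not look like the cause.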
How to repeat:
1) start mgmd server on node Chimay
/usr/mogreet_distro/mysql-5.1.32-ndb-6.4.3/libexec/ndb_mgmd --defaults-file=/data/configs/my_cluster-6.4.3.cnf --initial --ndb-nodeid=13
2) start mgm on Chimay
bin/ndb_mgm --defaults-file=/data/configs/my_cluster-6.4.3.cnf
3) start the remote node... (on host duvel)
/usr/mogreet_distro/mysql-5.1.32-ndb-6.4.3/libexec/ndbd --defaults-file=/data/configs/my_cluster-6.4.3.cnf --ndb-nodeid=22 --initial
4) start Chimay's data node
/usr/mogreet_distro/mysql-5.1.32-ndb-6.4.3/libexec/ndbd --defaults-file=/data/configs/my_cluster-6.4.3.cnf --ndb-nodeid=21 --initial
5) do a 'show' in mgm; it shows the nodes starting up.
6) wait 4-6 seconds. Errors appear in mgm and in the node logs, and both ndb data nodes quit.
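The commands in steps 3 and 4 invoke ndbd, while the Program line in the error log above shows ndbmtd. To make the comparison between the two binaries easier to repeat, here is a small sketch that builds both command lines from the paths used in this report. The helper name `data_node_command` is hypothetical, and it assumes ndbmtd is started with the same options as ndbd in the steps above. It only builds the argument list; it does not start anything.

```python
# Paths copied from the steps and the error log in this report.
BASE = "/usr/mogreet_distro/mysql-5.1.32-ndb-6.4.3"
DEFAULTS_FILE = "/data/configs/my_cluster-6.4.3.cnf"


def data_node_command(node_id, multithreaded=False, initial=True):
    """Build the argv list for one data node, single- or multi-threaded."""
    binary = "ndbmtd" if multithreaded else "ndbd"
    argv = [
        f"{BASE}/libexec/{binary}",
        f"--defaults-file={DEFAULTS_FILE}",  # kept first, as in the steps above
        f"--ndb-nodeid={node_id}",
    ]
    if initial:
        argv.append("--initial")
    return argv


# Same node order as steps 3 and 4: remote node 22 first, then node 21.
for node_id in (22, 21):
    print(" ".join(data_node_command(node_id, multithreaded=True)))
```

Passing the resulting list to something like subprocess.run on each host would be one way to run it, but that part is left out here on purpose.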
Suggested fix:
Let me know if you need the trace logs.
It's not a huge deal as long as I can get the single-threaded version running.
Let me know if you would like me to experiment further with the heartbeat timeouts and/or reduce the memory allocated (it's huge, at 10 GB).
Thanks!
ATR