Bug #66295 mysql cluster environment (mysql-5.1.56 ndb-7.1.15) data node often shutdown and
Submitted: 10 Aug 2012 2:41
Modified: 4 Sep 2012 1:43
Reporter: Ananth Narayanamoorthy
Status: Open
Category: MySQL Cluster: NDB API
Severity: S1 (Critical)
Version: mysql-5.1.56 ndb-7.1.15
OS: Linux (2.6.16.60-0.21-smp - x86_64 x86_64 x86_64 GNU/Linux)
Assigned to:
CPU Architecture: Any

[10 Aug 2012 2:41] Ananth Narayanamoorthy
Description:
Hi

We are having an issue with our MySQL Cluster environment: the data nodes often crash and shut down, and the following error appears in the ndb_out.log files:

2012-08-09 18:41:34 [ndbd] INFO     -- Received signal 11. Running error handler.
2012-08-09 18:41:34 [ndbd] INFO     -- Signal 11 received; Segmentation fault
2012-08-09 18:41:34 [ndbd] INFO     -- ndbd.cpp
2012-08-09 18:41:34 [ndbd] INFO     -- Error handler signal restarting system
2012-08-09 18:41:34 [ndbd] INFO     -- Error handler shutdown completed - exiting
2012-08-09 18:41:34 [ndbd] ALERT    -- Node 2: Forced node shutdown completed, restarting. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-08-09 18:41:34 [ndbd] INFO     -- Ndb has terminated (pid 21995) restarting
2012-08-09 18:41:34 [ndbd] INFO     -- Angel reconnected to 'x.x.1.20:1186'

We see a similar error on Node 3:

2012-08-09 18:41:23 [ndbd] INFO     -- dbtup/DbtupRoutines.cpp
2012-08-09 18:41:23 [ndbd] INFO     -- DBTUP (Line: 669) 0x00000000
2012-08-09 18:41:23 [ndbd] INFO     -- Error handler restarting system
2012-08-09 18:41:23 [ndbd] INFO     -- Error handler shutdown completed - exiting
2012-08-09 18:41:23 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2012-08-09 18:41:23 [ndbd] INFO     -- Ndb has terminated (pid 16939) restarting
2012-08-09 18:41:23 [ndbd] INFO     -- Angel reconnected to 'x.x.1.20:1186'
2012-08-09 18:41:35 [ndbd] INFO     -- Angel reallocated nodeid: 3
2012-08-09 18:41:35 [ndbd] INFO     -- Angel pid: 16938 started child: 28194
2012-08-09 18:41:35 [ndbd] INFO     -- Configuration fetched from 'x.x.1.20:1186', generation: 1
NDBMT: non-mt
2012-08-09 18:41:35 [ndbd] INFO     -- NDB Cluster -- DB node 3
2012-08-09 18:41:35 [ndbd] INFO     -- mysql-5.1.56 ndb-7.1.15 --
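
When comparing crashes across nodes, it can help to pull only the forced-shutdown ALERT lines out of the logs. The following is a minimal sketch, assuming the log lines have exactly the layout shown above; the function name `forced_shutdowns` and the `SAMPLE` list are illustrative and not part of any MySQL tool.

```python
import re

# Matches the ALERT line that ndbd writes on a forced shutdown, in the
# layout seen in the excerpts above (assumed to be stable).
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed.*?"
    r"Caused by error (?P<code>\d+): '(?P<msg>[^(']*)"
)


def forced_shutdowns(lines):
    """Return (timestamp, node_id, error_code, short_message) per shutdown."""
    events = []
    for line in lines:
        m = SHUTDOWN_RE.match(line)
        if m:
            events.append(
                (m["ts"], int(m["node"]), int(m["code"]), m["msg"].strip())
            )
    return events


# Lines copied from the logs quoted in this report.
SAMPLE = [
    "2012-08-09 18:41:34 [ndbd] INFO     -- Received signal 11. Running error handler.",
    "2012-08-09 18:41:34 [ndbd] ALERT    -- Node 2: Forced node shutdown completed, "
    "restarting. Initiated by signal 11. Caused by error 6000: 'Error OS signal "
    "received(Internal error, programming error or missing error message, please "
    "report a bug). Temporary error, restart node'.",
    "2012-08-09 18:41:23 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. "
    "Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal "
    "error, programming error or missing error message, please report a bug). "
    "Temporary error, restart node'.",
]

if __name__ == "__main__":
    for event in forced_shutdowns(SAMPLE):
        print(event)
```

Run against the two excerpts, this yields one event for Node 2 (error 6000) and one for Node 3 (error 2341), which makes it easier to see that the two nodes failed for different reported reasons within seconds of each other.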

In the management console we see:

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from x.x.2.11)
id=3 (not connected, accepting connect from x.x.2.12)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @x.x.1.20  (mysql-5.1.56 ndb-7.1.15)

[mysqld(API)]   2 node(s)
id=4 (not connected, accepting connect from x.x.2.11)
id=5 (not connected, accepting connect from x.x.2.12)
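
For monitoring, the `show` output above can be scanned for nodes that are not connected. This is a rough sketch that assumes the text layout printed by this ndb_mgm version; `disconnected_nodes` is a hypothetical helper name, and the sample text is copied from the output above.

```python
import re

# Section headers look like "[ndbd(NDB)]     2 node(s)".
SECTION_RE = re.compile(r"^\[(\w+)\((\w+)\)\]")
# Disconnected nodes look like
# "id=2 (not connected, accepting connect from x.x.2.11)".
NOT_CONNECTED_RE = re.compile(
    r"^id=(\d+) \(not connected, accepting connect from ([^)]+)\)"
)


def disconnected_nodes(show_output):
    """Return (node_type, node_id, host) for every node not connected."""
    section = None
    result = []
    for raw in show_output.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(2)  # NDB, MGM or API
            continue
        node = NOT_CONNECTED_RE.match(line)
        if node:
            result.append((section, int(node.group(1)), node.group(2)))
    return result


SHOW_OUTPUT = """\
[ndbd(NDB)]     2 node(s)
id=2 (not connected, accepting connect from x.x.2.11)
id=3 (not connected, accepting connect from x.x.2.12)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @x.x.1.20  (mysql-5.1.56 ndb-7.1.15)

[mysqld(API)]   2 node(s)
id=4 (not connected, accepting connect from x.x.2.11)
id=5 (not connected, accepting connect from x.x.2.12)
"""

if __name__ == "__main__":
    for node in disconnected_nodes(SHOW_OUTPUT):
        print(node)
```

On the output above it reports both data nodes (2 and 3) and both API nodes (4 and 5) as disconnected, leaving only the management node up.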

Our configuration in config.ini is:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=2G
IndexMemory=200M
UndoIndexBuffer=64M
RedoBuffer=256M
TimeBetweenLocalCheckpoints=6
NoOfFragmentLogFiles=256
FragmentLogFileSize=16M

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]

[TCP DEFAULT]

# Section for the cluster management node
[NDB_MGMD]
# IP address of the management node (CI server - this one)
HostName=x.x.1.20
NodeId=1

# Section for the storage nodes
[NDBD]
# IP address of the first data node (APP01 server holding ndb1)
HostName=x.x.2.11
DataDir=/usr/local/mysqlc/cluster/ndb_data
NodeId=2
TransactionDeadlockDetectionTimeout=10000
StopOnError =false

[NDBD]
# IP address of the second storage node (APP02 server holding ndb2)
HostName=x.x.2.12
DataDir=/usr/local/mysqlc/cluster/ndb_data
NodeId=3
TransactionDeadlockDetectionTimeout=10000
StopOnError =false

# one [MYSQLD] per storage node
[MYSQLD]
HostName=x.x.2.11
NodeId=4
[MYSQLD]
HostName=x.x.2.12
NodeId=5
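
As a sanity check on the redo log settings above: as far as I understand the NDB documentation, redo log files are allocated in sets of four, so the redo log space per data node is NoOfFragmentLogFiles x 4 x FragmentLogFileSize. The sketch below only does that arithmetic; the helper names `parse_size` and `redo_log_bytes` are made up for illustration, and it assumes the K/M/G suffixes are powers of 1024.

```python
# Size suffixes as used in config.ini (assumed to be powers of 1024).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a config.ini size such as '16M' or '2G' to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)


def redo_log_bytes(no_of_fragment_log_files, fragment_log_file_size):
    """Redo log space per data node: sets of 4 files, each of the given size."""
    return no_of_fragment_log_files * 4 * parse_size(fragment_log_file_size)


if __name__ == "__main__":
    total = redo_log_bytes(256, "16M")  # values from [NDBD DEFAULT] above
    print(total // UNITS["G"], "GiB of redo log per data node")
```

With NoOfFragmentLogFiles=256 and FragmentLogFileSize=16M this comes to 16 GiB per data node, so the DataDir file system on each node needs at least that much free space for the redo log alone.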

Please let me know if you need more information.

Thanks and Regards
Ananth

How to repeat:
It recurs whenever we try to insert data.

Suggested fix:
As a workaround, we have to stop the data node, restart it with the --initial option, and then reload the database from a dump.
[11 Aug 2012 6:53] Ananth Narayanamoorthy
Can anyone please provide feedback on this bug?

Thanks 
Ananth
[4 Sep 2012 1:43] Ananth Narayanamoorthy
Can anyone please provide feedback on this bug?

This is affecting our production environment.

Regards
Ananth
[26 Sep 2012 20:05] Brian Hobson
I had the same (or very similar) issue occur in my lab today.  Restarting the failed ndb node did not work.  I had to restart the ndb node with 'initial' and wait for it to sync before it would join the cluster.  The ndb_4_error.log did not offer any additional error messages.