Bug #51630 Internal error with shm transport
Submitted: 2 Mar 2010 9:43
Modified: 14 Apr 2010 13:51
Reporter: artem gorbyk
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: ndb-6.3.26, ndb-7.0.9
OS: Linux (2.6.18-164.el5)
Assigned to:
CPU Architecture: Any
Tags: ndb, shm

[2 Mar 2010 9:43] artem gorbyk
Description:
ndbd aborts with an internal error when trying to set up the SHM transport between the SQL and ndbd processes.

For 6.3.26 the error output looks like this:

Failed to ADD epollfd: 3 fd 1048576 node 4 to epoll-set, errno: 9 Bad file descriptor
2010-03-01 18:47:39 [ndbd] INFO     -- Received signal 6. Running error handler.
2010-03-01 18:47:39 [ndbd] INFO     -- Signal 6 received; Aborted
2010-03-01 18:47:39 [ndbd] INFO     -- main.cpp
2010-03-01 18:47:39 [ndbd] INFO     -- Error handler signal shutting down system
2010-03-01 18:47:41 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-03-01 18:47:41 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

For 7.0.9 it is slightly different:

Failed to ADD epollfd: 3 fd 27734 node 4 to epoll-set, errno: 9 Bad file descriptor
2010-03-01 17:50:02 [ndbd] INFO     -- Received signal 6. Running error handler.
2010-03-01 17:50:02 [ndbd] INFO     -- Signal 6 received; Aborted
2010-03-01 17:50:02 [ndbd] INFO     -- ndbd.cpp
2010-03-01 17:50:02 [ndbd] INFO     -- Error handler signal shutting down system
2010-03-01 17:50:02 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-03-01 17:50:02 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

There are no segfaults or other errors in /var/log/messages.
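
For reference, errno 9 is EBADF, which epoll_ctl() returns when the descriptor being added is not an open file descriptor. The sketch below reproduces only that symptom from Python on Linux; it is not the NDB transporter code, and the helper name try_epoll_add is made up for illustration.

```python
import errno
import select


def try_epoll_add(fd):
    """Try to add fd to a fresh epoll set (Linux only).

    Returns None on success, or the symbolic errno name on failure.
    Hypothetical helper, not part of NDB.
    """
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)  # epoll_ctl(EPOLL_CTL_ADD) underneath
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        ep.close()
```

Calling try_epoll_add(1048576) or try_epoll_add(27734) in a process that has no such descriptors open should report EBADF, matching the "errno: 9 Bad file descriptor" in both logs.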

How to repeat:
The [SHM] section of the ndb_mgmd config.ini file looks like this:
[SHM]
NodeId1=2
NodeId2=4
ShmKey=123
SigNum=10

Here 2 and 4 are the node IDs of the ndbd and SQL processes, respectively, both located on the same box.
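
To double-check which values the section actually carries, the fragment can be read back with Python's configparser. This is only a sketch: it assumes the [SHM] fragment is parsed on its own (a full config.ini that repeats section names such as [ndbd] would be rejected by configparser in its default strict mode), and read_shm_section is a hypothetical helper name.

```python
import configparser

# The [SHM] fragment from this report, parsed in isolation.
SHM_FRAGMENT = """\
[SHM]
NodeId1=2
NodeId2=4
ShmKey=123
SigNum=10
"""


def read_shm_section(text):
    """Return the SHM connection settings as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["SHM"]
    # Option names are matched case-insensitively by configparser.
    return {
        "node_ids": (section.getint("NodeId1"), section.getint("NodeId2")),
        "shm_key": section.getint("ShmKey"),
        "sig_num": section.getint("SigNum"),
    }
```

For the fragment above, read_shm_section(SHM_FRAGMENT) yields node IDs (2, 4), key 123 and signal number 10.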
The ndbd node (id=2) starts OK, joins the cluster, and accepts TCP connections from SQL/API nodes on other hosts.
When I then start the SQL node on the same box, after several seconds I get the above error and ndbd goes down.
The shm segment with the given ShmKey remains in the system with nattch=0, and I had to remove it with ipcrm.
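
A leftover segment like this can be spotted without eyeballing ipcs output. The sketch below parses text in the layout of /proc/sysvipc/shm and returns the shmids that match a key and have nattch=0. It assumes the file has a header row containing "key", "shmid" and "nattch" columns and that keys are printed in decimal there (ipcs shows them in hex); find_orphaned_segments is a hypothetical helper name.

```python
def find_orphaned_segments(proc_text, key):
    """Return shmids of segments with the given key and no attached processes.

    proc_text is expected to look like the contents of /proc/sysvipc/shm:
    a whitespace-separated header row followed by one row per segment.
    """
    lines = proc_text.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    orphans = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if int(row["key"]) == key and int(row["nattch"]) == 0:
            orphans.append(int(row["shmid"]))
    return orphans
```

Feeding it the contents of /proc/sysvipc/shm with key 123 gives the ids that could then be passed to "ipcrm -m <shmid>".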