Bug #65938 ndbdmtd shutdown 1 min after started: Illegal signal ... (GSN 32 not added)
Submitted: 18 Jul 2012 13:56
Modified: 17 Feb 2013 17:24
Reporter: Jay Ward
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.5.22-ndb-7.2.6-gpl-log
OS: Linux (CentOS release 6.3 (Final) 2.6.32-279.1.1.el6.x86_64)
Assigned to: Assigned Account
CPU Architecture: Any
Tags: assertion failed, error 2301, GSN 32 not added, GSN_SCAN_TABREQ, interrupts, ndbdmtd, numa, send lock contentions, sendbufferpool lock contentions

[18 Jul 2012 13:56] Jay Ward
Description:
We were moving our NDB nodes to bigger hardware and taking them out of the virtualization software in which they had been running. After making changes to the config.ini and restarting both management nodes, I started ndbmtd on the first data node we were moving, using:

[root@ndb2 mysql]# /usr/local/mysql/bin/ndbmtd --ndb-nodeid=4 -c MGMNode1:1186,MGMNode2:1186 --initial

The node started successfully, and then shortly thereafter shut itself down:

674871/0 (674870/4294967295) switchover complete bucket 1 state: 1starting
2012-07-17 16:40:29 [ndbd] INFO     -- Start phase 101 completed
2012-07-17 16:40:29 [ndbd] INFO     -- Node started
send lock node 5 waiting for lock, contentions: 200 spins: 367555
send lock node 5 waiting for lock, contentions: 400 spins: 611990
send lock node 5 waiting for lock, contentions: 600 spins: 867841
... More lines like unto these ...
jbalock thr: 0 waiting for lock, contentions: 7800 spins: 6899476
... More lines like unto these ...
send lock node 15 waiting for lock, contentions: 2800 spins: 5477963
send lock node 15 waiting for lock, contentions: 3000 spins: 5744907
send lock node 15 waiting for lock, contentions: 3200 spins: 5997054
2012-07-17 16:40:52 [ndbd] INFO     -- Illegal signal received (GSN 32 not added)
2012-07-17 16:40:52 [ndbd] INFO     -- Illegal signal received (GSN 32 not added)
2012-07-17 16:40:52 [ndbd] INFO     -- Error handler shutting down system
2012-07-17 16:40:52 [ndbd] INFO     -- Error handler shutdown completed - exiting
2012-07-17 16:40:54 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2301: 'Assertion(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
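
For triage, the log excerpt above can be summarized mechanically. The following is a minimal Python sketch, not an NDB tool: the function name `summarize_log` and both regular expressions are assumptions derived only from the lines quoted in this report, so other ndbd log formats may not match. It counts "Illegal signal received" messages per GSN and records the highest contention count seen for each lock.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this report (an assumption;
# they are not taken from NDB source or documentation).
ILLEGAL_SIGNAL = re.compile(r"Illegal signal received \(GSN (\d+) not added\)")
LOCK_WAIT = re.compile(
    r"^(?P<lock>.+?) waiting for lock, "
    r"contentions: (?P<contentions>\d+) spins: (?P<spins>\d+)"
)


def summarize_log(lines):
    """Return (illegal-signal count per GSN, peak contentions per lock name)."""
    gsn_counts = Counter()
    peak_contentions = {}
    for line in lines:
        match = ILLEGAL_SIGNAL.search(line)
        if match:
            gsn_counts[int(match.group(1))] += 1
            continue
        match = LOCK_WAIT.match(line.strip())
        if match:
            name = match.group("lock")
            value = int(match.group("contentions"))
            # Keep only the largest contention count seen for this lock.
            peak_contentions[name] = max(value, peak_contentions.get(name, 0))
    return gsn_counts, peak_contentions


if __name__ == "__main__":
    sample = [
        "send lock node 5 waiting for lock, contentions: 200 spins: 367555",
        "send lock node 5 waiting for lock, contentions: 600 spins: 867841",
        "jbalock thr: 0 waiting for lock, contentions: 7800 spins: 6899476",
        "2012-07-17 16:40:52 [ndbd] INFO     -- Illegal signal received (GSN 32 not added)",
        "2012-07-17 16:40:52 [ndbd] INFO     -- Illegal signal received (GSN 32 not added)",
    ]
    gsn_counts, peaks = summarize_log(sample)
    print("Illegal signals by GSN:", dict(gsn_counts))
    print("Peak contentions by lock:", peaks)
```

To run it against a real node log, pass an open file object instead of the sample list; the log path depends on the installation and is not given in this report.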

I expected it to do one of the following:
    1. Not receive signal GSN 32 if it is inapplicable in that context, OR
    2. Handle signal GSN 32 properly and continue running, OR
    3. Report the actual cause of the problem so that it can be fixed.

How to repeat:
1. Create a cluster with two management nodes, two NDB data nodes, and two or more MySQL nodes.

2. Shut down one of the data nodes and use its IP address (or hostname) on a new data node.

3. Build the new data node on hardware like the profile I will upload along with the ndb_error_reporter output.

4. Copy the untarred Linux Generic directory to /usr/local/ and symlink it to /usr/local/mysql (since the configurations already point to that directory).

5. Start ndbmtd with --initial.

6. Wait for the node to enter started status.

7. Wait for the node to shut down.

Suggested fix:
I am not sure what the fix should be.
[18 Jul 2012 14:23] Jay Ward
Profile of hardware used.

Attachment: hardwareprofile.tar.gz (application/gzip, text), 6.62 KiB.

[18 Jul 2012 14:24] Jay Ward
Uploaded ndb_error_reporter output to ftp.oracle.com/support/incoming/bug-data-65938.tar.bz2
[22 Jul 2012 13:06] Jay Ward
This turned out to be caused by EL6's localhost entry in the /etc/hosts file. This entry worked:

127.0.0.1	localhost

while this one did not:

127.0.0.1	localhost.localdomain	localhost

NDB should be able to handle either.
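
To check which of the two forms a given host uses, one can look at the first name listed for 127.0.0.1, which by hosts-file convention is the canonical name, with any later names acting as aliases. The following is a minimal Python sketch; the function names `names_for_address` and `canonical_name` are hypothetical, and it only parses text handed to it. Whether the failure actually depends on the canonical name is an assumption, since this report does not establish the mechanism.

```python
def names_for_address(hosts_text, address="127.0.0.1"):
    """Return the host names listed for `address`, in file order."""
    names = []
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == address:
            names.extend(fields[1:])
    return names


def canonical_name(hosts_text, address="127.0.0.1"):
    """Return the first name listed for `address`, or None if absent."""
    names = names_for_address(hosts_text, address)
    return names[0] if names else None


if __name__ == "__main__":
    working = "127.0.0.1\tlocalhost\n"
    failing = "127.0.0.1\tlocalhost.localdomain\tlocalhost\n"
    print("working form ->", canonical_name(working))
    print("failing form ->", canonical_name(failing))
```

On a real machine the same check can be run with `canonical_name(open("/etc/hosts").read())`.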
[17 Jan 2013 17:19] Shahryar Ghazi
Hi Jay,

For some reason I am unable to access the file you uploaded to the FTP server. Please upload it again and also include any configuration files (e.g. config.ini, my.cnf) and OS network info (e.g. hosts files).

The potential issue appears to be related to network configuration, so I am assuming that the hardware configuration (mentioned in step 3 of "How to repeat" above) should not matter in this case. Please correct me if I am wrong.

Also, please explain step 2 of "How to repeat" in detail.

Thanks.
[18 Feb 2013 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".