Bug #15695 starting nodes hang in phase 2 forever on temporary network failure
Submitted: 13 Dec 2005 0:57 Modified: 11 Apr 2006 23:30
Reporter: Hartmut Holzgraefe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: 5.0.16 OS: Linux (linux, hp/ux)
Assigned to: Jonas Oreland CPU Architecture: Any

[13 Dec 2005 0:57] Hartmut Holzgraefe
Description:
On systems with multiple network interfaces, data nodes get stuck in startup phase 2
if the interface that connects them to the management server is working at node startup
while the interface interconnecting the data nodes is temporarily not working.

Even when the second interface becomes functional again later, the data nodes stay stuck
in phase 2.

How to repeat:
Use the following config file with one management node and two data nodes,
with a [TCP] section that interconnects the data nodes over a different
interface:

[ndbd default]
NoOfReplicas = 2

[ndb_mgmd]
Id = 1
HostName = 10.100.1.112

[ndbd]
Id = 2
HostName = 10.100.1.116

[ndbd]
Id = 3
HostName = 10.100.1.117

[mysqld]
[mysqld]

[tcp]
NodeId1 = 2
NodeId2 = 3
HostName1 = 10.100.9.7
HostName2 = 10.100.9.8
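
To make the two-network layout explicit, here is a minimal Python sketch that reads the config above and lists which node pairs have a [tcp] override. The parser is hand-rolled because section names such as [ndbd] repeat; the helper names parse_sections and tcp_overrides are made up for this illustration and are not part of any MySQL tool.

```python
SAMPLE = """
[ndbd default]
NoOfReplicas = 2

[ndb_mgmd]
Id = 1
HostName = 10.100.1.112

[ndbd]
Id = 2
HostName = 10.100.1.116

[ndbd]
Id = 3
HostName = 10.100.1.117

[mysqld]
[mysqld]

[tcp]
NodeId1 = 2
NodeId2 = 3
HostName1 = 10.100.9.7
HostName2 = 10.100.9.8
"""


def parse_sections(text):
    """Return a list of (section_name, options) pairs.

    Repeated section names are kept as separate entries instead of
    being merged, since each [ndbd] block describes a different node.
    """
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].strip().lower(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def tcp_overrides(sections):
    """Map (NodeId1, NodeId2) to (HostName1, HostName2) for each [tcp] block."""
    overrides = {}
    for name, opts in sections:
        if name == "tcp":
            pair = (int(opts["NodeId1"]), int(opts["NodeId2"]))
            overrides[pair] = (opts["HostName1"], opts["HostName2"])
    return overrides
```

With the sample above, tcp_overrides(parse_sections(SAMPLE)) reports that nodes 2 and 3 talk to each other over the 10.100.9.x addresses, while their HostName entries (used to reach the management server) sit on 10.100.1.x.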

Before starting the nodes, incoming traffic on the eth1 interface of node #3 was blocked using

iptables -A INPUT -i eth1 -j REJECT

This disables the data node interconnection specified in the [TCP] block.

On startup, both data nodes get stuck in phase 2.

After deleting the firewall rule with

iptables -D INPUT -i eth1 -j REJECT

the two data nodes get in touch with each other, as can be seen in the cluster log file:

2005-12-13 01:07:35 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2005-12-13 01:07:35 [MgmSrvr] INFO     -- Node 3: Node 2 Connected

but the data nodes are still stuck in startup phase 2 and do not continue startup,
even after more than half an hour ...
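
To check from the cluster log that the interconnect really came back in both directions, a small Python sketch like the following could be used. It assumes the "Node X: Node Y Connected" line format quoted above; the function names are invented for this example.

```python
import re

# Matches cluster log entries such as "Node 2: Node 3 Connected".
CONNECTED = re.compile(r"Node (\d+): Node (\d+) Connected")


def connected_pairs(log_text):
    """Return the set of (reporting_node, peer_node) pairs seen in the log."""
    return {(int(m.group(1)), int(m.group(2)))
            for m in CONNECTED.finditer(log_text)}


def link_up_both_ways(log_text, a, b):
    """True if node a reported b as connected and b reported a."""
    pairs = connected_pairs(log_text)
    return (a, b) in pairs and (b, a) in pairs
```

For the two log lines above, link_up_both_ways(log, 2, 3) is true, which is what makes the hang notable: the transport is connected again, yet startup does not progress.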
[21 Dec 2005 4:30] Stewart Smith
Not the case during SR (system restart).

(tested on localhost with  "iptables -A INPUT -i lo -p tcp -m tcp --dport 2202 -j REJECT" and appropriate port number)

note that there is a BIG difference between -j REJECT and -j DROP. REJECT is *not* network failure - it's an admin blocking access (there's an ICMP response saying "no, you can't connect"). DROP is just dropping the packet on the floor and totally ignoring it.
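
A rough way to see the difference from the connecting side is sketched below in Python. The expectation is that a REJECT rule usually makes a connect attempt fail right away with "connection refused", while a DROP rule makes it hang until the timeout; the exact error can vary with the reject type configured, so treat the mapping as an assumption. probe_tcp is a made-up helper name.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP connect attempt.

    Returns 'connected', 'refused' (peer or firewall answered with an
    active refusal), 'timeout' (no answer at all, as with silently
    dropped packets), or 'error' for anything else.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"
```

Running probe_tcp against the data node port (2202 in the localhost test above) with the REJECT rule and then with a DROP rule in place should show which of the two cases a given test actually exercises.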
[21 Dec 2005 5:44] Stewart Smith
Verified with BK tree. Workaround is to restart one of the nodes, then everything will start. Working out why this is needed now.
[12 Jan 2006 4:14] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/944
[30 Mar 2006 7:14] Jonas Oreland
Hijacking this bug...
[30 Mar 2006 12:27] Jonas Oreland
http://lists.mysql.com/commits/4321
[9 Apr 2006 21:13] Jonas Oreland
pushed into 5.1.10
[10 Apr 2006 8:16] Jonas Oreland
pushed into 5.0.21
[10 Apr 2006 10:59] Jonas Oreland
pushed into 4.1.19
[11 Apr 2006 23:30] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

I've documented the fix in the 4.1.19, 5.0.21, and 5.1.10 changelogs, and closed the bug.