Bug #33951 SQL/API nodes crash or hang repeatedly when my.cnf contains same server-id value
Submitted: 21 Jan 2008 0:38
Modified: 31 Oct 2008 19:10
Reporter: Jason Brooke
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.0.45, 5.1.22rc
OS: Linux (RHEL 5.1)
Assigned to: Don Kehn
CPU Architecture: Any
Tags: same server-id crash hang

[21 Jan 2008 0:38] Jason Brooke
Description:
Connecting an SQL node to a cluster with the same server-id value in my.cnf as another connected SQL node can result in:

- SQL nodes crashing randomly up to several times a day
- an SQL node getting hung up on some queries until kill -9'd

The cluster itself (data nodes) seems unaffected and functions perfectly.

Clearly, this is a misconfiguration problem and a silly one at that, but I figured I'd report a bug anyway as it may be desirable that MySQL more gracefully handle this misconfiguration.

How to repeat:
This is my scenario, but some of this may or may not be required to repeat the issue:

2 data nodes
2 management nodes, number of replicas set to 2
2 SQL nodes on same hosts as management nodes - both with server-id=1 in my.cnf
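
The misconfiguration in this setup can be spotted before the SQL nodes are started. The following is a minimal sketch, not part of the original report: it assumes the my.cnf contents of each SQL node have already been collected as text, and the function names and host labels (sql1, sql2) are hypothetical. It only looks at the [mysqld] section and ignores !include directives.

```python
import re
from collections import defaultdict


def read_server_id(cnf_text):
    """Return the server-id set in the [mysqld] section, or None if absent."""
    section = None
    server_id = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and !include / !includedir directives.
        if not line or line[0] in "#;!":
            continue
        header = re.match(r"\[(.+)\]$", line)
        if header:
            section = header.group(1).strip().lower()
            continue
        if section != "mysqld" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # MySQL accepts both server-id and server_id as the option name.
        if key.strip().lower().replace("_", "-") == "server-id":
            # Drop any trailing "# comment" before converting.
            server_id = int(value.split("#", 1)[0].strip())
    return server_id


def find_duplicate_server_ids(configs):
    """configs maps a host label to its my.cnf text.

    Returns {server_id: [hosts]} for every server-id used by more than one host.
    """
    by_id = defaultdict(list)
    for host, text in configs.items():
        sid = read_server_id(text)
        if sid is not None:
            by_id[sid].append(host)
    return {sid: sorted(hosts) for sid, hosts in by_id.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Hypothetical configs mirroring the scenario: both SQL nodes use server-id=1.
    configs = {
        "sql1": "[mysqld]\nndbcluster\nserver-id=1\n",
        "sql2": "[mysqld]\nndbcluster\nserver-id = 1\n",
    }
    for sid, hosts in find_duplicate_server_ids(configs).items():
        print(f"server-id {sid} is shared by: {', '.join(hosts)}")
```

An empty result means every SQL node that sets server-id uses a distinct value.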

Start the cluster, connect the SQL nodes, and begin issuing queries to both SQL nodes. My SQL nodes, for example, reported about 50 queries per second on average in 'status'. It doesn't seem to matter what sort of queries are used: there was nothing consistent about the query that appeared at the top of the list by execution time when one of the SQL nodes hung.

Suggested fix:
If possible, modify behaviour so that when a second SQL node joins a cluster and has the same server-id value as an existing SQL node, it disconnects from the cluster and logs a message about the misconfiguration.
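
The suggested behaviour can be illustrated with a toy model. This is only a sketch of the idea, not how MySQL Cluster tracks its API nodes: the ClusterMembership class, the DuplicateServerId exception, and the node names are all hypothetical.

```python
import logging

log = logging.getLogger("cluster")


class DuplicateServerId(Exception):
    """Raised when a joining SQL node reuses a server-id already in the cluster."""


class ClusterMembership:
    """Hypothetical registry of connected SQL nodes, keyed by server-id."""

    def __init__(self):
        self._owner_by_server_id = {}

    def join(self, node_name, server_id):
        owner = self._owner_by_server_id.get(server_id)
        if owner is not None and owner != node_name:
            # Refuse the join and log the misconfiguration instead of
            # letting both nodes run with the same server-id.
            message = (
                f"{node_name}: server-id {server_id} is already used by "
                f"{owner}; refusing to join the cluster"
            )
            log.error(message)
            raise DuplicateServerId(message)
        self._owner_by_server_id[server_id] = node_name

    def leave(self, node_name):
        # Release any server-id held by the departing node.
        self._owner_by_server_id = {
            sid: name
            for sid, name in self._owner_by_server_id.items()
            if name != node_name
        }


if __name__ == "__main__":
    members = ClusterMembership()
    members.join("sql1", 1)
    try:
        members.join("sql2", 1)
    except DuplicateServerId as exc:
        print(exc)
```

The first node to connect keeps its server-id; the later node is the one that is turned away, as the suggestion describes.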
[6 Feb 2008 14:03] Sveta Smirnova
Thank you for the report.

If I understood you correctly, you are saying that the same server-id causes one of the API nodes to hang? Please confirm or reject.
[6 Feb 2008 14:21] Jason Brooke
That's correct, Sveta. There seem to be two symptoms:

- both API nodes randomly crashing
- both API nodes get hung queries in processlist, which prevents most subsequent queries from completing, even against other tables

I haven't been able to find the exact condition that triggers these symptoms; I only know that once I correct the misconfiguration, it all runs beautifully.
[6 Feb 2008 15:04] Sveta Smirnova
Thank you for the feedback.

Status was changed to "Verified" by mistake. I could not repeat the hang with test data. Could you please describe which types of queries "hang" in your case? How long do the servers run smoothly before a hang?
[7 Feb 2008 23:09] Jason Brooke
The amount of time that would pass before a crash or hung queries seemed inconsistent - sometimes they'd go for a few days, other times just a few minutes. Meanwhile, the queries that were being run were always quite consistent, and it's a single application using the database. I'm attaching some files with some info about queries etc.
[31 Oct 2008 19:10] Don Kehn
Can't repeat at this point with 5.1.29rc; a stack trace of the crash will be required in order to determine exactly where the problem occurs. Note: after testing with 6.1, 6.2, 6.3, and 5.1.29rc, I have not seen this issue.