MySQL Bugs: #33951: SQL/API nodes crash or hang repeatedly when my.cnf contains same server-id value

Bug #33951	SQL/API nodes crash or hang repeatedly when my.cnf contains same server-id value
Submitted:	21 Jan 2008 0:38	Modified:	31 Oct 2008 19:10
Reporter:	Jason Brooke	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	5.0.45, 5.1.22rc	OS:	Linux (RHEL 5.1)
Assigned to:	Don Kehn	CPU Architecture:	Any
Tags:	same server-id crash hang

Description:
Connecting more than one SQL node to a cluster with the same server-id value in my.cnf as another SQL node can result in:

- SQL nodes crashing randomly up to several times a day
- an SQL node getting hung up on some queries until kill -9'd

Cluster itself (data nodes) seem unaffected and function perfectly.

Clearly, this is a misconfiguration problem and a silly one at that, but I figured I'd report a bug anyway as it may be desirable that MySQL more gracefully handle this misconfiguration.

How to repeat:
This is my scenario, but some of this may or may not be required to repeat the issue:

2 data nodes
2 management nodes, number of replicas set to 2
2 SQL nodes on same hosts as management nodes - both with server-id=1 in my.cnf

Start cluster, connect SQL nodes and begin issuing queries to both SQL nodes - my sql nodes for example reported about 50 queries per second avg in 'status'. It doesn't seem to matter what sort of queries are used, there was nothing consistent about the query that appeared top of the list in execution time when one of the sql nodes hung.

Suggested fix:
If possible, modify behaviour so that when a second SQL node joins a cluster and has the same server-id value as an existing SQL node, it disconnects from the cluster and logs a message about the misconfiguration.

Thank you for the report.

If I understood you correct you say same server id leads one of API nodes to hang? Please confirm or reject.

That's correct Sveta. There seems to be two symptoms:

- both API nodes randomly crashing
- both API nodes get hung queries in processlist, which prevents most subsequent queries from completing, even against other tables

I haven't been able to find the exact condition that triggers these symptoms, I only that once I correct the misconfiguration, it all runs beautifully.

Thank you for the feedback.

Status was changed to "Verified" by mistake. I could not repeat hang with test data. Could you please describe which type of queries "hang" in your case? How much time servers run smoothly before hang?

The amount of time that would pass before a crash or hung queries seemed inconsistent - sometimes they'd go for a few days, other times just a few minutes. Meanwhile, the queries that were being run were always quite consistent, and it's a single application using the database. I'm attaching some files with some info about queries etc.

Can't repeat with at this point with 5.1.29rc, will require a stack trace of the crash in order to determine exactly where the problem occurs. Note: After testing with 6.1, 6.2, 6.3, & 5.1.29rc have not seen this issue.