Bug #106314 startChangeNeighbour problem
Submitted: 27 Jan 2022 19:16
Modified: 20 Feb 2024 18:12
Reporter: Mikael Ronström
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 8.0.28
OS: Any
Assigned to:
CPU Architecture: Any

[27 Jan 2022 19:16] Mikael Ronström
Description:
The NDB transporters handled in mt.cpp use two different concepts.
The first is for neighbour transporters, which carry signals between
nodes in the same node group. The second is for all other transporters.

The neighbour nodes are few and are checked individually. A neighbour node
that is ready to send is therefore not in any list; it is simply marked as
having data available.

The non-neighbour nodes are placed in a list of transporters ready to send.

This handling works fine, but with multi transporters it becomes problematic when
a transporter goes from non-neighbour to neighbour node or vice versa.

When a node goes from non-neighbour to neighbour, it must be removed from the
list of transporters with data available if it is currently in that list.

Similarly, if a neighbour has data available when it is moved to non-neighbour,
it must be placed in the list unless a send is already being expedited.
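
The two transitions described above can be sketched roughly as follows. This is a simplified, single-threaded model and not the code in mt.cpp: the names Transporter, SendState, ready_list, mark_data_available and change_neighbour are all hypothetical, locking is omitted, and "a send is already being expedited" is assumed to mean that a send is currently in progress for that transporter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model; none of these names come from mt.cpp.
struct Transporter {
  bool is_neighbour = false;     // connects to a node in the same node group
  bool data_available = false;   // has buffered signals waiting to be sent
  bool send_in_progress = false; // assumption: a send is already under way
};

struct SendState {
  std::vector<Transporter> trp;
  std::vector<std::size_t> ready_list; // non-neighbours that are ready to send

  explicit SendState(std::size_t n) : trp(n) {}

  bool in_list(std::size_t id) const {
    return std::find(ready_list.begin(), ready_list.end(), id) !=
           ready_list.end();
  }

  // Neighbours are only flagged; non-neighbours are also queued in the list.
  void mark_data_available(std::size_t id) {
    assert(id < trp.size());
    Transporter &t = trp[id];
    t.data_available = true;
    if (!t.is_neighbour && !t.send_in_progress && !in_list(id))
      ready_list.push_back(id);
  }

  // Keep the list consistent when the neighbour status changes.
  void change_neighbour(std::size_t id, bool now_neighbour) {
    assert(id < trp.size());
    Transporter &t = trp[id];
    if (t.is_neighbour == now_neighbour) return;
    t.is_neighbour = now_neighbour;
    if (now_neighbour) {
      // Non-neighbour -> neighbour: drop any list entry, keep the flag.
      ready_list.erase(std::remove(ready_list.begin(), ready_list.end(), id),
                       ready_list.end());
    } else if (t.data_available && !t.send_in_progress && !in_list(id)) {
      // Neighbour -> non-neighbour: pending data must now be queued.
      ready_list.push_back(id);
    }
  }
};
```

In this model, a transporter that has data available and becomes a neighbour leaves ready_list but keeps its data_available flag, so the individual neighbour check still finds it. Moving it back re-queues it unless a send is already in progress.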

How to repeat:
testNodeRestart -n MultiSocketRestart T1
Run it a few times on a cluster with 4 nodes and 4 replicas.

Suggested fix:
Remove the transporter from the list, or insert it into the list, as required when its neighbour status changes.
[28 Jan 2022 14:20] MySQL Verification Team
Hi Mikael,

Thanks for the report. Does this happen with 3 replicas too, or only with 4?

all best
Bogdan
[28 Jan 2022 17:01] Mikael Ronström
Can probably happen even with 2 replicas, but is quadratically more common with more replicas.
[20 Feb 2024 18:12] Jon Stephens
Documented fix as follows in the NDB 8.0.37 and 8.4.0 changelogs:

    NDB transporter handling in mt.cpp differentiated between
    neighbor transporters carrying signals between nodes in the same
    node group, and all other transporters. This sometimes led to
    issues with multiple transporters when a transporter connected
    nodes that were neighbors with nodes that were not.

Closed.