Description:
In a cluster of four data nodes, one of the nodes encountered a serious ID assignment problem when rejoining the cluster.
How to repeat:
The log file is as follow:
```2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): index id 121 rebuild done
2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): index id 123 rebuild done
2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): index id 128 rebuild done
2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): index id 129 rebuild done
2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): index id 131 rebuild done
2024-10-08 16:28:10 [ndbd] INFO -- LDM(0): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
2024-10-08 16:28:10 [ndbd] INFO -- Start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO -- Phase 4 restored all fragments from local disk up to a consistent global checkpoint id
2024-10-08 16:28:10 [ndbd] INFO -- Grant nodes to start phase: 5, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:10 [ndbd] INFO -- Start NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO -- NDB start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO -- As master we will wait for other nodes to reach the state waitpoint52 as well
2024-10-08 16:28:10 [ndbd] INFO -- Node 3 have reached completion of NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO -- Start NDB start phase 5 (only to DBDIH)
2024-10-08 16:28:13 [ndbd] INFO -- NDB start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO -- Start NDB start phase 6
2024-10-08 16:28:13 [ndbd] INFO -- NDB start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO -- Start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO -- Phase 5 waited for local checkpoint to complete
2024-10-08 16:28:13 [ndbd] INFO -- Grant nodes to start phase: 6, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO -- Start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO -- Phase 6 updated blocks about that we've now reached the started state.
2024-10-08 16:28:13 [ndbd] INFO -- Grant nodes to start phase: 7, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO -- President restarts arbitration thread [state=1]
2024-10-08 16:28:13 [ndbd] INFO -- Activating bucket 0 in SUMA
2024-10-08 16:28:13 [ndbd] INFO -- Activating bucket 1 in SUMA
2024-10-08 16:28:13 [ndbd] INFO -- Start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO -- Phase 7 mainly activated the asynchronous change events process, and some other background processes
2024-10-08 16:28:13 [ndbd] INFO -- Grant nodes to start phase: 8, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO -- Start NDB start phase 7
2024-10-08 16:28:13 [ndbd] INFO -- Foreign Key enabling Starting
2024-10-08 16:28:13 [ndbd] INFO -- Foreign key enabling Completed
2024-10-08 16:28:13 [ndbd] INFO -- NDB start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO -- Start phase 8 completed
2024-10-08 16:28:13 [ndbd] INFO -- Phase 8 enabled foreign keys and waited forall nodes to complete start up to this point
2024-10-08 16:28:13 [ndbd] INFO -- Grant nodes to start phase: 9, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO -- Started arbitrator node 1 [ticket=58da00013576f9ad]
2024-10-08 16:28:15 [ndbd] INFO -- Restart complete, updated local sysfile
2024-10-08 16:28:15 [ndbd] INFO -- Start phase 9 completed
2024-10-08 16:28:15 [ndbd] INFO -- Phase 9 enabled APIs to start connecting
2024-10-08 16:28:15 [ndbd] INFO -- Grant nodes to start phase: 10, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO -- Start phase 50 completed
2024-10-08 16:28:15 [ndbd] INFO -- Grant nodes to start phase: 51, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO -- Start phase 101 completed
2024-10-08 16:28:15 [ndbd] INFO -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2024-10-08 16:28:15 [ndbd] INFO -- Grant nodes to start phase: 102, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO -- Node started
2024-10-08 16:28:15 [ndbd] INFO -- Node 2 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO -- Node 3 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO -- Start node: 4 using node restart (node not restorable on its own)
2024-10-08 16:28:15 [ndbd] INFO -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:28:15 [ndbd] INFO -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:28:16 [ndbd] INFO -- Alloc node id for node 6 succeeded
2024-10-08 16:28:19 [ndbd] INFO -- Start node: 5 using node restart (node not restorable on its own)
2024-10-08 16:28:33 [ndbd] INFO -- granting SumaStartMe dict lock to 4
2024-10-08 16:28:33 [ndbd] INFO -- clearing SumaStartMe dict lock for 4
2024-10-08 16:28:36 [ndbd] INFO -- granting SumaHandover dict lock to 4
2024-10-08 16:28:36 [ndbd] INFO -- prepare to handover bucket: 1
2024-10-08 16:28:36 [ndbd] INFO -- Allocate event buffering page chunk in SUMA, 16 pages, first page ref = 7094
2024-10-08 16:28:41 [ndbd] INFO -- granting SumaStartMe dict lock to 5
2024-10-08 16:28:41 [ndbd] INFO -- clearing SumaStartMe dict lock for 5
2024-10-08 16:28:41 [ndbd] INFO -- 5782186/0 (5782185/4294967295) switchover complete bucket 1 state: 2
2024-10-08 16:28:41 [ndbd] INFO -- handover
2024-10-08 16:28:41 [ndbd] INFO -- clearing SumaHandover dict lock for 4
2024-10-08 16:28:41 [ndbd] INFO -- Node 4 has completed its restart
2024-10-08 16:28:44 [ndbd] INFO -- granting SumaHandover dict lock to 5
2024-10-08 16:28:49 [ndbd] INFO -- clearing SumaHandover dict lock for 5
2024-10-08 16:28:49 [ndbd] INFO -- Node 5 has completed its restart
2024-10-08 16:52:24 [ndbd] INFO -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO -- Alloc node id for node 6 succeeded
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 129, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 131, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 128, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 132, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 130, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO -- DROP_TAB_REQ: tab: 127, tabLcpStatus: 3
```
I think the root cause is:
2024-10-08 16:52:24 [ndbd] INFO -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO -- Alloc node id for node 6 succeeded
However, the ndb6 is the SQL server node:
root@25e7d3ef1189 /h/m/output_v32-6# ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @192.172.10.9 (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0, *)
id=3 @192.172.10.10 (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)
id=4 @192.172.10.11 (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0)
id=5 @192.172.10.12 (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @192.172.10.8 (mysql-8.0.35 ndb-8.0.35)
[mysqld(API)] 4 node(s)
id=6 @192.172.10.9 (mysql-8.0.35 ndb-8.0.35)
id=7 @192.172.10.10 (mysql-8.0.35 ndb-8.0.35)
id=8 @192.172.10.11 (mysql-8.0.35 ndb-8.0.35)
id=9 @192.172.10.12 (mysql-8.0.35 ndb-8.0.35)
Only binlogs:
-rw-r----- 1 root root 7.5K Oct 8 23:20 binlog.000001
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000002
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000003
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000004
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000005
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000006
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000007
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000008
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000009
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000010
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000011
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000012
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000013
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000014
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000015
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000016
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000017
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000018
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000019
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000020
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000021
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000022
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000023
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000024
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000025
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000026
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000027
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000028
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000029
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000030
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000031
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000032
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000033
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000034
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000035
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000036
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000037
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000038
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000039
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000040
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000041
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000042
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000043
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000044
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000045
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000046
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000047
-rw-r----- 1 root root 157 Oct 8 23:20 binlog.000048
-rw-r----- 1 root root 768 Oct 8 23:20 binlog.index
Do you need these files?
Suggested fix:
Is there a problem with the join process for sql nodes? It seems to be related to the granting and clearing of dictionary locks, and also to suma.