Bug #116351 Error assigning node ID during synchronization of mysql NDB cluster
Submitted: 14 Oct 1:22 Modified: 16 Oct 7:34
Reporter: CunDi Fang Email Updates:
Status: Need Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster/J Severity:S3 (Non-critical)
Version:8.0.35-cluster MySQL Cluster Community S OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[14 Oct 1:22] CunDi Fang
Description:
In a cluster of four data nodes, one of the nodes encountered a serious ID assignment problem when rejoining the cluster.

How to repeat:
The log file is as follow:
```2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 121 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 123 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 128 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 129 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 131 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
2024-10-08 16:28:10 [ndbd] INFO     -- Start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO     -- Phase 4 restored all fragments from local disk up to a consistent global checkpoint id
2024-10-08 16:28:10 [ndbd] INFO     -- Grant nodes to start phase: 5, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:10 [ndbd] INFO     -- Start NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO     -- NDB start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO     -- As master we will wait for other nodes to reach the state waitpoint52 as well
2024-10-08 16:28:10 [ndbd] INFO     -- Node 3 have reached completion of NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO     -- Start NDB start phase 5 (only to DBDIH)
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start NDB start phase 6
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 5 waited for local checkpoint to complete
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 6, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 6 updated blocks about that we've now reached the started state.
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 7, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- President restarts arbitration thread [state=1]
2024-10-08 16:28:13 [ndbd] INFO     -- Activating bucket 0 in SUMA
2024-10-08 16:28:13 [ndbd] INFO     -- Activating bucket 1 in SUMA
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 7 mainly activated the asynchronous change events process, and some other background processes
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 8, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Start NDB start phase 7
2024-10-08 16:28:13 [ndbd] INFO     -- Foreign Key enabling Starting
2024-10-08 16:28:13 [ndbd] INFO     -- Foreign key enabling Completed
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 8 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 8 enabled foreign keys and waited forall nodes to complete start up to this point
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 9, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Started arbitrator node 1 [ticket=58da00013576f9ad]
2024-10-08 16:28:15 [ndbd] INFO     -- Restart complete, updated local sysfile
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 9 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Phase 9 enabled APIs to start connecting
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 10, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 50 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 51, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 101 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 102, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Node started
2024-10-08 16:28:15 [ndbd] INFO     -- Node 2 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO     -- Node 3 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO     -- Start node: 4 using node restart (node not restorable on its own)
2024-10-08 16:28:15 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:28:15 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:28:16 [ndbd] INFO     -- Alloc node id for node 6 succeeded
2024-10-08 16:28:19 [ndbd] INFO     -- Start node: 5 using node restart (node not restorable on its own)
2024-10-08 16:28:33 [ndbd] INFO     -- granting SumaStartMe dict lock to 4
2024-10-08 16:28:33 [ndbd] INFO     -- clearing SumaStartMe dict lock for 4
2024-10-08 16:28:36 [ndbd] INFO     -- granting SumaHandover dict lock to 4
2024-10-08 16:28:36 [ndbd] INFO     -- prepare to handover bucket: 1
2024-10-08 16:28:36 [ndbd] INFO     -- Allocate event buffering page chunk in SUMA, 16 pages, first page ref = 7094
2024-10-08 16:28:41 [ndbd] INFO     -- granting SumaStartMe dict lock to 5
2024-10-08 16:28:41 [ndbd] INFO     -- clearing SumaStartMe dict lock for 5
2024-10-08 16:28:41 [ndbd] INFO     -- 5782186/0 (5782185/4294967295) switchover complete bucket 1 state: 2
2024-10-08 16:28:41 [ndbd] INFO     -- handover
2024-10-08 16:28:41 [ndbd] INFO     -- clearing SumaHandover dict lock for 4
2024-10-08 16:28:41 [ndbd] INFO     -- Node 4 has completed its restart
2024-10-08 16:28:44 [ndbd] INFO     -- granting SumaHandover dict lock to 5
2024-10-08 16:28:49 [ndbd] INFO     -- clearing SumaHandover dict lock for 5
2024-10-08 16:28:49 [ndbd] INFO     -- Node 5 has completed its restart
2024-10-08 16:52:24 [ndbd] INFO     -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO     -- Alloc node id for node 6 succeeded
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 129, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 131, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 128, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 132, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 130, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 127, tabLcpStatus: 3
```

I think the root cause is:
2024-10-08 16:52:24 [ndbd] INFO     -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO     -- Alloc node id for node 6 succeeded

However, the ndb6 is the SQL server node:
root@25e7d3ef1189 /h/m/output_v32-6# ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @192.172.10.9  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0, *)
id=3    @192.172.10.10  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)
id=4    @192.172.10.11  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0)
id=5    @192.172.10.12  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.172.10.8  (mysql-8.0.35 ndb-8.0.35)

[mysqld(API)]   4 node(s)
id=6    @192.172.10.9  (mysql-8.0.35 ndb-8.0.35)
id=7    @192.172.10.10  (mysql-8.0.35 ndb-8.0.35)
id=8    @192.172.10.11  (mysql-8.0.35 ndb-8.0.35)
id=9    @192.172.10.12  (mysql-8.0.35 ndb-8.0.35)

Only binlogs:
-rw-r----- 1 root root 7.5K Oct  8 23:20 binlog.000001
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000002
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000003
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000004
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000005
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000006
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000007
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000008
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000009
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000010
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000011
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000012
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000013
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000014
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000015
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000016
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000017
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000018
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000019
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000020
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000021
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000022
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000023
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000024
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000025
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000026
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000027
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000028
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000029
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000030
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000031
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000032
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000033
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000034
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000035
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000036
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000037
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000038
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000039
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000040
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000041
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000042
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000043
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000044
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000045
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000046
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000047
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000048
-rw-r----- 1 root root  768 Oct  8 23:20 binlog.index

Do you need these files?

Suggested fix:
Is there a problem with the join process for sql nodes? It seems to be related to the granting and clearing of dictionary locks, and also to suma.
[14 Oct 5:27] MySQL Verification Team
Hi,

Please upload full ndb_error_report and all the logs from SQL nodes.

Thanks
[16 Oct 4:13] MySQL Verification Team
Please attach the zip with *all* log files. Use ndb_error_reporter script

https://dev.mysql.com/doc/refman/8.4/en/mysql-cluster-programs-ndb-error-reporter.html
[16 Oct 7:34] MySQL Verification Team
Great, thanks.

Is this the same setup on improperly configured docker? It might be the same issue too. If it is, can you retest with properly configured docker?