MySQL Bugs: #116351: Error assigning node ID during synchronization of mysql NDB cluster

Bug #116351	Error assigning node ID during synchronization of mysql NDB cluster
Submitted:	14 Oct 2024 1:22	Modified:	18 Dec 2024 14:53
Reporter:	CunDi Fang	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.0.35-cluster MySQL Cluster Community S	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In a cluster of four data nodes, one of the nodes encountered a serious ID assignment problem when rejoining the cluster.

How to repeat:
The log file is as follow:
```2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 121 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 123 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 128 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 129 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): index id 131 rebuild done
2024-10-08 16:28:10 [ndbd] INFO     -- LDM(0): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
2024-10-08 16:28:10 [ndbd] INFO     -- Start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO     -- Phase 4 restored all fragments from local disk up to a consistent global checkpoint id
2024-10-08 16:28:10 [ndbd] INFO     -- Grant nodes to start phase: 5, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:10 [ndbd] INFO     -- Start NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO     -- NDB start phase 4 completed
2024-10-08 16:28:10 [ndbd] INFO     -- As master we will wait for other nodes to reach the state waitpoint52 as well
2024-10-08 16:28:10 [ndbd] INFO     -- Node 3 have reached completion of NDB start phase 4
2024-10-08 16:28:10 [ndbd] INFO     -- Start NDB start phase 5 (only to DBDIH)
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start NDB start phase 6
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 5 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 5 waited for local checkpoint to complete
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 6, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 6 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 6 updated blocks about that we've now reached the started state.
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 7, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- President restarts arbitration thread [state=1]
2024-10-08 16:28:13 [ndbd] INFO     -- Activating bucket 0 in SUMA
2024-10-08 16:28:13 [ndbd] INFO     -- Activating bucket 1 in SUMA
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 7 mainly activated the asynchronous change events process, and some other background processes
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 8, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Start NDB start phase 7
2024-10-08 16:28:13 [ndbd] INFO     -- Foreign Key enabling Starting
2024-10-08 16:28:13 [ndbd] INFO     -- Foreign key enabling Completed
2024-10-08 16:28:13 [ndbd] INFO     -- NDB start phase 7 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Start phase 8 completed
2024-10-08 16:28:13 [ndbd] INFO     -- Phase 8 enabled foreign keys and waited forall nodes to complete start up to this point
2024-10-08 16:28:13 [ndbd] INFO     -- Grant nodes to start phase: 9, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:13 [ndbd] INFO     -- Started arbitrator node 1 [ticket=58da00013576f9ad]
2024-10-08 16:28:15 [ndbd] INFO     -- Restart complete, updated local sysfile
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 9 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Phase 9 enabled APIs to start connecting
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 10, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 50 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 51, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Start phase 101 completed
2024-10-08 16:28:15 [ndbd] INFO     -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2024-10-08 16:28:15 [ndbd] INFO     -- Grant nodes to start phase: 102, nodes: 000000000000000000000000000000000000000c
2024-10-08 16:28:15 [ndbd] INFO     -- Node started
2024-10-08 16:28:15 [ndbd] INFO     -- Node 2 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO     -- Node 3 has completed its restart
2024-10-08 16:28:15 [ndbd] INFO     -- Start node: 4 using node restart (node not restorable on its own)
2024-10-08 16:28:15 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:28:15 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:28:16 [ndbd] INFO     -- Alloc node id for node 6 succeeded
2024-10-08 16:28:19 [ndbd] INFO     -- Start node: 5 using node restart (node not restorable on its own)
2024-10-08 16:28:33 [ndbd] INFO     -- granting SumaStartMe dict lock to 4
2024-10-08 16:28:33 [ndbd] INFO     -- clearing SumaStartMe dict lock for 4
2024-10-08 16:28:36 [ndbd] INFO     -- granting SumaHandover dict lock to 4
2024-10-08 16:28:36 [ndbd] INFO     -- prepare to handover bucket: 1
2024-10-08 16:28:36 [ndbd] INFO     -- Allocate event buffering page chunk in SUMA, 16 pages, first page ref = 7094
2024-10-08 16:28:41 [ndbd] INFO     -- granting SumaStartMe dict lock to 5
2024-10-08 16:28:41 [ndbd] INFO     -- clearing SumaStartMe dict lock for 5
2024-10-08 16:28:41 [ndbd] INFO     -- 5782186/0 (5782185/4294967295) switchover complete bucket 1 state: 2
2024-10-08 16:28:41 [ndbd] INFO     -- handover
2024-10-08 16:28:41 [ndbd] INFO     -- clearing SumaHandover dict lock for 4
2024-10-08 16:28:41 [ndbd] INFO     -- Node 4 has completed its restart
2024-10-08 16:28:44 [ndbd] INFO     -- granting SumaHandover dict lock to 5
2024-10-08 16:28:49 [ndbd] INFO     -- clearing SumaHandover dict lock for 5
2024-10-08 16:28:49 [ndbd] INFO     -- Node 5 has completed its restart
2024-10-08 16:52:24 [ndbd] INFO     -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO     -- Alloc node id for node 6 succeeded
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 129, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 131, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 128, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 132, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 130, tabLcpStatus: 3
2024-10-08 16:53:05 [ndbd] INFO     -- DROP_TAB_REQ: tab: 127, tabLcpStatus: 3
```

I think the root cause is:
2024-10-08 16:52:24 [ndbd] INFO     -- Node 6 disconnected in state: 0
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:26 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc nodeid for node 6 failed,err: 1703
2024-10-08 16:52:27 [ndbd] INFO     -- Alloc node id for node 6 failed, err: 1703
2024-10-08 16:52:28 [ndbd] INFO     -- Alloc node id for node 6 succeeded

However, the ndb6 is the SQL server node:
root@25e7d3ef1189 /h/m/output_v32-6# ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     4 node(s)
id=2    @192.172.10.9  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0, *)
id=3    @192.172.10.10  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)
id=4    @192.172.10.11  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 0)
id=5    @192.172.10.12  (mysql-8.0.35 ndb-8.0.35, Nodegroup: 1)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.172.10.8  (mysql-8.0.35 ndb-8.0.35)

[mysqld(API)]   4 node(s)
id=6    @192.172.10.9  (mysql-8.0.35 ndb-8.0.35)
id=7    @192.172.10.10  (mysql-8.0.35 ndb-8.0.35)
id=8    @192.172.10.11  (mysql-8.0.35 ndb-8.0.35)
id=9    @192.172.10.12  (mysql-8.0.35 ndb-8.0.35)

Only binlogs:
-rw-r----- 1 root root 7.5K Oct  8 23:20 binlog.000001
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000002
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000003
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000004
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000005
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000006
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000007
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000008
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000009
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000010
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000011
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000012
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000013
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000014
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000015
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000016
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000017
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000018
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000019
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000020
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000021
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000022
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000023
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000024
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000025
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000026
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000027
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000028
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000029
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000030
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000031
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000032
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000033
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000034
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000035
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000036
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000037
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000038
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000039
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000040
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000041
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000042
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000043
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000044
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000045
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000046
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000047
-rw-r----- 1 root root  157 Oct  8 23:20 binlog.000048
-rw-r----- 1 root root  768 Oct  8 23:20 binlog.index

Do you need these files?

Suggested fix:
Is there a problem with the join process for sql nodes? It seems to be related to the granting and clearing of dictionary locks, and also to suma.

Hi,

Please upload full ndb_error_report and all the logs from SQL nodes.

Thanks

Please attach the zip with *all* log files. Use ndb_error_reporter script

https://dev.mysql.com/doc/refman/8.4/en/mysql-cluster-programs-ndb-error-reporter.html

Great, thanks.

Is this the same setup on improperly configured docker? It might be the same issue too. If it is, can you retest with properly configured docker?

Yes, this is same with 116349, I did not set the CAP_SYS_NICE:
```
        "HostConfig": {                                                                                                                                                                           
            "Binds": null,                                                                                                                                                                        
            "ContainerIDFile": "",                                                                                                                                                                
            "LogConfig": {                                                                                                                                                                        
                "Type": "json-file",                                                                                                                                                              
                "Config": {}                                                                                                                                                                      
            },                                                                                                                                                                                    
            "NetworkMode": "mysqlcluster_net2",                                                                                                                                                   
            "PortBindings": {},                                                                                                                                                                   
            "RestartPolicy": {                                                                                                                                                                    
                "Name": "no",                                                                                                                                                                     
                "MaximumRetryCount": 0                                                                                                                                                            
            },                                                                                                                                                                                    
            "AutoRemove": false,                                                                                                                                                                  
            "VolumeDriver": "",                                                                                                                                                                   
            "VolumesFrom": null,                                                                                                                                                                  
            "CapAdd": null,            //HERE                                                                                                                                                           
            "CapDrop": null,                                                                                                                                                                      
            "CgroupnsMode": "private",                                                                                                                                                            
            "Dns": [],                                                                                                                                                                            
            "DnsOptions": [],                                                                                                                                                                     
            "DnsSearch": [],                                                                                                                                                                      
            "ExtraHosts": null,  
```
Since I only keep log files, it's also difficult for me to reproduce a time series of strictly one. But I can confirm that I'm not involving thread allocation or Resource Groups or anything like that. I'm just interacting normally, so it's logical that this configuration item shouldn't affect it, right?

Documented fix as follows in the NDB 8.0.41, 8.4.4, and 9.2.0 changelogs:

    Errors of unknown provenance were logged while assigning node
    IDs during cluster synchronization, leading to user doubt and
    concern. Logging of the data node QMGR block and the ndb_mgmd
    process relating to node ID allocation issues has therefore been
    improved, to supply more information about what is being reported
    in such cases.

Closed.