Bug #116349 Sync crash in NDB cluster
Submitted: 14 Oct 2024 0:12  Modified: 24 Oct 2024 9:37
Reporter: CunDi Fang  Email Updates:
Status: Verified  Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S3 (Non-critical)
Version: 8.0.35-cluster MySQL Cluster Community Server  OS: Any
Assigned to:  CPU Architecture: Any

[14 Oct 2024 0:12] CunDi Fang
Description:
I have a cluster with 4 data nodes and tried to remove a node that was involved in an ongoing synchronization while the sync was still running. This caused the cluster to crash and all ndbd data nodes went down.

How to repeat:
I have a four-data-node MySQL NDB Cluster. I removed node 5 while node 2 was synchronizing; probably because node 5 held data with a dependency on node 2, the synchronization ultimately failed and all ndbd processes terminated.
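
A minimal sketch of this scenario using the ndb_mgm management client (the exact commands were not recorded, so node IDs and ordering here are assumptions):

```
# Sketch: rejoin node 5, then remove it again while the subscription
# handover (Suma) / synchronization is still in progress.
ndb_mgm -e "5 RESTART"       # node 5 rejoins and starts re-synchronizing
ndb_mgm -e "ALL STATUS"      # wait until node 5 reports the handover/starting phase
ndb_mgm -e "5 STOP"          # remove node 5 mid-synchronization
ndb_mgm -e "SHOW"            # observe the remaining cluster state
# (in the original incident node 5 appears to have disconnected abruptly,
#  errnum 104, rather than via a graceful STOP)
```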

The management server log is as follows:
```
2024-10-08 08:26:15 [MgmtSrvr] INFO     -- Node 4: Node 8: API mysql-8.0.35 ndb-8.0.35
2024-10-08 08:26:15 [MgmtSrvr] INFO     -- Node 5: Node 8: API mysql-8.0.35 ndb-8.0.35
2024-10-08 08:26:15 [MgmtSrvr] INFO     -- Node 4: Suma: initiate handover for startup with nodes 0000000000000000000000000000000000000004 GCI: 5781886
2024-10-08 08:26:15 [MgmtSrvr] INFO     -- Node 4: Suma: handover from node 2 gci: 5781886 buckets: 00000002 (2)
2024-10-08 08:26:15 [MgmtSrvr] INFO     -- Node 2: DICT: index 89 stats auto-update starting
2024-10-08 08:26:18 [MgmtSrvr] INFO     -- Node 2: DICT: index 89 stats auto-update starting - Repeated 2 times
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 2: DICT: index 89 stats auto-update starting
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 4: Start phase 101 completed (node restart)
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 4: Started (mysql-8.0.35 ndb-8.0.35)
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 2: NR Status: node=4,OLD=Wait handover of subscriptions,NEW=Restart completed
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 2: DICT: unlocked by node 4 for NodeRestart
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 5: Suma: initiate handover for startup with nodes 0000000000000000000000000000000000000008 GCI: 5781889
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 5: Suma: handover from node 3 gci: 5781889 buckets: 00000002 (2)
2024-10-08 08:26:20 [MgmtSrvr] INFO     -- Node 2: DICT: index 89 stats auto-update starting
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: Cluster shutdown initiated
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 3: Cluster shutdown initiated
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 4: Cluster shutdown initiated
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 5 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:21 [MgmtSrvr] ALERT    -- Node 1: Node 5 Disconnected
2024-10-08 08:26:21 [MgmtSrvr] ALERT    -- Node 4: Node 5 Disconnected
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: Communication to Node 5 closed
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 3: Communication to Node 5 closed
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 4: Communication to Node 5 closed
2024-10-08 08:26:21 [MgmtSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: Removed lock for node 5
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: DICT: remove lock by failed node 5 for NodeRestart
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: DICT: unlocked by node 5 for NodeRestart
2024-10-08 08:26:21 [MgmtSrvr] ALERT    -- Node 2: Node 5 Disconnected
2024-10-08 08:26:21 [MgmtSrvr] ALERT    -- Node 3: Node 5 Disconnected
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 5: Node shutdown completed.
2024-10-08 08:26:25 [MgmtSrvr] INFO     -- Node 4: Communication to Node 5 opened
2024-10-08 08:26:25 [MgmtSrvr] INFO     -- Node 2: Communication to Node 5 opened
2024-10-08 08:26:25 [MgmtSrvr] INFO     -- Node 3: Communication to Node 5 opened
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 4 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 2: Node shutdown completed.
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 3 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 4: Node shutdown completed.
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Node 3: Node shutdown completed.
2024-10-08 08:26:28 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2024-10-08 08:26:28 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2024-10-08 08:26:28 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2024-10-08 08:26:28 [MgmtSrvr] INFO     -- Nodeid 2 allocated for NDB at 192.172.10.9
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Alloc node id 3 rejected, no new president yet
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Nodeid 3 allocated for NDB at 192.172.10.10
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 2: Start phase 0 completed (system restart)
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 2: Communication to Node 3 opened
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 2: Communication to Node 4 opened
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 2: Communication to Node 5 opened
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 2: Waiting 30 sec for nodes 3, 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 no-wait:  ]
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Node 1: Node 3 Connected
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Alloc node id 4 rejected, no new president yet
2024-10-08 08:26:29 [MgmtSrvr] INFO     -- Nodeid 4 allocated for NDB at 192.172.10.11
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 1: Node 4 Connected
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Start phase 0 completed (system restart)
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 opened
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Communication to Node 5 opened
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Waiting 30 sec for nodes 2, 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 3 no-wait:  ]
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Alloc node id 5 rejected, no new president yet
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Nodeid 5 allocated for NDB at 192.172.10.12
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3: Node 2 Connected
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 2: Node 3 Connected
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 2: Waiting 29 sec for nodes 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 and 3 no-wait:  ]
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 1: Node 5 Connected
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 5 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 5: Node shutdown completed.
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 4 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 3 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:30 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2024-10-08 08:26:30 [MgmtSrvr] ALERT    -- Node 1: Node 5 Disconnected
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 2: Node shutdown completed.
2024-10-08 08:26:30 [MgmtSrvr] INFO     -- Node 4: Node shutdown completed.
2024-10-08 08:26:31 [MgmtSrvr] INFO     -- Node 3: Node shutdown completed.
2024-10-08 08:26:31 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2024-10-08 08:26:31 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2024-10-08 08:26:31 [MgmtSrvr] INFO     -- Nodeid 2 allocated for NDB at 192.172.10.9
2024-10-08 08:26:31 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected
2024-10-08 08:26:31 [MgmtSrvr] INFO     -- Nodeid 3 allocated for NDB at 192.172.10.10
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Alloc node id 4 rejected, no new president yet
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Nodeid 4 allocated for NDB at 192.172.10.11
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 1: Node 3 Connected
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Buffering maximum epochs 100
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 1: Node 4 Connected
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Start phase 0 completed (system restart)
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 opened
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Communication to Node 5 opened
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Waiting 30 sec for nodes 2, 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 3 no-wait:  ]
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 2: Node 3 Connected
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 4 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 3: Node shutdown completed.
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 2: Node shutdown completed.
2024-10-08 08:26:32 [MgmtSrvr] INFO     -- Node 4: Node shutdown completed.
2024-10-08 08:26:33 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2024-10-08 08:26:33 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2024-10-08 08:26:33 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2024-10-08 08:26:33 [MgmtSrvr] INFO     -- Nodeid 2 allocated for NDB at 192.172.10.9
2024-10-08 08:26:33 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected
2024-10-08 08:26:33 [MgmtSrvr] INFO     -- Nodeid 3 allocated for NDB at 192.172.10.10
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 1: Node 3 Connected
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Start phase 0 completed (system restart)
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Communication to Node 3 opened
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Communication to Node 4 opened
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Communication to Node 5 opened
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Waiting 30 sec for nodes 3, 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 no-wait:  ]
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 3 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 3: Node shutdown completed.
2024-10-08 08:26:34 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Node 2: Node shutdown completed.
2024-10-08 08:26:34 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2024-10-08 08:26:34 [MgmtSrvr] INFO     -- Nodeid 2 allocated for NDB at 192.172.10.9
2024-10-08 08:26:35 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected
2024-10-08 08:26:35 [MgmtSrvr] INFO     -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-08 08:26:35 [MgmtSrvr] INFO     -- Node 2: Node shutdown completed.
2024-10-08 08:26:35 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2024-10-08 08:27:44 [MgmtSrvr] WARNING  -- Unable to allocate nodeid for API at 192.172.10.9. Returned error: 'Cluster not ready for nodeid allocation.'
2024-10-08 08:27:47 [MgmtSrvr] WARNING  -- Unable to allocate nodeid for API at 192.172.10.9. Returned error: 'Cluster not ready for nodeid allocation.'
2024-10-08 08:27:50 [MgmtSrvr] WARNING  -- Unable to allocate nodeid for API at 192.172.10.9. Returned error: 'Cluster not ready for nodeid allocation.'
2024-10-08 08:27:53 [MgmtSrvr] WARNING  -- Unable to allocate nodeid for API at 192.172.10.9. Returned error: 'Cluster not ready for nodeid allocation.'
```

From the log file, I found the following:
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: Removed lock for node 5
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: DICT: remove lock by failed node 5 for NodeRestart
2024-10-08 08:26:21 [MgmtSrvr] INFO     -- Node 2: DICT: unlocked by node 5 for NodeRestart

It seems that misuse of locks caused this synchronization failure.

However, I did not find any log from node 2 around 08:26:21; I only found the following in the log file of node 5:
2024-10-08 08:26:20 [ndbd] INFO     -- Node 4 has completed its restart
2024-10-08 08:26:20 [ndbd] INFO     -- Suma: handover from node 3 gci: 5781889 buckets: 00000002 (2)
2024-10-08 08:26:21 [ndbd] INFO     -- The data node run-time environment has been stopped
2024-10-08 08:26:21 [ndbd] INFO     -- Shutdown initiated
2024-10-08 08:26:21 [ndbd] INFO     -- Shutdown completed - exiting
2024-10-08 08:26:21 [ndbd] INFO     -- Angel shutting down
2024-10-08 08:26:21 [ndbd] INFO     -- Node 5: Node shutdown completed.

Node 2 only had some binlog entries around that time.

Suggested fix:
There seems to be a problem with the locking mechanism used during synchronization: it does not handle a node leaving the cluster while synchronization is in progress.
[14 Oct 2024 5:24] MySQL Verification Team
Hi,

I do not understand what you did. Can you describe in detail how you managed to get your NDB Cluster into this state?

Thanks
[14 Oct 2024 9:02] CunDi Fang
Okay, I went back over the logs from that time. I was trying to rejoin node 5 to the cluster; node 5 established communication with the cluster and joined successfully, but immediately afterwards it disconnected again. At the same time I had SQL transactions running on both SQL nodes and found data inconsistencies. I will attach the rest of the log files next and hope they help.
[14 Oct 2024 9:03] CunDi Fang
This is the log file of the manager

Attachment: ndb_error_report_116349-log-manager.txt (text/plain), 12.48 KiB.

[14 Oct 2024 9:03] CunDi Fang
This is the log file of the ndbd node 2

Attachment: ndb_error_report_116349-log-ndb2.txt (text/plain), 5.51 KiB.

[14 Oct 2024 9:04] CunDi Fang
This is the log file of the ndbd node 3

Attachment: ndb_error_report_116349-log-ndb3.txt (text/plain), 3.72 KiB.

[14 Oct 2024 9:04] CunDi Fang
This is the log file of the ndbd node 4

Attachment: ndb_error_report_116349-log-ndb4.txt (text/plain), 4.77 KiB.

[14 Oct 2024 9:05] CunDi Fang
This is the log file of the ndbd node 5

Attachment: ndb_error_report_116349-log-ndb5.txt (text/plain), 3.47 KiB.

[14 Oct 2024 9:10] CunDi Fang
Looking at the log file, it appears that a large number of "mbind: Operation not permitted" messages had been logged even before this bug was encountered. There may be a problem with one of the locks, but when I traced it back, I found that these entries had already appeared much earlier:
2024-09-23T13:27:42.953769Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization
2024-09-23T13:28:42.962014Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization
2024-09-23T13:29:42.969867Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization
2024-09-23T13:30:42.257748Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:42.728303Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test0' for table 'mytest100.test0' already exists
2024-09-23T13:30:42.736237Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:42.952290Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test1' for table 'mytest100.test1' already exists
2024-09-23T13:30:42.958592Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:42.976104Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization
2024-09-23T13:30:43.230317Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test2' for table 'mytest100.test2' already exists
2024-09-23T13:30:43.236269Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:43.532614Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test3' for table 'mytest100.test3' already exists
2024-09-23T13:30:43.539566Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:43.770974Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test4' for table 'mytest100.test4' already exists
2024-09-23T13:30:43.778368Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:44.142185Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test5' for table 'mytest100.test5' already exists
2024-09-23T13:30:44.156434Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:44.559589Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test6' for table 'mytest100.test6' already exists
2024-09-23T13:30:44.567982Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:44.871424Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test7' for table 'mytest100.test7' already exists
2024-09-23T13:30:44.925371Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:45.402160Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test8' for table 'mytest100.test8' already exists
2024-09-23T13:30:45.420683Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
2024-09-23T13:30:45.796922Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test9' for table 'mytest100.test9' already exists
2024-09-23T13:30:45.814623Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache
mbind: Operation not permitted
mbind: Operation not permitted
mbind: Operation not permitted

The statements I was executing were:
SELECT * FROM ndbinfo.table_replicas;
SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE TABLE_SCHEMA = 'mytest100' AND TABLE_NAME = 'test0' AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION;
From the results, both statements appear to execute properly on both SQL nodes.
[16 Oct 2024 4:11] MySQL Verification Team
Hi,

I'm having issues reproducing this. Are you running this on bare metal, in a VM, or using Docker?
[16 Oct 2024 4:42] CunDi Fang
I am using Docker: five containers, one for the manager and four for the data and SQL nodes, all running Ubuntu 22.04. The cluster configuration is as follows:
[NDBD DEFAULT]
NoOfReplicas =2
DataMemory = 512M
IndexMemory = 64M

[NDB_MGMD]
NodeId=1
hostname =192.168.10.8
datadir =/var/lib/mysql-cluster
LogFilename=/var/lib/mysql-cluster/ndb_1_mgmd.log

[NDBD]
NodeId =2
hostname =192.168.10.9
datadir =/usr/local/mysql-cluster/data
[NDBD]
NodeId =3
hostname =192.168.10.10
datadir =/usr/local/mysql-cluster/data

[mysqld]
NodeId =4
hostname =192.168.10.9
[mysqld]
NodeId =5
hostname =192.168.10.10
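
For context, a minimal sketch of how the nodes would typically be started against this config (binary invocations only; the config file path and exact options are assumptions):

```
# Management node (node 1):
ndb_mgmd -f /var/lib/mysql-cluster/config.ini --initial --ndb-nodeid=1

# Data nodes (nodes 2 and 3), pointing at the management server:
ndbd --ndb-connectstring=192.168.10.8

# SQL nodes (nodes 4 and 5):
mysqld --ndbcluster --ndb-connectstring=192.168.10.8
```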
[16 Oct 2024 7:28] MySQL Verification Team
Hi,

Docker is not properly configured. If you look at your log you can see

mbind: Operation not permitted

Check out the CAP_SYS_NICE option for Docker.

Since Docker is improperly configured, MySQL cannot handle thread priorities, and this can lead to problems ( https://dev.mysql.com/doc/refman/8.0/en/resource-groups.html ).

Can you retest with a properly configured Docker setup?
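
For reference, a minimal sketch of granting the capability when recreating the data node containers (the image, container name, and network below are placeholders, not taken from this report):

```
# Recreate each container with CAP_SYS_NICE so the NDB/MySQL processes can set
# thread priorities inside the container:
docker run -d --name ndb_data_2 \
  --cap-add=SYS_NICE \
  --network mysqlcluster_net \
  mysql/mysql-cluster:8.0 ndbd

# Verify that the capability was actually granted:
docker inspect ndb_data_2 --format '{{.HostConfig.CapAdd}}'   # expect [SYS_NICE]
```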
[16 Oct 2024 8:09] CunDi Fang
I checked my Docker configuration, and you are right, I did not set CAP_SYS_NICE:
```
        "HostConfig": {                                                                                                                                                                           
            "Binds": null,                                                                                                                                                                        
            "ContainerIDFile": "",                                                                                                                                                                
            "LogConfig": {                                                                                                                                                                        
                "Type": "json-file",                                                                                                                                                              
                "Config": {}                                                                                                                                                                      
            },                                                                                                                                                                                    
            "NetworkMode": "mysqlcluster_net2",                                                                                                                                                   
            "PortBindings": {},                                                                                                                                                                   
            "RestartPolicy": {                                                                                                                                                                    
                "Name": "no",                                                                                                                                                                     
                "MaximumRetryCount": 0                                                                                                                                                            
            },                                                                                                                                                                                    
            "AutoRemove": false,                                                                                                                                                                  
            "VolumeDriver": "",                                                                                                                                                                   
            "VolumesFrom": null,                                                                                                                                                                  
            "CapAdd": null,            //HERE                                                                                                                                                           
            "CapDrop": null,                                                                                                                                                                      
            "CgroupnsMode": "private",                                                                                                                                                            
            "Dns": [],                                                                                                                                                                            
            "DnsOptions": [],                                                                                                                                                                     
            "DnsSearch": [],                                                                                                                                                                      
            "ExtraHosts": null,  
```
However, I still do not understand. This setting should only be related to Resource Groups, a feature for optimizing resource usage, and I never modified thread scheduling or CPU resource allocation myself; I only ran normal SELECT, INSERT, and ALTER statements and performed node joins and restarts. Why would that cause this resource error?
I'm sorry to say that since I only kept the log files from the time of this bug, it is difficult for me to reproduce it in a strict time sequence now. But if, as you say, this bug and bug #116351 are caused by the same thing, then this type of bug is encountered frequently on my side.
[16 Oct 2024 8:48] MySQL Verification Team
Hi,

> I have not been involved in modifying the scheduling of threads and CPU resource allocation transactions

You did not, but MySQL tried and failed, as you can see in the log.

> But if, as you say, this bug and 116351 are caused by the same thing, 

I cannot say for sure it is the same, but it could be, so properly configuring Docker is the first step. I cannot reproduce the issue you are having (neither this one nor #116351), so that is the best bet at the moment.

> then this type of bug is encountered frequently on my side.

Then you can easily see whether reconfiguring Docker solves the problem. If the problem does not happen again after Docker is properly configured, that was the issue; if it happens again, please collect logs again and update the report.

Thanks
[16 Oct 2024 9:33] CunDi Fang
OK, I'll check on my side whether this bug and #116351 still appear after changing the Docker configuration. I will get back to you within three days.
[19 Oct 2024 15:39] CunDi Fang
Hi, after dozens of attempts, I found that this record does reflect the bug I reported, and it was reproduced with CAP_SYS_NICE enabled. Also, I see that you have set bug #116351 to verified; may I ask whether you reproduced it successfully? Is there any more information you need from my side?
[24 Oct 2024 9:38] MySQL Verification Team
We will update these reports if we need more data, thanks.