Description:
After a system reboot and a MySQL service update, I can no longer start ndbd.
How to repeat:
I performed a MySQL server update and then noticed that all the ndbd nodes were inaccessible. When I tried to restart them (this is a four-data-node cluster), the first three ndbd nodes started without problems, but as soon as the fourth one starts, all of them crash at the same time.
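The restart sequence was roughly as follows (a sketch only; the connect string and node-to-host mapping are placeholders, not my exact configuration):
```
# Management server (node 1) is already up.

# Starting the first three data nodes works fine
# (run on each data node host; connect string is a placeholder):
ndbd --ndb-connectstring=mgm_host:1186

# Starting the fourth data node is what brings all four data nodes
# down at the same time:
ndbd --ndb-connectstring=mgm_host:1186

# Cluster state checked from the management client:
ndb_mgm -e SHOW
```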
Here is the management node (ndb_mgmd) log:
```
2024-10-13 23:42:57 [MgmtSrvr] INFO -- Loaded config from '/usr/local/mysql/mysql-cluster/ndb_1_config.bin.1'
2024-10-13 23:42:58 [MgmtSrvr] INFO -- Node 1: Node 1 Connected
2024-10-13 23:42:58 [MgmtSrvr] INFO -- Id: 1, Command port: *:1186
2024-10-13 23:42:58 [MgmtSrvr] INFO -- MySQL Cluster Management Server mysql-8.0.35 ndb-8.0.35 started
2024-10-13 23:42:58 [MgmtSrvr] INFO -- Node 1 connected
2024-10-13 23:43:09 [MgmtSrvr] INFO -- Nodeid 2 allocated for NDB at 192.192.10.9
2024-10-13 23:43:09 [MgmtSrvr] INFO -- Node 1: Node 2 Connected
2024-10-13 23:43:10 [MgmtSrvr] INFO -- Node 2: Start phase 0 completed (system restart)
2024-10-13 23:43:10 [MgmtSrvr] INFO -- Node 2: Communication to Node 3 opened
2024-10-13 23:43:10 [MgmtSrvr] INFO -- Node 2: Communication to Node 4 opened
2024-10-13 23:43:10 [MgmtSrvr] INFO -- Node 2: Communication to Node 5 opened
2024-10-13 23:43:10 [MgmtSrvr] INFO -- Node 2: Waiting 30 sec for nodes 3, 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 no-wait: ]
2024-10-13 23:43:11 [MgmtSrvr] INFO -- Alloc node id 3 rejected, no new president yet
2024-10-13 23:43:11 [MgmtSrvr] INFO -- Nodeid 3 allocated for NDB at 192.192.10.10
2024-10-13 23:43:11 [MgmtSrvr] INFO -- Node 1: Node 3 Connected
2024-10-13 23:43:12 [MgmtSrvr] INFO -- Node 2: Node 3 Connected
2024-10-13 23:43:12 [MgmtSrvr] INFO -- Node 3: Node 2 Connected
2024-10-13 23:43:12 [MgmtSrvr] INFO -- Node 2: Waiting 28 sec for nodes 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 and 3 no-wait: ]
2024-10-13 23:43:12 [MgmtSrvr] INFO -- Alloc node id 4 rejected, no new president yet
2024-10-13 23:43:12 [MgmtSrvr] INFO -- Nodeid 4 allocated for NDB at 192.192.10.11
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 2: Waiting 27 sec for nodes 4 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2 and 3 no-wait: ]
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 1: Node 4 Connected
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Start phase 0 completed (system restart)
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Communication to Node 2 opened
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Communication to Node 3 opened
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Communication to Node 5 opened
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Initial start, waiting for 2, 3 and 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 4 no-wait: ]
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 2: Node 4 Connected
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 3: Node 4 Connected
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Node 2 Connected
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 4: Node 3 Connected
2024-10-13 23:43:13 [MgmtSrvr] INFO -- Node 2: Waiting 26 sec for nodes 5 to connect, nodes [ all: 2, 3, 4 and 5 connected: 2, 3 and 4 no-wait: ]
2024-10-13 23:43:14 [MgmtSrvr] INFO -- Alloc node id 5 rejected, no new president yet
2024-10-13 23:43:14 [MgmtSrvr] INFO -- Nodeid 5 allocated for NDB at 192.192.10.12
2024-10-13 23:43:14 [MgmtSrvr] INFO -- Node 1: Node 5 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Start phase 0 completed (system restart)
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Communication to Node 2 opened
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Communication to Node 3 opened
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Communication to Node 4 opened
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Initial start, waiting for 2, 3 and 4 to connect, nodes [ all: 2, 3, 4 and 5 connected: 5 no-wait: ]
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 2: Node 5 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 3: Node 5 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Node 2 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Node 3 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 5: Node 4 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 4: Node 5 Connected
2024-10-13 23:43:15 [MgmtSrvr] INFO -- Node 2 disconnected in recv with errnum: 104 in state: 0
2024-10-13 23:43:15 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected
2024-10-13 23:43:15 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2353: 'Insufficent nodes for system restart(Restart error). Temporary error, restart node'.
2024-10-13 23:43:16 [MgmtSrvr] INFO -- Node 3 disconnected in recv with errnum: 104 in state: 0
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2024-10-13 23:43:16 [MgmtSrvr] INFO -- Node 5 disconnected in recv with errnum: 104 in state: 0
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2024-10-13 23:43:16 [MgmtSrvr] INFO -- Node 4 disconnected in recv with errnum: 104 in state: 0
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 5: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
2024-10-13 23:43:16 [MgmtSrvr] ALERT -- Node 1: Node 5 Disconnected
```
Here is the error log from one ndbd host (node 2):
```
2024-10-13T23:40:05.895043Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:41:05.895182Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13 23:42:03 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 1006 Illegal reply from server line: 3613
2024-10-13T23:42:05.895326Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13 23:42:08 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 110 cmd: get connection parameter, error: Time out talking to management server, timeout: 5000 Error line: 597
2024-10-13 23:42:13 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 110 cmd: get connection parameter, error: Time out talking to management server, timeout: 5000 Error line: 597
2024-10-13 23:42:18 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 110 cmd: get connection parameter, error: Time out talking to management server, timeout: 5000 Error line: 597
2024-10-13 23:42:23 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 110 cmd: get connection parameter, error: Time out talking to management server, timeout: 5000 Error line: 597
2024-10-13 23:42:28 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 110 cmd: get connection parameter, error: Time out talking to management server, timeout: 5000 Error line: 597
2024-10-13 23:42:29 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 1006 Illegal reply from server line: 3613
2024-10-13T23:43:05.895473Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:44:05.895634Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:45:05.895813Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:46:05.895983Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:47:05.896147Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:48:05.896335Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:49:05.896501Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:50:05.896642Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:51:05.896787Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
2024-10-13T23:52:05.896968Z 0 [System] [MY-010866] [NDB] Metadata: Schema synchronization is ongoing, this iteration of metadata check is skipped
```
And here is ndb_2_error.log from data node 2:
```
Current byte-offset of file-pointer is: 5059
Time: Wednesday 25 September 2024 - 17:45:11
Status: Ndbd file system error, restart node initial
Message: File not found (Ndbd file system inconsistency error, please report a bug)
Error: 2815
Error data: DBDIH: File system open failed during FileRecord status 10. OS errno: 2
Error object: DBDIH (Line: 1427) 0x00000002
Program: ndbd
Pid: 74
Version: mysql-8.0.35 ndb-8.0.35
Trace file name: ndb_2_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.1 [t1..t1]
***EOM***
Time: Sunday 13 October 2024 - 23:39:08
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 7376) 0x00000002
Program: ndbd
Pid: 48
Version: mysql-8.0.35 ndb-8.0.35
Trace file name: ndb_2_trace.log.2
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.2 [t1..t1]
***EOM***
Time: Sunday 13 October 2024 - 23:40:18
Status: Temporary error, restart node
Message: Insufficent nodes for system restart (Restart error)
Error: 2353
Error data: Unable to start missing node group! starting: 000000000000000000000000000000000000003c (missing working fs for: 0000000000000000000000000000000000000038)
Error object: QMGR (Line: 2515) 0x00000002
Program: ndbd
Pid: 529
Version: mysql-8.0.35 ndb-8.0.35
Trace file name: ndb_2_trace.log.3
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.3 [t1..t1]
***EOM***
Time: Sunday 13 October 2024 - 23:40:48
Status: Temporary error, restart node
Message: Insufficent nodes for system restart (Restart error)
Error: 2353
Error data: Unable to start missing node group! starting: 000000000000000000000000000000000000003c (missing working fs for: 0000000000000000000000000000000000000038)
Error object: QMGR (Line: 2515) 0x00000002
Program: ndbd
Pid: 598
Version: mysql-8.0.35 ndb-8.0.35
Trace file name: ndb_2_trace.log.4
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.4 [t1..t1]
***EOM***
Time: Sunday 13 October 2024 - 23:43:15
Status: Temporary error, restart node
Message: Insufficent nodes for system restart (Restart error)
Error: 2353
Error data: Unable to start missing node group! starting: 000000000000000000000000000000000000003c (missing working fs for: 0000000000000000000000000000000000000038)
Error object: QMGR (Line: 2515) 0x00000002
Program: ndbd
Pid: 667
Version: mysql-8.0.35 ndb-8.0.35
Trace file name: ndb_2_trace.log.5
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.5 [t1..t1]
***EOM***
```
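If the node bitmasks in the error 2353 data are read with bit n standing for node ID n (my assumption), then 0x3c decodes to nodes 2, 3, 4 and 5 (the set being started) and 0x38 to nodes 3, 4 and 5, i.e. three of the four data nodes are reported as having no working NDB filesystem, which would line up with the earlier error 2815 ("File system open failed"). A minimal set of checks to confirm this (paths are illustrative, guessed from the trace-file path above; the real location is the DataDir/FileSystemPath from config.ini):
```
# Decode the node bitmasks from error 2353 (assuming bit n == node id n):
#   starting set        0x3c = 0b111100 -> nodes 2, 3, 4, 5
#   missing working fs  0x38 = 0b111000 -> nodes 3, 4, 5

# Cluster view from the management client:
ndb_mgm -e SHOW
ndb_mgm -e "ALL STATUS"

# On each data node host, check whether the NDB filesystem directory is
# still intact (a healthy one normally contains D1, D2, ... and LCP
# subdirectories):
ls -l /usr/local/mysql-cluster/data/ndb_2_fs/
```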
Suggested fix:
There seems to be a problem with the synchronization process that selects the master (president) node within the node group during the system restart.