Bug #116349 | Sync crash in NDB cluster | ||
---|---|---|---|
Submitted: | 14 Oct 2024 0:12 | Modified: | 24 Oct 2024 9:37 |
Reporter: | CunDi Fang | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 8.0.35-cluster MySQL Cluster Community S | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[14 Oct 2024 0:12]
CunDi Fang
[14 Oct 2024 5:24]
MySQL Verification Team
Hi, I do not understand what you did. Can you write in details how you managed to get your ndbcluster to this state. Thanks
[14 Oct 2024 9:02]
CunDi Fang
Okay, I reorganized the logs from that time and found that I was supposed to be trying to rejoin node 5 to the cluster, and then node 5 established communication in the cluster and joined successfully. However, immediately after that node 5 disconnected again. At the same time, I had sql transactions going on on both sql servers and found data inconsistencies. I will give you the rest of the log files next and hopefully that will help.
[14 Oct 2024 9:03]
CunDi Fang
This is the log file of the manager
Attachment: ndb_error_report_116349-log-manager.txt (text/plain), 12.48 KiB.
[14 Oct 2024 9:03]
CunDi Fang
This is the log file of the ndbd node 2
Attachment: ndb_error_report_116349-log-ndb2.txt (text/plain), 5.51 KiB.
[14 Oct 2024 9:04]
CunDi Fang
This is the log file of the ndbd node 3
Attachment: ndb_error_report_116349-log-ndb3.txt (text/plain), 3.72 KiB.
[14 Oct 2024 9:04]
CunDi Fang
This is the log file of the ndbd node 4
Attachment: ndb_error_report_116349-log-ndb4.txt (text/plain), 4.77 KiB.
[14 Oct 2024 9:05]
CunDi Fang
This is the log file of the ndbd node 5
Attachment: ndb_error_report_116349-log-ndb5.txt (text/plain), 3.47 KiB.
[14 Oct 2024 9:10]
CunDi Fang
Looking at the log file, it appears that a large number of "mbind: Operation not permitted" have been logged in the file even before this bug was encountered. There should be a problem with one of the locks, but when I traced it back, I realized that the log had already arrived: 2024-09-23T13:27:42.953769Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization 2024-09-23T13:28:42.962014Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization 2024-09-23T13:29:42.969867Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization 2024-09-23T13:30:42.257748Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:42.728303Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test0' for table 'mytest100.test0' already exists 2024-09-23T13:30:42.736237Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:42.952290Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test1' for table 'mytest100.test1' already exists 2024-09-23T13:30:42.958592Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:42.976104Z 0 [System] [MY-010866] [NDB] Metadata: Failed to submit table 'mysql.ndb_apply_status' for synchronization 2024-09-23T13:30:43.230317Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test2' for table 'mytest100.test2' already exists 2024-09-23T13:30:43.236269Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:43.532614Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test3' for table 'mytest100.test3' already exists 2024-09-23T13:30:43.539566Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:43.770974Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test4' for table 'mytest100.test4' already exists 2024-09-23T13:30:43.778368Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:44.142185Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test5' for table 'mytest100.test5' already exists 2024-09-23T13:30:44.156434Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:44.559589Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test6' for table 'mytest100.test6' already exists 2024-09-23T13:30:44.567982Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:44.871424Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test7' for table 'mytest100.test7' already exists 2024-09-23T13:30:44.925371Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:45.402160Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test8' for table 'mytest100.test8' already exists 2024-09-23T13:30:45.420683Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache 2024-09-23T13:30:45.796922Z 2 [System] [MY-010865] [NDB] Event 'REPL$mytest100/test9' for table 'mytest100.test9' already exists 2024-09-23T13:30:45.814623Z 2 [System] [MY-010866] [NDB] Binlog: Reloaded metadata cache mbind: Operation not permitted mbind: Operation not permitted mbind: Operation not permitted The affairs I was performing were: SELECT * FROM ndbinfo.table_replicas; SELECT COLUMN_NAME FROM INFORMATION_SCHEMA. KEY_COLUMN_USAGE WHERE TABLE_SCHEMA = 'mytest100' AND TABLE_NAME = 'test0' AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION; From the results, it looks like these two statements are executing properly on both sql servers
[16 Oct 2024 4:11]
MySQL Verification Team
Hi, I'm having issue reproducing this. Are you running this on bare metal or in VM or you are using docker?
[16 Oct 2024 4:42]
CunDi Fang
I am using docker. Five docker containers, one for manager, four for data and sql server. all are Ubuntu 22.04 environments. The configuration of the architecture is as follows: [NDBD DEFAULT] NoOfReplicas =2 DataMemory = 512M IndexMemory = 64M [NDB_MGMD] NodeId=1 hostname =192.168.10.8 datadir =/var/lib/mysql-cluster LogFilename=/var/lib/mysql-cluster/ndb_1_mgmd.log [NDBD] NodeId =2 hostname =192.168.10.9 datadir =/usr/local/mysql-cluster/data [NDBD] NodeId =3 hostname =192.168.10.10 datadir =/usr/local/mysql-cluster/data [mysqld] NodeId =4 hostname =192.168.10.9 [mysqld] NodeId =5 hostname =192.168.10.10
[16 Oct 2024 7:28]
MySQL Verification Team
Hi, Docker is not properly configured. If you look at your log you can see mbind: Operation not permitted Checkout CAP_SYS_NICE option for docker. Since docker is improperly configured MySQL cannot handle thread priorities and this can lead to problems. ( https://dev.mysql.com/doc/refman/8.0/en/resource-groups.html ) Can you retest with properly configured docker.
[16 Oct 2024 8:09]
CunDi Fang
I checked my docker configurations, and you are right, I did not set the CAP_SYS_NICE: ``` "HostConfig": { "Binds": null, "ContainerIDFile": "", "LogConfig": { "Type": "json-file", "Config": {} }, "NetworkMode": "mysqlcluster_net2", "PortBindings": {}, "RestartPolicy": { "Name": "no", "MaximumRetryCount": 0 }, "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null, "CapAdd": null, //HERE "CapDrop": null, "CgroupnsMode": "private", "Dns": [], "DnsOptions": [], "DnsSearch": [], "ExtraHosts": null, ``` However, I still do not understand. This configuration should be related to Resource Groups, a feature that optimizes the use of resources, and I have not been involved in modifying the scheduling of threads and CPU resource allocation transactions, just the normal select, insert, alter, and then carried out node join and restart and so on. Why did it cause the resource error? I'm sorry to say that since I only kept the log file at the time of this bug, it's difficult on my end to try to reproduce the bug in a strict time sequence now. But if, as you say, this bug and 116351 are caused by the same thing, then this type of bug is encountered frequently on my side.
[16 Oct 2024 8:10]
CunDi Fang
I checked my docker configurations, and you are right, I did not set the CAP_SYS_NICE: ``` "HostConfig": { "Binds": null, "ContainerIDFile": "", "LogConfig": { "Type": "json-file", "Config": {} }, "NetworkMode": "mysqlcluster_net2", "PortBindings": {}, "RestartPolicy": { "Name": "no", "MaximumRetryCount": 0 }, "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null, "CapAdd": null, //HERE "CapDrop": null, "CgroupnsMode": "private", "Dns": [], "DnsOptions": [], "DnsSearch": [], "ExtraHosts": null, ``` However, I still do not understand. This configuration should be related to Resource Groups, a feature that optimizes the use of resources, and I have not been involved in modifying the scheduling of threads and CPU resource allocation transactions, just the normal select, insert, alter, and then carried out node join and restart and so on. Why did it cause the resource error? I'm sorry to say that since I only kept the log file at the time of this bug, it's difficult on my end to try to reproduce the bug in a strict time sequence now. But if, as you say, this bug and 116351 are caused by the same thing, then this type of bug is encountered frequently on my side.
[16 Oct 2024 8:48]
MySQL Verification Team
Hi, > I have not been involved in modifying the scheduling of threads and CPU resource allocation transactions You did not but MySQL tried and failed as you can see in the log. > But if, as you say, this bug and 116351 are caused by the same thing, I cannot say for sure it is the same, but it could be so properly configuring docker is the first step, as I cannot reproduce this issue you are having (neither this nor 116351) so that is a best bet at the moment. > then this type of bug is encountered frequently on my side. So you can then easily see if reconfiguring docker solves the problem, if problem do not happen again after properly configuring docker - that was the issue, if it happens again, please collect logs again and update the report. Thanks
[16 Oct 2024 9:33]
CunDi Fang
Ok, I'll try from my side if this bug and 116351 will continue to appear after changing the docker configuration. I will get back to you within three days.
[19 Oct 2024 15:39]
CunDi Fang
Hi, after dozens of my attempts, I found that this one record should reflect the bug I reported, and this one was reproduced with CAP_SYS_NICE turned on. Also, I found bug116351 you confirmed as verified, may I ask if you are reproducing it successfully? Is there any more information needed from this side?
[24 Oct 2024 9:38]
MySQL Verification Team
We will update these reports if we need more data, thanks.