Bug #88627 | MySQL Cluster is wiping database data during initialization of one ndb node | ||
---|---|---|---|
Submitted: | 23 Nov 2017 15:23 | Modified: | 18 Dec 2017 16:20 |
Reporter: | Zeljko Zuvic | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | ndb-7.4.7 | OS: | CentOS (6.6) |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
Tags: | wiping data |
[23 Nov 2017 15:23]
Zeljko Zuvic
[5 Dec 2017 11:41]
MySQL Verification Team
Hi,

This is very weird behavior. Can you tell me exactly how you start the nodes after the crash? You are not using MCM, right?

thanks
Bogdan
[5 Dec 2017 12:29]
Zeljko Zuvic
Hi Bogdan,

After a crash of one node of the MySQL cluster, we first initialize the NDB file system on that node:

../mysql/bin/ndbd --initial

Then, monitoring the ndb log, once start phase 101 has been reached we wait a few more minutes and start the mysql service on that node:

# service mysql start

After the service has started I check whether the DB content was successfully replicated from the healthy node, but apart from the DB name, all tables are wiped on both (in our case two) data nodes. During this process the healthy node and the management (arbitrator) node are up and running.

Please let me know if you need any further information from our side.

Thanks,
Zeljko
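[Editor's note] For reference, a hedged shell sketch of the sequence described above (the cluster log path and node id are assumptions, not taken from the reporter's setup):

# On the crashed data node: rebuild its NDB file system and rejoin the cluster
# (the management node and the healthy data node stay up throughout).
../mysql/bin/ndbd --initial

# Wait until the node reports start phase 101 in the cluster log before going on
# (ndb_1_cluster.log on the management host; node id 3 is only an example).
grep "Node 3: Start phase 101 completed" /var/lib/mysql-cluster/ndb_1_cluster.log

# Then start the SQL node on the recovered host.
service mysql start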
[5 Dec 2017 12:35]
MySQL Verification Team
Hi Zeljko,

So, you have one node running properly with data on it; you connect to mysqld, check the data and see it; then you start the second ndbd with --initial, and after it moves to the "running" state all your data is gone?! And you have seen the same behavior 2 times already, on 7.4.x?

thanks
Bogdan
[5 Dec 2017 12:52]
Zeljko Zuvic
Bogdan,

That is right, and it happened twice on our production MySQL cluster during an unplanned crash of one of the data nodes. But I have to say that I tried to simulate such behavior (by powering off the VM of one node) on our test environment with the exact same configuration and the same version of MySQL Cluster, but with reduced resources and a significantly smaller DB size, and everything worked perfectly without any loss.

Thanks,
Zeljko
[6 Dec 2017 15:35]
MySQL Verification Team
Hi,

Weird. I just tried a medium-sized setup (32G RAM per node) and was not able to reproduce the issue. Let me discuss this with colleagues before I get back to you; this really does not make much sense and it is scaring me.

all best
Bogdan
[6 Dec 2017 15:54]
Zeljko Zuvic
Hi Bogdan,

Maybe it is worth mentioning that we have binary logs enabled on the first data node, and also that after the crash the second node had a lot of DB write activity before we started the recovery of the first node.

Thank you for your time spent on this case.

Best regards,
Zeljko
[6 Dec 2017 16:29]
MySQL Verification Team
Hi,

A few questions:
- any chance the missing data was InnoDB/MyISAM and not ndbcluster data?
- any chance you started the "broken" node first?

The potential sequence that can lead to this type of situation is (see the sketch below):
1. stop cluster (crash, shutdown, whatever...)
2. destroy the data on node1 and start it (rm -rf datadir, or ndbd --initial); at this point you have only node1 running and it is empty
3. start node2, which used to have the good data

In this case the cluster will wipe the data from node2 (expected behavior), as the "latest valid data" is the empty cluster (node1). Any chance your scenario can fit into this?

all best
bogdan
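[Editor's note] For illustration only, a minimal shell sketch of that sequence (node names, paths and the data directory are placeholders, not taken from this cluster):

# Whole cluster is down after a crash or shutdown.

# On node1: its NDB file system is destroyed, then it is started first,
# so the cluster comes up with node1's empty state as the newest valid state.
rm -rf /var/lib/mysql-cluster/ndbd-data/*    # or equivalently: ndbd --initial
ndbd

# On node2 (which still held the good data): it is started afterwards and
# synchronizes to the already-running, empty cluster, so its data is wiped.
ndbd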
[7 Dec 2017 14:09]
MySQL Verification Team
Hi,

An additional question: are you maybe using NOLOGGING tables? https://dev.mysql.com/doc/refman/5.7/en/create-table-ndb-table-comment-options.html

Nologging tables will survive a node crash, but they will not survive a cluster crash (what you had), because nologging tables are not included in the redo logs. Since they cannot be consistently restored together with the other data, they are not included in the LCP either.

all best
Bogdan
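[Editor's note] For illustration, two ways a table can end up as nologging. This is a hedged sketch: the comment option is the one on the linked 5.7 / NDB 7.5 manual page, the ndb_table_no_logging session variable is the older mechanism, and the database/table names here are placeholders.

# Via the NDB_TABLE comment option (documented for 5.7 / NDB 7.5 at the URL above):
mysql testdb -e "CREATE TABLE t1 (id INT PRIMARY KEY, v VARCHAR(64)) ENGINE=NDBCLUSTER COMMENT='NDB_TABLE=NOLOGGING=1';"

# Via the session variable: any NDB table created while it is set to 1 is non-logging.
mysql testdb -e "SET ndb_table_no_logging = 1; CREATE TABLE t2 (id INT PRIMARY KEY) ENGINE=NDBCLUSTER;"

Either way, the nologging property is stored in the table metadata, so it is carried along in cluster backups.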
[14 Dec 2017 2:29]
MySQL Verification Team
Timeline from the cluster log (ndb_1_cluster.log.5):

1. Cluster ok, e.g.
2017-07-04 09:07:03 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
2017-07-04 09:07:03 [MgmtSrvr] INFO -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
2017-07-04 09:07:03 [MgmtSrvr] INFO -- Node 2: Local checkpoint 222 completed

2. MGMD is killed and restarted
2017-07-04 17:11:29 [MgmtSrvr] INFO -- Loaded config from '/var/lib/mysql-cluster//ndb_1_config.bin.1'
2017-07-04 17:11:29 [MgmtSrvr] INFO -- Id: 1, Command port: *:1186
2017-07-04 17:11:29 [MgmtSrvr] INFO -- MySQL Cluster Management Server mysql-5.6.25 ndb-7.4.7 started
2017-07-04 17:11:29 [MgmtSrvr] INFO -- Node 1: Node 1 Connected
2017-07-04 17:11:29 [MgmtSrvr] INFO -- Node 1 connected

3. Nodes connect for System Restart (SR)
2017-07-04 17:15:52 [MgmtSrvr] INFO -- Node 2: System Restart: master node: 2, num starting: 2, gci: 6852095
...
2017-07-04 17:16:03 [MgmtSrvr] INFO -- Node 2: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 60 millis, 35283 rows/s
2017-07-04 17:16:03 [MgmtSrvr] INFO -- Node 3: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 91 millis, 23263 rows/s

4. Node 3 has a problem with REDO 'corruption':
2017-07-04 17:16:03 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

5. Another attempt at SR, fails in the same way:
2017-07-04 17:32:07 [MgmtSrvr] INFO -- Node 2: System Restart: master node: 2, num starting: 2, gci: 6852095
...
2017-07-04 17:32:13 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

6. Another attempt at SR, but node 3 has initial specified from somewhere
2017-07-04 17:33:16 [MgmtSrvr] INFO -- Node 2: Start phase 0 completed
2017-07-04 17:33:16 [MgmtSrvr] INFO -- Node 2: Communication to Node 3 opened
...
2017-07-04 17:34:18 [MgmtSrvr] INFO -- Node 3: Communication to Node 2 opened
2017-07-04 17:34:18 [MgmtSrvr] INFO -- Node 3: Initial start, waiting for 2 to connect, nodes [ all: 2 and 3 connected: 3 no-wait: ]
2017-07-04 17:34:18 [MgmtSrvr] INFO -- Node 2: Node 3 Connected
2017-07-04 17:34:18 [MgmtSrvr] INFO -- Node 3: Node 2 Connected
2017-07-04 17:34:18 [MgmtSrvr] INFO -- Node 2: Start with all nodes 2 and 3
...
2017-07-04 17:34:21 [MgmtSrvr] INFO -- Node 2: System Restart: master node: 1, num starting: 2, gci: 6852095
Note that the System Restart log says 'master node: 1', but it is node 2; node 1 is the MGMD.

7. SR proceeds, just recovering node 2:
2017-07-04 17:34:27 [MgmtSrvr] INFO -- Node 2: LDM instance 0: Restored LCP : 14 fragments, 2117 rows, 19 millis, 111421 rows/s

8. Start of node 2 completes:
2017-07-04 17:34:28 [MgmtSrvr] INFO -- Node 2: Start phase 101 completed (system restart)
2017-07-04 17:34:28 [MgmtSrvr] INFO -- Node 2: Started (mysql-5.6.25 ndb-7.4.7)

9. Node 3 begins an initial node restart (INR) as part of the SR to join node 2
2017-07-04 17:34:28 [MgmtSrvr] INFO -- Node 2: Start node: 3 using initial node restart as part of system restart

10. Node 3 copies schema, activates indexes etc. ok

11. Node 3 restores from LCP (but there is none, as it is an INR)
2017-07-04 17:39:08 [MgmtSrvr] INFO -- Node 3: LDM instance 0: Restored LCP : 0 fragments, 0 rows, 0 millis, 0 rows/s

12. Node 3 copies fragments from node 2:
2017-07-04 17:39:08 [MgmtSrvr] INFO -- Node 3: LDM(0): CopyFrag complete. 124 frags, +2117/-3279 rows, 376936 bytes/76 ms 4959684 bytes/s.
Note that 124 fragments are mentioned here, whereas 14 fragments are mentioned in the LCP restore. However, the number of rows added (2117) aligns.

13. Wait for an LCP to complete:
2017-07-04 17:39:08 [MgmtSrvr] INFO -- Node 2: Local checkpoint 224 started. Keep GCI = 6852095 oldest restorable GCI = 6852095
2017-07-04 17:39:08 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
2017-07-04 17:39:09 [MgmtSrvr] INFO -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
2017-07-04 17:39:09 [MgmtSrvr] INFO -- Node 2: Local checkpoint 224 completed

14. Restart completes
2017-07-04 17:39:12 [MgmtSrvr] INFO -- Node 3: Start phase 101 completed (initial node restart)
2017-07-04 17:39:12 [MgmtSrvr] INFO -- Node 3: Started (mysql-5.6.25 ndb-7.4.7)

15. APIs connect
2017-07-04 17:53:57 [MgmtSrvr] INFO -- Node 3: Node 5: API mysql-5.6.25 ndb-7.4.7
2017-07-04 17:54:23 [MgmtSrvr] INFO -- Node 2: Node 4: API mysql-5.6.25 ndb-7.4.7

16. Restore of metadata from a backup occurs
2017-07-04 18:46:00 [MgmtSrvr] INFO -- Alloc node id 4 succeeded
2017-07-04 18:46:00 [MgmtSrvr] INFO -- Nodeid 4 allocated for API at 192.168.0.4
2017-07-04 18:46:00 [MgmtSrvr] INFO -- Node 4: ndb_restore -b 15 -n 2 -m -e -p 128
2017-07-04 18:46:17 [MgmtSrvr] INFO -- Node 4: Restore meta data: backup 15 from node 2 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0

17. Restore of data from a backup occurs
2017-07-04 18:47:08 [MgmtSrvr] INFO -- Node 4: ndb_restore -b 15 -n 2 -r -e -p 128
2017-07-04 18:47:09 [MgmtSrvr] INFO -- Node 4: Restore started: backup 15 from node 2
2017-07-04 18:47:09 [MgmtSrvr] INFO -- Node 4: Restore meta data: backup 15 from node 2 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0
2017-07-04 18:51:02 [MgmtSrvr] INFO -- Node 4: Restore data: backup 15 from node 2 #Records: 2564234 Data: 9125230207 bytes
2017-07-04 18:51:02 [MgmtSrvr] INFO -- Node 4: Restore log: backup 15 from node 2 #Records: 0 Data: 0 bytes
2017-07-04 18:51:02 [MgmtSrvr] INFO -- Node 4: Restore completed: backup 15 from node 2
...
2017-07-04 18:51:49 [MgmtSrvr] INFO -- Node 5: ndb_restore -b 15 -n 3 -r -e -p 128
2017-07-04 18:51:50 [MgmtSrvr] INFO -- Node 5: Restore started: backup 15 from node 3
2017-07-04 18:51:50 [MgmtSrvr] INFO -- Node 5: Restore meta data: backup 15 from node 3 #Tables: 13 #Tablespaces: 0 #Logfilegroups: 0 #datafiles: 0 #undofiles: 0
2017-07-04 19:00:45 [MgmtSrvr] INFO -- Node 5: Restore data: backup 15 from node 3 #Records: 2545933 Data: 9053962610 bytes
2017-07-04 19:00:45 [MgmtSrvr] INFO -- Node 5: Restore log: backup 15 from node 3 #Records: 0 Data: 0 bytes
2017-07-04 19:00:45 [MgmtSrvr] INFO -- Node 5: Restore completed: backup 15 from node 3
[14 Dec 2017 2:31]
MySQL Verification Team
1) The LCP size in the 'before' case is small relative to the backup size:

LCP:
2017-07-04 09:07:03 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076
2017-07-04 09:07:03 [MgmtSrvr] INFO -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2117, #bytes = 87076

Backup:
2017-07-04 18:51:02 [MgmtSrvr] INFO -- Node 4: Restore data: backup 15 from node 2 #Records: 2564234 Data: 9125230207 bytes
2017-07-04 19:00:45 [MgmtSrvr] INFO -- Node 5: Restore data: backup 15 from node 3 #Records: 2545933 Data: 9053962610 bytes

2) It seems like the cluster restores everything that was included in the LCP (14 frags, 2117 rows). Note that the cluster copies the content of 124 fragments as part of CopyFrag.

3) Therefore it seems like the problem is that the tables are not included in the LCP. This is why I am asking whether you are using NOLOGGING.
[14 Dec 2017 2:33]
MySQL Verification Team
> Any chance that significant difference between last state
> in bin logs (enabled only on node1 that crashed last time)
> and the current state (epoch) on the running node (node2)
> can cause the issue?

Hi Željko,

I don't think so, especially not looking at the logs. You had the same problem in October:

2017-10-29 23:26:59 [MgmtSrvr] INFO -- Node 2: Local checkpoint 533 started. Keep GCI = 16484622 oldest restorable GCI = 16483797
2017-10-29 23:26:59 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 14 #records = 2129, #bytes = 87524
2017-10-29 23:27:00 [MgmtSrvr] INFO -- Node 2: LDM(0): Completed LCP, #frags = 14 #records = 2129, #bytes = 87524
2017-10-29 23:27:00 [MgmtSrvr] INFO -- Node 2: Local checkpoint 533 completed
2017-10-30 00:10:07 [MgmtSrvr] INFO -- Node 2: Backup 32 started from node 1
2017-10-30 00:19:04 [MgmtSrvr] INFO -- Node 2: Backup 32 started from node 1 completed. StartGCP: 16518386 StopGCP: 16518889 #Records: 6209360 #LogRecords: 365 Data: 720191156 bytes Log: 14284 bytes

LCP #533 starts and finishes in 1 second, recording 14 fragments and ~87kB on each node.
Backup 32 takes ~9 minutes (to record what should be ~half the data) and writes ~720MB overall (~360MB per node).

Looking at all of the LCP logs since July, the LCP has *never* contained more than 14 fragments, and always exactly the same amount of data... So for some reason your data is not in the LCP, and this *will* happen again!
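[Editor's note] For reference, a quick way to spot this pattern in a cluster log of your own (the file name matches the log quoted above; the grep patterns are assumptions based on the message formats shown there):

# How much data do the local checkpoints actually contain?
grep "Completed LCP" ndb_1_cluster.log* | tail -n 20

# Compare with what the backups report:
grep "started from node .* completed" ndb_1_cluster.log* | tail -n 5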
[14 Dec 2017 2:36]
MySQL Verification Team
Can you get me the output of:
* ndb_show_tables
* ndb_desc -pn for all tables

If the tables are NOLOGGING, then that property would also be restored from the metadata in the backup!! A possible fix would be to recreate the tables from SQL without nologging, and then restore the data from the backup.

Anyhow, please get me ndb_show_tables and ndb_desc -pn for each table so we can look (or you can look yourself), but judging from the logs you don't have persistent data there!

all best
Bogdan
[18 Dec 2017 14:02]
Zeljko Zuvic
Hi Bogdan,

Sorry for my delayed response. I was on a private trip last week and hence didn't provide the requested information.

As you requested, I captured and attached the output of the ndb_show_tables command. On the other hand, for the command ndb_desc -pn I got the same short output for all tables, for example:

ndb_desc -pn -t user_record
NDBT_ProgramExit: 0 - OK

Is that the expected output, or have I missed something?

Many thanks for your support on this case.

Br,
Zeljko
[18 Dec 2017 14:27]
MySQL Verification Team
Hi,

As for ndb_desc, call it with:

ndb_desc tablename -c connstring -d database -t tablename

(so yes, the tablename twice), for example:

[root@localhost bin]# ./ndb_desc nekitest -c localhost -u -n -p -i -d test -t nekitest
-- nekitest --
Version: 5
Fragment type: HashMapPartition
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 4
Number of primary keys: 1
Length of frm data: 307
Max Rows: 0
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
PartitionCount: 2
FragmentCount: 2
PartitionBalance: FOR_RP_BY_LDM
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
Table options:
HashMap: DEFAULT-HASHMAP-3840-2
-- Attributes --
id Bigint PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR
x Int NULL AT=FIXED ST=MEMORY
y Int NULL AT=FIXED ST=MEMORY
z Varchar(255;latin1_swedish_ci) NULL AT=SHORT_VAR ST=MEMORY
-- Indexes --
PRIMARY KEY(id) - UniqueHashIndex
PRIMARY(id) - OrderedIndex
-- Per partition info --
Partition  Row count  Commit count  Frag fixed memory  Frag varsized memory  Extent_space  Free extent_space  Nodes
0          10132      10132         458752             557056                0             0                  2,3
1          9868       9868          458752             557056                0             0                  3,2
-- nekitest/PRIMARY --
Version: 5
Base table: nekitest
Number of attributes: 1
Logging: 0
Index type: OrderedIndex
Index status: Retrieved
-- Attributes --
id Bigint NOT NULL AT=FIXED ST=MEMORY
-- IndexTable 20/PRIMARY --
Version: 5
Fragment type: FragUndefined
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: yes
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 0
Max Rows: 0
Row Checksum: 1
Row GCI: 1
SingleUserMode: 2
ForceVarPart: 0
PartitionCount: 2
FragmentCount: 2
PartitionBalance: FOR_RP_BY_LDM
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
Table options:
-- Attributes --
id Bigint NOT NULL DISTRIBUTION KEY AT=FIXED ST=MEMORY
NDB$TNODE Unsigned [64] PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY
-- Indexes --
PRIMARY KEY(NDB$TNODE) - UniqueHashIndex

NDBT_ProgramExit: 0 - OK
[18 Dec 2017 14:33]
MySQL Verification Team
The more important part: ndb_show_tables. Look at the "logging" column; most of your tables are no-logging.

Everything I wrote about nologging tables applies here: they are not persistent, they do not write to disk (their content is not part of the LCP); they are part of the backup, but that's it. They will survive a single node restart but not a cluster restart; a cluster restart will lose all their data.

How you ended up with nologging tables I can't say, but that property would be restored from the metadata in the backup, so every time you restore from a backup they are again set to nologging.

One fix would be to recreate the tables from SQL without nologging, and then restore the data from the backup. So you do a mysqldump to dump the structure, modify it if needed, make a cluster backup, then initialize the cluster, create the tables from the SQL script created with mysqldump, and restore the data (without metadata) from the cluster backup (see the sketch below).

Sometime in the past you set those tables to be nologging. This reduces the disk usage of the cluster significantly (as no LCP is performed for those tables).

hope this helps
Bogdan
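[Editor's note] A hedged shell sketch of that procedure (database name, backup id, node ids, connect string and backup path are placeholders; verify every step on a test system before running it against production):

# 1. Dump only the table structure from a SQL node; edit the dump to remove any
#    nologging-related table options before using it.
mysqldump --no-data mydb > schema.sql

# 2. Take a native cluster backup (here assumed to produce backup id 16).
ndb_mgm -e "START BACKUP WAIT COMPLETED"

# 3. Re-initialize the cluster (data nodes started with --initial), then
#    recreate the tables, now as logging tables, from the corrected dump.
mysql mydb < schema.sql

# 4. Restore the data only (-r, without -m), once per data node that wrote the
#    backup, so the old nologging metadata is never brought back.
ndb_restore -c mgmhost:1186 -b 16 -n 2 -r --backup_path=/path/to/BACKUP/BACKUP-16
ndb_restore -c mgmhost:1186 -b 16 -n 3 -r --backup_path=/path/to/BACKUP/BACKUP-16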
[18 Dec 2017 14:52]
MySQL Verification Team
btw, no need for the ndb_desc output now; we can see that all tables are nologging from ndb_show_tables

all best
b.
[18 Dec 2017 15:31]
Zeljko Zuvic
Hi Bogdan,

Thank you for the confirmation; everything makes sense now. I cannot figure out how we ended up with nologging tables, but I will investigate. Also, on our TEST environment the logging parameter is set to YES, and hence I could not recreate the issue there. The fix you have proposed also sounds good, so I am going to prepare the conditions for it.

Many thanks for all your support so far, not just on this case. Every time, you have guided us in the right direction, finding the exact cause and also suggesting a fix for it. Really GREAT support, I am impressed! :)

All the best!
Zeljko
[18 Dec 2017 16:20]
MySQL Verification Team
Hi Zeljko,

You are welcome.

> Thank you for the confirmation; everything makes sense now.
> I cannot figure out how we ended up with nologging tables,
> but I will investigate.

It is possible that at some point in the past someone set the tables to be nologging, maybe because of disk I/O issues, without understanding all the implications of that action. By making tables nologging you reduce disk I/O a lot (since nothing is written to disk), but then, as you experienced, the tables don't survive a cluster crash/shutdown.

all best
Bogdan

p.s. I changed the bug status to "not a bug"