Bug #82970 | MySQL Cluster downtime during data backup with mysqldump | |
---|---|---|---
Submitted: | 13 Sep 2016 13:43 | Modified: | 5 Nov 2016 22:19 |
Reporter: | Zeljko Zuvic | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 5.6.25-ndb-7.4.7-cluster-gpl | OS: | CentOS (6.6) |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[13 Sep 2016 13:43]
Zeljko Zuvic
[13 Sep 2016 13:52]
Zeljko Zuvic
arbitrator node log
Attachment: NDB_arbitrator_node_log.txt (text/plain), 1.23 KiB.
[13 Sep 2016 13:52]
Zeljko Zuvic
data node log
Attachment: NDB_data_node_log.txt (text/plain), 986 bytes.
[16 Sep 2016 14:22]
MySQL Verification Team
Hi Zeljko,

I'm not able to reproduce this at all. I originally tested with 7.4.10 (that one was already set up for some backup testing), but then retested with 7.4.7 and I don't see the downtime you are talking about. I guess you are seeing this regularly, so can you please reproduce it again, but have some monitoring tool (sysstat should be enough) monitor CPU/IO/RAM usage on all the nodes during the run, and then upload both the monitoring logs and the whole ndb_error_reporter output, not only one log. From what you sent I see some nodes disconnected, but I have no clue why, so without full logs I can't take this any further. I'm pretty sure we are talking about a misconfigured cluster here rather than a bug, but let's see the logs to confirm it. Also, I hope you know that the proper way to back up an ndbcluster is not with mysqldump :)

take care
Bogdan
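(For reference, a native NDB backup is taken through the management client rather than with mysqldump, and ndb_error_reporter bundles the logs from all nodes into a single archive. A minimal sketch; the config.ini path below is an assumption and has to be adjusted to the management node's actual configuration.)

    # Online, cluster-native backup started from the management client;
    # each data node writes a BACKUP-<id> directory under its BackupDataDir
    ndb_mgm -e "START BACKUP WAIT COMPLETED"

    # Collect the cluster log, node logs and trace files from all nodes in one archive
    ndb_error_reporter /var/lib/mysql-cluster/config.ini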
[16 Sep 2016 14:37]
Zeljko Zuvic
Hi Bogdan,

Thank you for checking this. Unfortunately this issue happens from time to time. We have a scheduled job for a full backup every Sunday night, and hopefully I will collect useful information on Monday morning. I will then provide you the logs as requested.

Thanks a lot for your support.
Zeljko
[16 Sep 2016 15:25]
MySQL Verification Team
Hi Zeljko,

You are most welcome. Please make sure you have some monitoring turned on so we can look at the RAM/CPU/IO usage on all nodes (both data and sql) during the backup.

best regards
bogdan
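(For reference, on CentOS 6 the sysstat collector mentioned above can be enabled roughly like this; a sketch, with sampling driven by the stock /etc/cron.d/sysstat job.)

    # Install and enable the sysstat collector (CentOS 6 style service management)
    yum install -y sysstat
    chkconfig sysstat on
    service sysstat start

    # History is gathered by sa1 via /etc/cron.d/sysstat (every 10 minutes by default)
    # and can later be read back with sar:
    sar -u    # CPU usage
    sar -r    # memory usage
    sar -b    # I/O and transfer rates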
[23 Sep 2016 14:58]
Zeljko Zuvic
Hi Bogdan,

We have turned on SAR logging on both nodes, but during the last MySQL dump (full backup) everything went fine. Therefore I will wait for this Sunday to see if the error appears again.

Thanks,
Zeljko
[26 Sep 2016 15:21]
Zeljko Zuvic
Hi Bogdan,

I have uploaded the files you requested from this Sunday. This time ndb_cluster stopped the ndbd service on the first data node, and my colleague who was on duty restarted it. Please let me know if you need any additional information.

Thanks,
Zeljko
[26 Sep 2016 15:33]
MySQL Verification Team
Greetings Zeljko,

Thanks for the detailed data. I do wonder why you didn't pack all the logs with ndb_error_reporter, but from what I see I have everything I need to analyze the problem. I hope I'll have something for you soon.

Cheers
Bogdan
[4 Oct 2016 14:03]
Zeljko Zuvic
Hello Bogdan,

Do you have any update regarding the analysis of the logs, or do you need some more information from our side?

Best regards,
Zeljko
[5 Oct 2016 21:56]
MySQL Verification Team
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: BackupRecord 0: BackupId: 14410 MasterRef: f70002 ClientRef: 0
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: State: 4
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: file 0: type: 3 flags: H'19
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: Backup - dump pool sizes
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: BackupPool: 2 BackupFilePool: 4 TablePool: 323
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: AttrPool: 2 TriggerPool: 4 FragmentPool: 323
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: PagePool: 1571
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: == LQH LCP STATE ==
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: clcpCompletedState=1, c_lcpId=14410, cnoOfFragsCheckpointed=63
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: lcpState=7 lastFragmentFlag=0
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: currentFragment.fragPtrI=31
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: currentFragment.lcpFragOrd.tableId=19
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: reportEmpty=0
2016-09-25 20:55:01 [MgmtSrvr] INFO -- Node 2: m_EMPTY_LCP_REQ=0000000000000000
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: Communication to Node 2 closed
2016-09-25 20:55:07 [MgmtSrvr] ALERT -- Node 3: Network partitioning - arbitration required
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: President restarts arbitration thread [state=7]
2016-09-25 20:55:07 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected
2016-09-25 20:55:07 [MgmtSrvr] ALERT -- Node 3: Arbitration won - positive reply from node 1
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: NR Status: node=2,OLD=Initial state,NEW=Node failed, fail handling ongoing
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: LCP Take over started
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: ParticipatingDIH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: ParticipatingLQH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=2 000000000000000c]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: GCP Take over started
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: Node 3 taking over as DICT master
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: GCP Monitor: unlimited lags allowed
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: GCP Take over completed
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: kk: 14415113/712 0 1
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: LCP Take over completed (state = 5)
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: ParticipatingDIH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: ParticipatingLQH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0
2016-09-25 20:55:07 [MgmtSrvr] ALERT -- Node 3: Node 2 Disconnected
2016-09-25 20:55:07 [MgmtSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=24250002eb29e599]
2016-09-25 20:55:08 [MgmtSrvr] INFO -- Node 3: NR Status: node=2,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2016-09-25 20:55:10 [MgmtSrvr] INFO -- Node 3: Communication to Node 2 opened
2016-09-25 20:56:01 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 68 #records = 2780087, #bytes = 9818490296
2016-09-25 20:56:01 [MgmtSrvr] INFO -- Node 3: Local checkpoint 14410 completed
2016-09-25 20:56:02 [MgmtSrvr] INFO -- Node 3: Stall LCP, LCP time = 399 secs, wait for Node2, state Node failure handling complete
2016-09-25 20:56:02 [MgmtSrvr] INFO -- Node 3: Stall LCP: current stall time: 0 secs, max wait time:29 secs
2016-09-25 20:56:26 [MgmtSrvr] INFO -- Node 3: Local checkpoint 14411 started. Keep GCI = 14414980 oldest restorable GCI = 14415123
2016-09-25 20:56:53 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11.
2016-09-25 21:04:19 [MgmtSrvr] INFO -- Node 3: LDM(0): Completed LCP, #frags = 68 #records = 2780090, #bytes = 9818490920
[5 Oct 2016 21:58]
MySQL Verification Team
Time: Sunday 25 September 2016 - 21:03:29
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: /export/home/pb2/build/sb_0-15767947-1435843549.5/mysql-cluster-gpl-7.4.7/storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbd
Pid: 6287
Version: mysql-5.6.25 ndb-7.4.7
Trace: /usr/local/m
[5 Oct 2016 22:00]
MySQL Verification Team
2016-09-25 20:54:16 [ndbd] INFO -- findNeighbours from: 5070 old (left: 2 right: 2) new (65535 65535)
2016-09-25 20:54:16 [ndbd] ALERT -- Network partitioning - arbitration required
2016-09-25 20:54:16 [ndbd] INFO -- President restarts arbitration thread [state=7]
2016-09-25 20:54:16 [ndbd] ALERT -- Arbitration won - positive reply from node 1
2016-09-25 20:54:16 [ndbd] INFO -- NR Status: node=2,OLD=Initial state,NEW=Node failed, fail handling ongoing
start_resend(0, empty bucket (14415113/712 14415113/711) -> active
2016-09-25 20:54:16 [ndbd] INFO -- DBTC instance 0: Starting take over of node 2
execGCP_NOMORETRANS(14415113/712) c_ongoing_take_over_cnt -> seize
2016-09-25 20:54:16 [ndbd] INFO -- DBTC instance 0: Completed take over of failed node 2
completing gcp 14415113/712 in execTAKE_OVERTCCONF
2016-09-25 20:54:17 [ndbd] INFO -- Started arbitrator node 1 [ticket=24250002eb29e599]
2016-09-25 20:54:17 [ndbd] INFO -- We are adjusting Max Disk Write Speed, a restart is ongoing now
2016-09-25 20:54:18 [ndbd] INFO -- NR Status: node=2,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2016-09-25 20:54:18 [ndbd] INFO -- Node 2 has completed node fail handling
2016-09-25 20:55:11 [ndbd] INFO -- We are adjusting Max Disk Write Speed, no restarts ongoing anymore
2016-09-25 21:55:44 [ndbd] INFO -- NR Status: node=2,OLD=Node failure handling complete,NEW=Allocated node id
2016-09-25 21:55:54 [ndbd] INFO -- findNeighbours from: 2246 old (left: 65535 right: 65535) new (2 2)
2016-09-25 21:55:54 [ndbd] INFO -- NR Status: node=2,OLD=Allocated node id,NEW=Included in heartbeat protocol
2016-09-25 21:55:54 [ndbd] INFO -- NR Status: node=2,OLD=Included in heartbeat protocol,NEW=NDBCNTR master permitted us
2016-09-25 21:55:54 [ndbd] INFO -- NR Status: node=2,OLD=NDBCNTR master permitted us,NEW=All nodes permitted us
2016-09-25 21:55:54 [ndbd] INFO -- NR Status: node=2,OLD=All nodes permitted us,NEW=Wait for LCP complete to copy meta data
2016-09-25 21:55:54 [ndbd] INFO -- NR Status: node=2,OLD=Wait for LCP complete to copy meta data,NEW=Copy meta data to start node
2016-09-25 21:55:55 [ndbd] INFO -- We are adjusting Max Disk Write Speed, a restart is ongoing now
2016-09-25 21:56:02 [ndbd] INFO -- NR Status: node=2,OLD=Copy meta data to start node,NEW=Include node in LCP/GCP protocols
2016-09-25 21:56:02 [ndbd] INFO -- NR Status: node=2,OLD=Include node in LCP/GCP protocols,NEW=Restore fragments ongoing
2016-09-25 21:57:50 [ndbd] INFO -- NR Status: node=2,OLD=Restore fragments ongoing,NEW=Undo Disk data ongoing
2016-09-25 21:57:50 [ndbd] INFO -- NR Status: node=2,OLD=Undo Disk data ongoing,NEW=Execute REDO logs ongoing
2016-09-25 21:57:50 [ndbd] INFO -- NR Status: node=2,OLD=Execute REDO logs ongoing,NEW=Build indexes ongoing
2016-09-25 21:57:52 [ndbd] INFO -- NR Status: node=2,OLD=Build indexes ongoing,NEW=Synchronize start node with live nodes
2016-09-25 21:57:52 [ndbd] INFO -- NR Status: node=2,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2016-09-25 22:01:05 [ndbd] INFO -- NR Status: node=2,OLD=Wait LCP to ensure durability,NEW=Wait handover of subscriptions
2016-09-25 22:01:05 [ndbd] INFO -- granting SumaStartMe dict lock to 2
2016-09-25 22:01:05 [ndbd] INFO -- clearing SumaStartMe dict lock for 2
2016-09-25 22:01:07 [ndbd] INFO -- granting SumaStartMe dict lock to 2
prepare to handover bucket: 0 14417063/0 (14417062/4294967295)
switchover complete bucket 0 state: 2 handover
[5 Oct 2016 22:18]
MySQL Verification Team
20:45:00  Backup starting
20:55:07  Node 3 reports Node 2 is gone
20:55:07  Node 1 reports Node 2 is gone
20:55:07  Network partitioning detected, arbitrator is removing node2 from the cluster, node 3 is the only remaining data node
20:55:10  Node 2 is alive
20:56:53  Node 2 is forced down by arbitrator

Now, the node2 error log only mentions the crash at 21:03:29. That is more than 6 minutes after all this happened. I assume node2 was "stuck/frozen" all this time and it only failed 6 minutes later. Looking at the error log from node2, I can see that at 21:01:50 (2 minutes after the issue was noticed on node1/3):

2016-09-25 21:01:50 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2016-09-25 21:01:50 [ndbd] INFO -- Watchdog: User time: 132298 System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=200
2016-09-25 21:01:50 [ndbd] INFO -- Watchdog: User time: 132298 System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=300
2016-09-25 21:01:50 [ndbd] INFO -- Watchdog: User time: 132298 System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=401
2016-09-25 21:01:50 [ndbd] INFO -- Watchdog: User time: 132298 System time: 286573
...

node2 starts complaining that the ndb kernel thread is stuck, but the CPU data for node2 does not show a problem at this point in time:

08:50:01 PM  all   0.04  0.00  0.10   0.96  0.00  98.90
09:00:01 PM  all   0.09  0.00  1.50   2.14  0.00  96.26
09:10:01 PM  all  10.18  0.00  3.19  23.06  0.00  63.57
09:20:01 PM  all  12.22  0.00  0.71   2.47  0.00  84.60

The CPU usage (fairly light) starts at 21:10 ... that is a long time after everything had already happened. IO does show some activity, but nothing serious. It is obvious that node2 froze, but apart from that there is nothing saying why it froze. I would suggest a thorough hardware check of that node before continuing further. Also, I assume all your nodes have synchronized time; if not, all these logs don't make much sense. I doubt this is a bug, rather a hw issue (please inspect syslog, dmesg..), but I can't rule anything out.

take care
Bogdan
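(For reference, the daily sar history quoted above can be replayed for just the incident window like this; the sa25 filename corresponds to the 25th of the month, and the path and times are assumptions matching this incident.)

    # CPU usage for the incident window, read back from the daily sysstat file
    sar -u -f /var/log/sa/sa25 -s 20:45:00 -e 21:10:00
    # I/O activity for the same window
    sar -b -f /var/log/sa/sa25 -s 20:45:00 -e 21:10:00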
[5 Oct 2016 22:23]
MySQL Verification Team
Suggested course of action:
- check hardware, inspect dmesg/syslog
- you are expecting a new dump to happen during the weekend; check whether node2 is again the one that fails, and whether it fails the same way (same freeze and watchdog kill)
- if this is the case (most likely), restart node2 with --initial (this will delete ALL data from node2 and rebuild it from scratch, downloading the data from node3, so it will take some time and the system will be under some load) to solve any potential ndb filesystem error you might have on node2; a sketch of such a restart follows below
- while you are already using a maintenance window for restarting node2, use it to upgrade to the latest 7.4

all best
Bogdan Kecman
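(A rough sketch of an initial restart of data node 2; the node ID comes from the logs above, while the connect string is a placeholder that must match this cluster's actual management node.)

    # From the management client: stop data node 2
    ndb_mgm -e "2 STOP"

    # On the node2 host: wipe its local NDB file system and rejoin the cluster,
    # copying all data back from node 3 (takes time and adds load on the cluster)
    ndbd --initial --ndb-connectstring=mgmt_host    # mgmt_host is a placeholder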
[6 Nov 2016 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".