Bug #82970 MYSQL cluster downtime during data backup with mysqldump
Submitted: 13 Sep 2016 13:43 Modified: 5 Nov 2016 22:19
Reporter: Zeljko Zuvic
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 5.6.25-ndb-7.4.7-cluster-gpl    OS: CentOS (6.6)
Assigned to: MySQL Verification Team    CPU Architecture: Any

[13 Sep 2016 13:43] Zeljko Zuvic
Description:
Hi,

We have an issue with our NDB cluster (two replicas) during data backup (mysqldump). The cluster goes down for a short time (8 seconds in this case) and comes back up, but the backup is interrupted and not completed.
I am going to attach mysqld.log and the NDB cluster logs covering the time frame when the issue happened.

Could you please let us know what the possible cause of this behavior could be?

Thanks,

Zeljko 

How to repeat:
The issue occurs from time to time during the scheduled backup.
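For reference, the dump runs from a weekly cron job roughly like the one below; the exact command, schedule and paths are illustrative, not our production script:

# crontab entry on the SQL node -- weekly full dump on Sunday night (illustrative only)
# 0 23 * * 0 /usr/bin/mysqldump --all-databases --routines > /backup/full_$(date +\%F).sql 2>> /backup/dump.err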
[13 Sep 2016 13:52] Zeljko Zuvic
arbitrator node log

Attachment: NDB_arbitrator_node_log.txt (text/plain), 1.23 KiB.

[13 Sep 2016 13:52] Zeljko Zuvic
data node log

Attachment: NDB_data_node_log.txt (text/plain), 986 bytes.

[16 Sep 2016 14:22] MySQL Verification Team
Hi Zeljko,

I'm not able to reproduce this at all. I originally tested with 7.4.10 (that one was already set up for some backup testing) and then retested with 7.4.7, and I don't see the downtime you are describing.

I guess you are seeing this regularly, so can you please reproduce it again, but have some monitoring tool (sysstat should be enough) track CPU/IO/RAM usage on all the nodes during the run, and then upload both the monitoring logs and the whole ndb_error_reporter output, not just one log. From what you sent I can see that some nodes disconnected, but I have no clue why, so without the full logs I can't take this any further.
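Something along these lines is enough; the paths and the sampling interval below are just examples, adjust them to your setup:

# on every node (data and SQL): sample CPU, memory and I/O every 60 seconds during the backup window
sar -u -r -b 60 > /tmp/sar_$(hostname)_backup.log &

# on the management host afterwards: pack the logs from all nodes into a single archive
ndb_error_reporter /var/lib/mysql-cluster/config.ini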

I'm pretty sure we are looking at a misconfigured cluster here rather than a bug, but let's see the logs to confirm it. Also, I hope you know that the proper way to back up an ndbcluster is not with mysqldump :)
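For the record, the native online backup is a single command from the management client, roughly as below; the backup files end up under each data node's BackupDataDir (BACKUP/BACKUP-<id> by default) and are restored with ndb_restore:

# start a native online NDB backup from the management client
ndb_mgm -e "START BACKUP"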

take care
Bogdan
[16 Sep 2016 14:37] Zeljko Zuvic
Hi Bogdan,

Thank you for checking this.
Unfortunately this issue happens from time to time.
We have a scheduled job for a full backup every Sunday night, and hopefully I will collect useful information on Monday morning. I will then provide you the logs as requested.
Thanks a lot for your support.

Zeljko
[16 Sep 2016 15:25] MySQL Verification Team
Hi Zeljko, you are most welcome. Please make sure you have some monitoring turned on so we can look at the RAM/CPU/IO usage on all nodes (both data and SQL) during the backup.

best regards
bogdan
[23 Sep 2016 14:58] Zeljko Zuvic
Hi Bogdan,

We have turned on SAR logging on both nodes, but during the last MySQL dump (full backup) everything went fine. I will therefore wait for this Sunday to see whether the error appears again.

Thanks,

Zeljko
[26 Sep 2016 15:21] Zeljko Zuvic
Hi Bogdan,

I have uploaded the files you requested from this Sunday.
This time ndb_cluster stopped the ndbd service on the first data node, and my colleague who was on duty restarted it.
Please let me know if you need some additional information.

Thanks,

Zeljko
[26 Sep 2016 15:33] MySQL Verification Team
Greetings Zeljko,

Thanks for the detailed data. I do wonder why you didn't pack all the logs with ndb_error_reporter, but from what I see I have everything I need to analyze the problem. I hope I'll have something for you soon.

Cheers,
Bogdan
[4 Oct 2016 14:03] Zeljko Zuvic
Hello Bogdan,

Do you have any update on the analysis of the logs, or do you need some more information from our side?

Kind regards,

Zeljko
[5 Oct 2016 21:56] MySQL Verification Team
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: BackupRecord 0: BackupId: 14410 MasterRef: f70002 ClientRef: 0
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  State: 4
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  file 0: type: 3 flags: H'19
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: Backup - dump pool sizes
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: BackupPool: 2 BackupFilePool: 4 TablePool: 323
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: AttrPool: 2 TriggerPool: 4 FragmentPool: 323
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: PagePool: 1571
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: == LQH LCP STATE ==
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  clcpCompletedState=1, c_lcpId=14410, cnoOfFragsCheckpointed=63
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  lcpState=7 lastFragmentFlag=0
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: currentFragment.fragPtrI=31
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2: currentFragment.lcpFragOrd.tableId=19
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  reportEmpty=0
2016-09-25 20:55:01 [MgmtSrvr] INFO     -- Node 2:  m_EMPTY_LCP_REQ=0000000000000000
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 closed
2016-09-25 20:55:07 [MgmtSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2016-09-25 20:55:07 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2016-09-25 20:55:07 [MgmtSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Initial state,NEW=Node failed, fail handling ongoing
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: LCP Take over started
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=2 000000000000000c]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: GCP Take over started
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: Node 3 taking over as DICT master
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: GCP Monitor: unlimited lags allowed
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: GCP Take over completed
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: kk: 14415113/712 0 1
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: LCP Take over completed (state = 5)
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000008
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000008]
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0
2016-09-25 20:55:07 [MgmtSrvr] ALERT    -- Node 3: Node 2 Disconnected
2016-09-25 20:55:07 [MgmtSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=24250002eb29e599]
2016-09-25 20:55:08 [MgmtSrvr] INFO     -- Node 3: NR Status: node=2,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2016-09-25 20:55:10 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 opened
2016-09-25 20:56:01 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 68 #records = 2780087, #bytes = 9818490296
2016-09-25 20:56:01 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 14410 completed
2016-09-25 20:56:02 [MgmtSrvr] INFO     -- Node 3: Stall LCP, LCP time = 399 secs, wait for Node2, state Node failure handling complete
2016-09-25 20:56:02 [MgmtSrvr] INFO     -- Node 3: Stall LCP: current stall time: 0 secs, max wait time:29 secs
2016-09-25 20:56:26 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 14411 started. Keep GCI = 14414980 oldest restorable GCI = 14415123
2016-09-25 20:56:53 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11.
2016-09-25 21:04:19 [MgmtSrvr] INFO     -- Node 3: LDM(0): Completed LCP, #frags = 68 #records = 2780090, #bytes = 9818490920
[5 Oct 2016 21:58] MySQL Verification Team
Time: Sunday 25 September 2016 - 21:03:29
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: /export/home/pb2/build/sb_0-15767947-1435843549.5/mysql-cluster-gpl-7.4.7/storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbd
Pid: 6287
Version: mysql-5.6.25 ndb-7.4.7
Trace: /usr/local/m
[5 Oct 2016 22:00] MySQL Verification Team
2016-09-25 20:54:16 [ndbd] INFO     -- findNeighbours from: 5070 old (left: 2 right: 2) new (65535 65535)
2016-09-25 20:54:16 [ndbd] ALERT    -- Network partitioning - arbitration required
2016-09-25 20:54:16 [ndbd] INFO     -- President restarts arbitration thread [state=7]
2016-09-25 20:54:16 [ndbd] ALERT    -- Arbitration won - positive reply from node 1
2016-09-25 20:54:16 [ndbd] INFO     -- NR Status: node=2,OLD=Initial state,NEW=Node failed, fail handling ongoing
start_resend(0, empty bucket (14415113/712 14415113/711) -> active
2016-09-25 20:54:16 [ndbd] INFO     -- DBTC instance 0: Starting take over of node 2
execGCP_NOMORETRANS(14415113/712) c_ongoing_take_over_cnt -> seize
2016-09-25 20:54:16 [ndbd] INFO     -- DBTC instance 0: Completed take over of failed node 2
completing gcp 14415113/712 in execTAKE_OVERTCCONF
2016-09-25 20:54:17 [ndbd] INFO     -- Started arbitrator node 1 [ticket=24250002eb29e599]
2016-09-25 20:54:17 [ndbd] INFO     -- We are adjusting Max Disk Write Speed, a restart is ongoing now
2016-09-25 20:54:18 [ndbd] INFO     -- NR Status: node=2,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2016-09-25 20:54:18 [ndbd] INFO     -- Node 2 has completed node fail handling
2016-09-25 20:55:11 [ndbd] INFO     -- We are adjusting Max Disk Write Speed, no restarts ongoing anymore
2016-09-25 21:55:44 [ndbd] INFO     -- NR Status: node=2,OLD=Node failure handling complete,NEW=Allocated node id
2016-09-25 21:55:54 [ndbd] INFO     -- findNeighbours from: 2246 old (left: 65535 right: 65535) new (2 2)
2016-09-25 21:55:54 [ndbd] INFO     -- NR Status: node=2,OLD=Allocated node id,NEW=Included in heartbeat protocol
2016-09-25 21:55:54 [ndbd] INFO     -- NR Status: node=2,OLD=Included in heartbeat protocol,NEW=NDBCNTR master permitted us
2016-09-25 21:55:54 [ndbd] INFO     -- NR Status: node=2,OLD=NDBCNTR master permitted us,NEW=All nodes permitted us
2016-09-25 21:55:54 [ndbd] INFO     -- NR Status: node=2,OLD=All nodes permitted us,NEW=Wait for LCP complete to copy meta data
2016-09-25 21:55:54 [ndbd] INFO     -- NR Status: node=2,OLD=Wait for LCP complete to copy meta data,NEW=Copy meta data to start node
2016-09-25 21:55:55 [ndbd] INFO     -- We are adjusting Max Disk Write Speed, a restart is ongoing now
2016-09-25 21:56:02 [ndbd] INFO     -- NR Status: node=2,OLD=Copy meta data to start node,NEW=Include node in LCP/GCP protocols
2016-09-25 21:56:02 [ndbd] INFO     -- NR Status: node=2,OLD=Include node in LCP/GCP protocols,NEW=Restore fragments ongoing
2016-09-25 21:57:50 [ndbd] INFO     -- NR Status: node=2,OLD=Restore fragments ongoing,NEW=Undo Disk data ongoing
2016-09-25 21:57:50 [ndbd] INFO     -- NR Status: node=2,OLD=Undo Disk data ongoing,NEW=Execute REDO logs ongoing
2016-09-25 21:57:50 [ndbd] INFO     -- NR Status: node=2,OLD=Execute REDO logs ongoing,NEW=Build indexes ongoing
2016-09-25 21:57:52 [ndbd] INFO     -- NR Status: node=2,OLD=Build indexes ongoing,NEW=Synchronize start node with live nodes
2016-09-25 21:57:52 [ndbd] INFO     -- NR Status: node=2,OLD=Synchronize start node with live nodes,NEW=Wait LCP to ensure durability
2016-09-25 22:01:05 [ndbd] INFO     -- NR Status: node=2,OLD=Wait LCP to ensure durability,NEW=Wait handover of subscriptions
2016-09-25 22:01:05 [ndbd] INFO     -- granting SumaStartMe dict lock to 2
2016-09-25 22:01:05 [ndbd] INFO     -- clearing SumaStartMe dict lock for 2
2016-09-25 22:01:07 [ndbd] INFO     -- granting SumaStartMe dict lock to 2
prepare to handover bucket: 0
14417063/0 (14417062/4294967295) switchover complete bucket 0 state: 2
handover
[5 Oct 2016 22:18] MySQL Verification Team
20:45:00 Backup starting
20:55:07 Node 3 reports Node 2 is gone
20:55:07 Node 1 reports Node 2 is gone
20:55:07 Network partitioning detected; the arbitrator removes node 2 from the cluster, leaving node 3 as the only remaining data node
20:55:10 Node 2 is alive
20:56:53 Node 2 is forced down by arbitrator

Now, node2's error log only mentions the crash at 21:03:29. That is more than 6 minutes after all of this happened.
I assume node2 was "stuck/frozen" all that time and only failed some 6 minutes later. Looking at node2's error log I can see that at 21:01:50 (more than 6 minutes after the issue was noticed on nodes 1 and 3):

2016-09-25 21:01:50 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2016-09-25 21:01:50 [ndbd] INFO     -- Watchdog: User time: 132298  System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=200
2016-09-25 21:01:50 [ndbd] INFO     -- Watchdog: User time: 132298  System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=300
2016-09-25 21:01:50 [ndbd] INFO     -- Watchdog: User time: 132298  System time: 286573
2016-09-25 21:01:50 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=401
2016-09-25 21:01:50 [ndbd] INFO     -- Watchdog: User time: 132298  System time: 286573
...

Node2 starts complaining that an NDB kernel thread is stuck.

But the CPU data for node2 does not show a problem at that point in time:

                CPU     %user     %nice   %system   %iowait    %steal     %idle
08:50:01 PM     all      0.04      0.00      0.10      0.96      0.00     98.90
09:00:01 PM     all      0.09      0.00      1.50      2.14      0.00     96.26
09:10:01 PM     all     10.18      0.00      3.19     23.06      0.00     63.57
09:20:01 PM     all     12.22      0.00      0.71      2.47      0.00     84.60

The CPU usage (fairly light) only starts around 21:10, which is long after everything had already happened.

I/O does show some activity, but nothing serious.

It is obvious that node2 froze, but apart from that there is nothing in the logs saying why it froze. I would suggest a thorough hardware check of that node before continuing further. Also, I assume all your nodes have their clocks synchronized; if not, these logs don't make much sense.

I doubt this is a bug; it looks more like a hardware issue (please inspect syslog, dmesg, ...), but I can't rule anything out.

take care
Bogdan
[5 Oct 2016 22:23] MySQL Verification Team
suggested course of action:

 - check the hardware, inspect dmesg/syslog
 - you are expecting a new dump to happen during the weekend; check whether node2 is again the one that fails, and whether it fails the same way (same freeze and watchdog kill)
 - if that is the case (most likely), restart node2 with --initial (this will delete ALL data from node2 and rebuild it from scratch by copying the data over from node3, so it will take some time and the system will be under some load) to clear any potential NDB file system corruption on node2; a rough sketch follows below
 - since you will already be using a maintenance window to restart node2, use it to upgrade to the latest 7.4
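A minimal sketch of those steps, assuming node2 has node id 2 and runs a single ndbd process; adjust ids, paths and the connect string to your setup:

# on node2: look for hardware / kernel trouble around the time of the freeze
dmesg | tail -n 200
grep -iE 'error|fail|oom' /var/log/messages | tail -n 100

# from the management client: check the cluster state and stop node 2 if it is still up
ndb_mgm -e "SHOW"
ndb_mgm -e "2 STOP"

# on node2: wipe its local NDB file system and rejoin, copying all data back from node3
ndbd --initial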

all best
Bogdan Kecman
[6 Nov 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".