MySQL Bugs: #71139: Continuous node shutdown caused by error 6050

Bug #71139	Continuous node shutdown caused by error 6050
Submitted:	13 Dec 2013 7:26	Modified:	8 Mar 2016 8:25
Reporter:	Сергей Кукуев	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.3.2	OS:	Linux (2.6.39-200.24.1.el6uek.x86_64)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	error 6050, massive overload, node crash, watchdog

Description:
The same data node repeatedly shuts down with the same error in the log:

2013-12-05 00:03:24 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

It fails the same way even during restart:

2013-12-05 00:03:24 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

We have 4 data nodes (2x2) sharing servers with 4 NDB API applications (one per data node).

The applications co-located with the healthy data nodes work fine, but the application co-located with the crashing node writes many errors like this to its log:
Commit transaction error : 7/899/-1 , Rowid already allocated

The cluster runs under a load of ~43500 requests per second.
Each request includes approximately 5 PK operations (reads and writes).
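
For a sense of scale, the two figures above can be multiplied out. This is only arithmetic on the approximate numbers quoted in the report; the variable names are illustrative, and no assumption is made about how the operations are spread across nodes or replicas.

```python
# Approximate figures quoted in the report.
requests_per_sec = 43_500
pk_ops_per_request = 5  # "approx. 5", reads and writes combined

# Cluster-wide primary-key operations per second implied by those figures.
pk_ops_per_sec = requests_per_sec * pk_ops_per_request
print(pk_ops_per_sec)
```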

Memory usage:
SELECT node_id, memory_type, used, total, used_pages, total_pages, 
round(used/1024/1024) as "Used (Mb)", 
round(total/1024/1024) as "Total (Mb)", 
round(used*100/total) as "Used (%)" 
FROM ndbinfo.memoryusage;

# node_id, memory_type, used, total, used_pages, total_pages, Used (Mb), Total (Mb), Used (%)
3, Data memory, 62905974784, 98784247808, 1919738, 3014656, 59992, 94208, 64
3, Index memory, 7361347584, 11813257216, 898602, 1442048, 7020, 11266, 62
4, Data memory, 65079115776, 98784247808, 1986057, 3014656, 62064, 94208, 66
4, Index memory, 7344349184, 11813257216, 896527, 1442048, 7004, 11266, 62
5, Data memory, 62899322880, 98784247808, 1919535, 3014656, 59985, 94208, 64
5, Index memory, 7361781760, 11813257216, 898655, 1442048, 7021, 11266, 62
6, Data memory, 62910038016, 98784247808, 1919862, 3014656, 59996, 94208, 64
6, Index memory, 7361691648, 11813257216, 898644, 1442048, 7021, 11266, 62
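
The rows above can be cross-checked with a short script. This is a sketch that only re-derives numbers from the pasted output: it recomputes the "Used (%)" column and divides total bytes by total pages to get the page size each memory type implies. The names `ROWS` and `summarize` are made up for this illustration.

```python
# Rows copied from the ndbinfo.memoryusage output in the report:
# (node_id, memory_type, used_bytes, total_bytes, used_pages, total_pages)
ROWS = [
    (3, "Data memory", 62905974784, 98784247808, 1919738, 3014656),
    (3, "Index memory", 7361347584, 11813257216, 898602, 1442048),
    (4, "Data memory", 65079115776, 98784247808, 1986057, 3014656),
    (4, "Index memory", 7344349184, 11813257216, 896527, 1442048),
    (5, "Data memory", 62899322880, 98784247808, 1919535, 3014656),
    (5, "Index memory", 7361781760, 11813257216, 898655, 1442048),
    (6, "Data memory", 62910038016, 98784247808, 1919862, 3014656),
    (6, "Index memory", 7361691648, 11813257216, 898644, 1442048),
]


def summarize(row):
    """Recompute the usage percentage and the implied page size for one row."""
    node_id, memory_type, used, total, used_pages, total_pages = row
    return {
        "node_id": node_id,
        "memory_type": memory_type,
        "used_pct": round(used * 100 / total),
        "page_bytes": total // total_pages,
    }


for row in ROWS:
    print(summarize(row))
```

Usage is nearly identical on all four nodes, with node 4 (the crashing one) slightly ahead on data memory, so the table by itself does not point to memory exhaustion on that node.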

All data node servers have identical hardware.
Each server has 128GB RAM.
The data node memory configuration is:

[tcp default]
SendBufferMemory=16M
ReceiveBufferMemory=16M

[ndbd default]
LockPagesInMainMemory=1
DataMemory=92G
IndexMemory=11G
TransactionBufferMemory=8M
LongMessageBuffer=64M
SharedGlobalMemory=512M
DiskPageBufferMemory=64M

So I think there should be no swapping.
On the healthy data nodes ndbmtd uses 117G of RAM (VIRT = RES = 117G),
but on the bad data node, after restart, it uses 117G RES and 122G VIRT. How is that possible? Maybe swapping is what causes the watchdog timeout.
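
One way to test the swapping hypothesis is to add up the configured memory and compare it with the kernel's per-process counters. The sketch below does both, under stated assumptions: the sum covers only the parameters listed in the report, so it is a lower bound on what ndbmtd allocates, and the `VmSize`, `VmRSS` and `VmSwap` fields are read from `/proc/<pid>/status` as exposed by Linux. The helper names are hypothetical. A VIRT value larger than RES does not by itself indicate swapping, since virtual size also counts address space that is mapped but not resident; a non-zero `VmSwap` would be the more direct sign.

```python
# Memory-related settings from the [ndbd default] section, in MiB.
CONFIG_MIB = {
    "DataMemory": 92 * 1024,
    "IndexMemory": 11 * 1024,
    "TransactionBufferMemory": 8,
    "LongMessageBuffer": 64,
    "SharedGlobalMemory": 512,
    "DiskPageBufferMemory": 64,
}

# Lower bound only: other buffers not listed in the report are not counted.
configured_mib = sum(CONFIG_MIB.values())


def parse_vm_fields(status_text):
    """Return VmSize, VmRSS and VmSwap (in kB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS", "VmSwap"}
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in wanted:
            # Values look like "  1234567 kB"; keep the number only.
            fields[key] = int(rest.split()[0])
    return fields


def read_vm_fields(pid):
    """Read the counters for a running process, e.g. the ndbmtd data node."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_fields(f.read())


print(f"configured: {configured_mib} MiB ({configured_mib / 1024:.1f} GiB)")
```

Calling `read_vm_fields(pid)` with the pid of the ndbmtd process on the bad node and on a healthy node would show whether the extra virtual size is actually backed by swap.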

How to repeat:
Leave cluster under high load for a while

Uploaded error report:
sftp.oracle.com:/support/incoming/BUG_71139_ndb_error_report.tar.bz2

Additionally:
We are using the ThreadConfig parameter to bind threads, but HyperThreading is on, so we have 32 logical CPUs and only 16 physical cores.
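
When binding threads on a HyperThreaded machine, it helps to know which logical CPUs share a physical core. The sketch below reads the Linux sysfs file `thread_siblings_list` for each CPU and groups the siblings; the function names are illustrative, and whether the running kernel exposes this file should be checked on the machine itself.

```python
import glob


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0,16" or "0-3,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        # A bare number is a range of one; "a-b" is inclusive on both ends.
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def sibling_groups(base="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    pattern = f"{base}/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)


if __name__ == "__main__":
    for group in sibling_groups():
        print(group)
```

If two busy data node threads are bound to logical CPUs in the same group, they compete for one physical core, which is worth ruling out before attributing the watchdog timeouts to overload.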

Update:
We turned off HyperThreading.
The same data node still crashes, now with a different error, but again caused by a watchdog timeout:

 Time: Monday 16 December 2013 - 20:45:05
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data:
Error object: DBTC (Line: 7783) 0x00000002
Program: ndbmtd
Pid: 2559 thr: 6
Version: mysql-5.6.11 ndb-7.3.2
Trace: /opt/mysql/ndb_4_trace.log.1 [t1..t10]
***EOM***

traces attached

After HyperThreading turned off

Attachment: BUG_71139_ndb_error_report_2.tar.bz2 (application/octet-stream, text), 2.75 MiB.

Hi Sergei,

I hope you are not getting the same error with the latest 7.3/7.4 releases. More than 5 bugs affecting ndbmtd behavior on a large-core system like yours have been fixed since then (among them Bug #16961971, Bug #15907515, Bug #17739131...).

all best
Bogdan Kecman