Bug #71139 Continuous node shutdown caused by error 6050
Submitted: 13 Dec 2013 7:26 Modified: 8 Mar 2016 8:25
Reporter: Сергей Кукуев
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 7.3.2 OS: Linux (2.6.39-200.24.1.el6uek.x86_64)
Assigned to: MySQL Verification Team CPU Architecture: Any
Tags: error 6050, massive overload, node crash, watchdog

[13 Dec 2013 7:26] Сергей Кукуев
Description:
The same data node repeatedly shuts down with the same error in the log:

2013-12-05 00:03:24 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

even during restart:

2013-12-05 00:03:24 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
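
To see how often each node is hit, the ALERT lines can be tallied by node and error code. This is a minimal sketch, assuming log lines in exactly the format quoted above; the function name and the regular expression are illustrative and not part of any NDB tool.

```python
import re
from collections import Counter

# Matches lines such as:
# 2013-12-05 00:03:24 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 6050: '...'
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\w+\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Caused by error (?P<code>\d+):"
)

def count_forced_shutdowns(lines):
    """Count forced shutdowns per (node_id, error_code) pair."""
    counts = Counter()
    for line in lines:
        m = SHUTDOWN_RE.match(line.strip())
        if m:
            counts[(int(m["node"]), int(m["code"]))] += 1
    return counts
```

Feeding it the lines of the log file (for example `count_forced_shutdowns(open(path))`, with `path` pointing at the log) gives one counter entry per node and error code.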

We have 4 data nodes (2x2) sharing servers with 4 NDB API applications (one per data node).

Apps located next to the working data nodes work fine; the app located next to the crashing node writes lots of errors to its log:
Commit transaction error : 7/899/-1 , Rowid already allocated

The cluster runs under a load of ~43500 requests per second.
Each request includes approx. 5 PK operations (reads and writes).

Memory usage:
SELECT node_id, memory_type, used, total, used_pages, total_pages, 
round(used/1024/1024) as "Used (Mb)", 
round(total/1024/1024) as "Total (Mb)", 
round(used*100/total) as "Used (%)" 
FROM ndbinfo.memoryusage;

# node_id, memory_type, used, total, used_pages, total_pages, Used (Mb), Total (Mb), Used (%)
3, Data memory, 62905974784, 98784247808, 1919738, 3014656, 59992, 94208, 64
3, Index memory, 7361347584, 11813257216, 898602, 1442048, 7020, 11266, 62
4, Data memory, 65079115776, 98784247808, 1986057, 3014656, 62064, 94208, 66
4, Index memory, 7344349184, 11813257216, 896527, 1442048, 7004, 11266, 62
5, Data memory, 62899322880, 98784247808, 1919535, 3014656, 59985, 94208, 64
5, Index memory, 7361781760, 11813257216, 898655, 1442048, 7021, 11266, 62
6, Data memory, 62910038016, 98784247808, 1919862, 3014656, 59996, 94208, 64
6, Index memory, 7361691648, 11813257216, 898644, 1442048, 7021, 11266, 62
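
The derived columns can be re-checked directly from the raw numbers. The sketch below copies the rows from the output above and recomputes the megabyte and percentage columns, plus the page size implied by `total / total_pages`; the helper name `summarize` is illustrative.

```python
# Rows copied from the ndbinfo.memoryusage output above:
# (node_id, memory_type, used_bytes, total_bytes, used_pages, total_pages)
ROWS = [
    (3, "Data memory", 62905974784, 98784247808, 1919738, 3014656),
    (3, "Index memory", 7361347584, 11813257216, 898602, 1442048),
    (4, "Data memory", 65079115776, 98784247808, 1986057, 3014656),
    (4, "Index memory", 7344349184, 11813257216, 896527, 1442048),
    (5, "Data memory", 62899322880, 98784247808, 1919535, 3014656),
    (5, "Index memory", 7361781760, 11813257216, 898655, 1442048),
    (6, "Data memory", 62910038016, 98784247808, 1919862, 3014656),
    (6, "Index memory", 7361691648, 11813257216, 898644, 1442048),
]

MIB = 1024 * 1024

def summarize(rows):
    """Recompute used MB, used % and the implied page size for each row."""
    summary = []
    for node_id, mem_type, used, total, used_pages, total_pages in rows:
        summary.append({
            "node_id": node_id,
            "memory_type": mem_type,
            "used_mb": round(used / MIB),
            "used_pct": round(used * 100 / total),
            "page_bytes": total // total_pages,  # bytes per page implied by the totals
        })
    return summary

for row in summarize(ROWS):
    print(row)
```

Node 4 sits about two percentage points above the other nodes in data memory, so memory fill level alone does not single it out.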

All data node servers have exactly the same hardware.
Each server has 128GB RAM.
The data node memory configuration is:

[tcp default]
SendBufferMemory=16M
ReceiveBufferMemory=16M

[ndbd default]
LockPagesInMainMemory=1
DataMemory=92G
IndexMemory=11G
TransactionBufferMemory=8M
LongMessageBuffer=64M
SharedGlobalMemory=512M
DiskPageBufferMemory=64M

So I think there should be no swapping.
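
The reasoning can be checked by adding up the [ndbd default] memory parameters listed above. This is a rough sketch: `parse_size` and `configured_bytes` are illustrative helpers, and the sum is only a lower bound, since the data node allocates memory beyond these six parameters.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(value):
    """Convert a config size such as '92G' or '64M' to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)

# Values copied from the [ndbd default] section above.
NDBD_MEMORY = {
    "DataMemory": "92G",
    "IndexMemory": "11G",
    "TransactionBufferMemory": "8M",
    "LongMessageBuffer": "64M",
    "SharedGlobalMemory": "512M",
    "DiskPageBufferMemory": "64M",
}

def configured_bytes(settings):
    """Sum all configured sizes in bytes."""
    return sum(parse_size(v) for v in settings.values())

total = configured_bytes(NDBD_MEMORY)
print(f"{total / 1024**3:.2f} GiB configured")  # 103.63 GiB configured
```

The listed parameters add up to roughly 103.6 GiB, below the 128GB of RAM per server, although each server also hosts an NDB API application.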
On the OK data nodes ndbmtd uses 117G RAM (VIRT = RES = 117G),
but the BAD data node uses 117G RES and 122G VIRT after restart. How is that possible? Maybe this swapping issue causes the watchdog timeout.
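
VIRT being larger than RES does not by itself show that the process is swapped out; on Linux the `VmSwap` field in `/proc/<pid>/status` reports how much of a process actually sits in swap. A minimal sketch for reading it follows; the function names are illustrative, and it assumes a kernel that exposes `VmSwap` in that file.

```python
import re

def parse_proc_status(text):
    """Extract VmSize, VmRSS and VmSwap (in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^(VmSize|VmRSS|VmSwap):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def read_process_memory(pid):
    """Read the memory fields of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status(f.read())
```

Calling `read_process_memory(pid)` with the ndbmtd process id on the BAD node and on an OK node would show whether `VmSwap` is non-zero on either of them.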

How to repeat:
Leave the cluster under high load for a while.
[13 Dec 2013 7:36] Сергей Кукуев
Uploaded error report:
sftp.oracle.com:/support/incoming/BUG_71139_ndb_error_report.tar.bz2
[13 Dec 2013 10:06] Сергей Кукуев
Additionally:
We are using the ThreadConfig parameter to bind threads, but HyperThreading is on, so we have 32 logical CPUs and only 16 physical cores.
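
When binding threads with HyperThreading enabled, it helps to know which logical CPUs share a physical core. On Linux this is exposed in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`. The sketch below groups logical CPUs by core; the helper names are illustrative, and the actual sibling numbering on this hardware is not known from the report.

```python
import glob
import re

def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0,16' or '0-3,8' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def physical_cores(siblings_by_cpu):
    """Group logical CPUs into cores; input maps cpu id -> siblings list text."""
    return sorted({tuple(parse_cpu_list(s)) for s in siblings_by_cpu.values()})

def read_siblings():
    """Read the sibling list of every logical CPU from sysfs (Linux only)."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result
```

`physical_cores(read_siblings())` returns one tuple per physical core, which makes it easier to avoid binding two busy threads to logical CPUs of the same core.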
[17 Dec 2013 6:55] Сергей Кукуев
Update:
We turned off HyperThreading.
The same data node still crashes, now with a different error, but again caused by a watchdog timeout:

 Time: Monday 16 December 2013 - 20:45:05
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data:
Error object: DBTC (Line: 7783) 0x00000002
Program: ndbmtd
Pid: 2559 thr: 6
Version: mysql-5.6.11 ndb-7.3.2
Trace: /opt/mysql/ndb_4_trace.log.1 [t1..t10]
***EOM***

traces attached
[17 Dec 2013 6:56] Сергей Кукуев
After HyperThreading turned off

Attachment: BUG_71139_ndb_error_report_2.tar.bz2 (application/octet-stream, text), 2.75 MiB.

[8 Mar 2016 8:25] MySQL Verification Team
Hi Sergei,

I hope you are not getting the same error with the latest 7.3/7.4 releases. More than 5 bugs affecting ndbmtd behavior on a large-core system like yours have been fixed since then (among them Bug #16961971, Bug #15907515, Bug #17739131...).

All the best,
Bogdan Kecman