Bug #94911 ndbcluster shutdown when dropping large table
Submitted: 4 Apr 2019 18:26
Modified: 16 May 2019 22:25
Reporter: Shawn Hogan
Status: No Feedback
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 7.6.8
OS: SUSE
Assigned to: MySQL Verification Team
CPU Architecture: x86

[4 Apr 2019 18:26] Shawn Hogan
Description:
Dropping a large table causes an immediate shutdown of all nodes due to a segmentation fault.  The table is approximately 3.2GB, consisting of 47M rows.  The cluster has 8 data nodes, and the allocated ndbcluster memory on those nodes is kept under 50% usage.

Some of the relevant logs:

Time: Thursday 4 April 2019 - 10:32:04
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: /export/home/pb2/build/sb_0-30900976-1539000237.07/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 2082 thr: 2
Version: mysql-5.7.24 ndb-7.6.8
Trace file name: ndb_11_trace.log.1_t2
Trace file path: /var/mysql-cluster/ndb_11_trace.log.1 [t1..t7]
***EOM***

-------

2019-04-04 10:32:01 [ndbd] INFO     -- DROP_TAB_REQ: tab: 5034, tabLcpStatus set to 3
2019-04-04 10:32:02 [ndbd] INFO     -- findNeighbours from: 5168 old (left: 18 right: 12) new (18 12)
2019-04-04 10:32:02 [ndbd] INFO     -- findNeighbours from: 5168 old (left: 18 right: 12) new (17 12)
2019-04-04 10:32:02 [ndbd] INFO     -- findNeighbours from: 5168 old (left: 17 right: 12) new (17 13)
2019-04-04 10:32:03 [ndbd] INFO     -- findNeighbours from: 5168 old (left: 17 right: 13) new (17 13)
2019-04-04 10:32:04 [ndbd] INFO     -- Received signal 11. Running error handler.
2019-04-04 10:32:04 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: JobHandling in block: 247, gsn: 353 elapsed=100
2019-04-04 10:32:04 [ndbd] INFO     -- Watchdog: User time: 242054512  System time: 295956603

...lots of similar lines... clipped because of limitations of bug report...

2019-04-04 10:32:07 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: JobHandling in block: 247, gsn: 353 elapsed=3206
2019-04-04 10:32:07 [ndbd] INFO     -- Watchdog: User time: 242054517  System time: 295956606
2019-04-04 10:32:07 [ndbd] INFO     -- Signal 11 received; Segmentation fault
2019-04-04 10:32:07 [ndbd] INFO     -- /export/home/pb2/build/sb_0-30900976-1539000237.07/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/ndbd.cpp
2019-04-04 10:32:07 [ndbd] INFO     -- Watchdog is killing system the hard way
2019-04-04 10:32:07 [ndbd] ALERT    -- Node 11: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
Make a large table and drop it.
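
For illustration only (the table and column names below are placeholders, not the real schema, and on NDB the data load itself would have to be done in batches):

CREATE TABLE t_big (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255)
) ENGINE=NDBCLUSTER;
-- load tens of millions of rows (LOAD DATA or batched INSERTs), then:
DROP TABLE t_big;   -- this single statement is what brought the data nodes down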

Suggested fix:
Could probably write something that loops through and deletes records in batches of 10,000, then drops the table.  Not sure, I haven't had time to try that yet.  Been working on bringing the cluster back online.
[5 Apr 2019 16:52] MySQL Verification Team
Hi Shawn,

ndbcluster is an in-memory system and all transactions need to happen in memory. I doubt you have configured the system to handle a 3.5G transaction (which is what happens when you delete 3.5G of data in a single statement), so that's why you have a crash. Yes, you are right, the proper way to do it is to delete smaller chunks, but even that you have to do in separate transactions: if you do begin, delete, delete, delete... commit, you will crash again! You have to do begin, delete, commit, begin, delete, commit...
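
A minimal sketch of that pattern (the table name below is just a placeholder; with autocommit enabled, each DELETE runs and commits as its own small transaction):

SET autocommit = 1;
-- repeat until the statement reports 0 rows affected
DELETE FROM big_table LIMIT 10000;
-- once the table is empty, drop it
DROP TABLE big_table;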

Now, with regards to the reported issue, I'm a bit concerned about the actual error reported, so I might log this as a bug in that respect (the crash error you see in the log should say that you don't have a proper configuration rather than just segfault), but for further analysis of this report I will need a full ndb_error_reporter archive or at least your config file.

kind regards
Bogdan
[5 Apr 2019 20:42] Shawn Hogan
Got the cluster back online and it happened again, except this time it was caused by an ALTER TABLE on a small table (4 columns, zero records).

At that point I brought the cluster back online again, then upgraded to 7.6.9 in case it was some bug that had been fixed.  I also went ahead and did --initial restarts of the data nodes for good measure.  It *seemed* to be okay for a while... it was even fine dropping the 3.2GB table as well as another 1.6GB one (figured it was time to clean up and purge some stuff).

A few hours later the entire cluster came down again, although I'm not sure exactly what triggered it that time (maybe a process that TRUNCATEs a small table, fewer than 50 small records, once an hour).

Brought the cluster back online again and started digging through logs.  It appears it was the LCP watchdog process that killed all data nodes at the same instant.  I'm not entirely sure of the internals of how that works, but I assume it's a per-node watchdog.  If that's the case, I have a hard time believing that all 8 data nodes had I/O problems concurrently at the same second of the day.

Each node has 6x Seagate Savvio drives in a hardware RAID-6.  Not SSDs, but enterprise class drives:  https://www.seagate.com/support/internal-hard-drives/enterprise-hard-drives/savvio-10k/#sp...

Checked all 48 drives to see if any were reporting errors at the hardware level and all were fine.  Did a real-world test of reading large files from one node and writing to another over the network (everything is interconnected with 56Gbit InfiniBand) and was getting throughput of around 1.2GB/sec.

Watching the servers' I/O processes, I don't see anything other than ndbmtd doing its normal stuff (LCP writing accounts for most of it).

After all 8 data nodes were killed by the LCP watchdog 3x in a single day, I ended up disabling the watchdog (LcpScanProgressTimeout=0 now).

Nothing has changed with the servers, config, or load, and everything has been humming along fine for years.  The servers are under light load.  Each server has 2 CPU sockets, each with an 8-core Xeon.  lscpu output:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Stepping:              7
CPU MHz:               1239.574
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5401.54
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Server load never goes over 2.00, and typically hovers between 0.50 and 1.25.
[5 Apr 2019 20:47] Shawn Hogan
Current config file (the only thing I changed since the crashes was adding the LcpScanProgressTimeout=0 option).  I left the commented-out stuff in... some of it is from temporary changes to the redo logs when doing backup restores or loading a ton of data at once.

------
[tcp default]
SendBufferMemory=8M
ReceiveBufferMemory=8M

[ndbd default]
NoOfReplicas= 2
datadir=/var/mysql-cluster

DataMemory=100G
#IndexMemory=10G

MaxNoOfFiredTriggers=12000
TransactionDeadlockDetectionTimeout=10000

##TimeBetweenLocalCheckpoints = 10

TimeBetweenGlobalCheckpoints=10000

### Params for REDO LOG
#FragmentLogFileSize=256M
#InitFragmentLogFiles=SPARSE
#NoOfFragmentLogFiles=756
#RedoBuffer=48M

FragmentLogFileSize=64M
InitFragmentLogFiles=SPARSE

NoOfFragmentLogFiles=64
#NoOfFragmentLogFiles=256
#TimeBetweenLocalCheckpoints=10
RedoBuffer=64M

## MySQL Cluster 7.3
## DiskCheckpointSpeed=20M
## DiskCheckpointSpeedInRestart=250M

## MySQL Cluster 7.4
MinDiskWriteSpeed=40M
MaxDiskWriteSpeed=80M
MaxDiskWriteSpeedOtherNodeRestart=200M
MaxDiskWriteSpeedOwnRestart=400M

## watchdog disabled
LcpScanProgressTimeout=0

ODirect=1
MaxNoOfConcurrentScans=500
MaxNoOfLocalScans=100000
LongMessageBuffer=32M
MaxNoOfExecutionThreads=8
LockPagesInMainMemory=1
MaxNoOfConcurrentOperations=1000000
MaxNoOfConcurrentTransactions=65536
MaxNoOfTables=4096
## MaxAllocate=64M

MaxNoOfOrderedIndexes=4096
MaxNoOfUniqueHashIndexes=2048
MaxNoOfAttributes=40480
MaxNoOfTriggers=10240
BuildIndexThreads=4

## RealTimeScheduler=1
[5 Apr 2019 20:48] Shawn Hogan
Oh, forgot to mention that each node has 256GB RAM, with more than 100GB free (even after allocating 100GB to data nodes).
[11 Apr 2019 19:54] MySQL Verification Team
Hi,

An ndb_error_reporter archive would give us some additional info, especially for that crash on the ALTER of a miniature table.

I believe your cluster is not properly configured. For a start, SharedGlobalMemory is at its default, and that's small for a cluster of this size.

https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-s...
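
For example, the parameter goes in the [ndbd default] section of config.ini (the value below is purely illustrative, not a tuned recommendation for this cluster):

[ndbd default]
# default is 128M
SharedGlobalMemory=1G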

and I really think you would benefit from a session with the MySQL Cluster Support team.

all best
Bogdan
[13 Apr 2019 2:31] Shawn Hogan
Is 128M of SharedGlobalMemory small on a cluster with no disk data, tablespaces, etc.?

I figured the default was a good starting point, since the documentation seems to imply it's mostly for features we don't use.

mysql> SELECT * FROM information_schema.TABLESPACES;
Empty set (0.00 sec)

We also don't have any log file groups defined.  The only thing in information_schema.FILES is the default stuff (the 23 records in there are all for ENGINE InnoDB, which we don't use for anything beyond whatever mysql sets up by default).

mysql> SELECT * FROM information_schema.FILES WHERE ENGINE != 'InnoDB';
Empty set (0.00 sec)

Truth be told, ndbcluster does what it does pretty well (when the watchdog isn't killing all the data nodes at once... haha), so we've stayed away from trying to use disk tables with it.  We do have about 30GB of rarely used/accessed binary data stored in the cluster that we would love to store on disk, but it is what it is... it hasn't been a priority to try to make ndbcluster do what it really wasn't designed for.

As far as the error logs and trace files go, I'll see about getting them.
[17 May 2019 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".