Bug #86537  Invalid memory access: ptr
Submitted: 1 Jun 2017 8:52  Modified: 12 Jun 2017 14:34
Reporter: Global Incubator
Status: Can't repeat  Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S1 (Critical)
Version: 7.5.5  OS: Debian (7)
Assigned to: MySQL Verification Team  CPU Architecture: Any
Tags: MySQL Cluster

[1 Jun 2017 8:52] Global Incubator
Description:
Our Architecture:

MySQL Cluster v7.5.5
2 MySQL API Nodes
4 Data NDB (2 redundants)
256Gb RAM
40 Threads
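
For reference, a minimal config.ini sketch of this layout; the hostnames, node IDs and the mapping of "40 Threads" to MaxNoOfExecutionThreads are illustrative assumptions only, not our actual configuration:

[ndbd default]
NoOfReplicas=2                  # "2 redundants": each fragment kept on two data nodes
DataMemory=180G                 # values in use at the time of this report
IndexMemory=40G
MaxNoOfExecutionThreads=40      # assumed mapping of the 40 threads per data node

[ndb_mgmd]
NodeId=1
HostName=mgm1.example.local     # hypothetical hostname

[ndbd]
NodeId=2
HostName=ndb1.example.local     # 4 data nodes = 2 node groups of 2 replicas

[ndbd]
NodeId=3
HostName=ndb2.example.local

[ndbd]
NodeId=4
HostName=ndb3.example.local

[ndbd]
NodeId=5
HostName=ndb4.example.local

[mysqld]
NodeId=6
HostName=api1.example.local     # 2 MySQL API (SQL) nodes

[mysqld]
NodeId=7
HostName=api2.example.local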

Over the past 2 days, different nodes have given us this error. After the error, the node dies and its replica dies as well. This shuts down the whole cluster, and it takes us about 2 hours to start it again.

Find log attached on the next comment.

Error:

Time: Thursday 1 June 2017 - 10:07:46
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Invalid memory access: ptr (80aad8b2 0x7f23e4a5e2d8) magic: (00000000 00000068) memroot: 0x7f21e1fa8000 page: 68
Error object: DBSPJ (Line: 47) 0x00000002
Program: ndbmtd
Pid: 15364 thr: 14
Version: mysql-5.7.17 ndb-7.5.5
Trace file name: ndb_2_trace.log.5_t14
Trace file path: /data/mysql-cluster//ndb_2_trace.log.5 [t1..t26]
***EOM***

How to repeat:
We don't know how to repeat it. The only new thing is that our traffic has increased by 100 over the past week.
[1 Jun 2017 8:59] Global Incubator
Link to download error log: https://mega.nz/#!cmI3GDiZ!S3-rcnhUhfX_0xIUFIlVkx_D2IL2Gs8CPHCd2HuXO5E

This log was created with the command "ndb_error_reporter".
[7 Jun 2017 13:23] MySQL Verification Team
Hi,

Thanks for your report. I can say it's not a hardware error (the first thing that crossed my mind), but I'm not sure exactly what's going on. It looks like a misconfiguration to me for now, but I have to look into it more.

Do you collect any usage statistics for your servers? Having memory, CPU, and I/O usage stats for the period before and during the crash would give us more insight.

What I see is that RT_SPJ_ARENA_BLOCK is the failing resource. This is an internal join buffer. Are you running any large join queries on your system?
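
If you have access to the ndbinfo database on one of the SQL nodes, a sketch like the one below would let you watch memory and internal resource usage around the time of the crashes (I am assuming here that the SPJ arena is accounted under the QUERY_MEMORY resource):

-- Per-node data/index memory usage
SELECT node_id, memory_type, used, total
  FROM ndbinfo.memoryusage;

-- Per-node internal resource usage; the QUERY_MEMORY row is assumed to be
-- the pool the SPJ (pushed-join) arena allocates from
SELECT node_id, resource_name, reserved, used
  FROM ndbinfo.resources
 ORDER BY node_id, resource_name;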

all best
Bogdan

P.S. With regard to the "urgency" you mentioned, having a MySQL Support subscription would surely help you a great deal with that, allowing you to properly size your setup.
[12 Jun 2017 9:08] Global Incubator
Hi Bogdan,

We don't have exact usage stats for the period of the crash; we only have some graphs of statistics.

CPU: http://es.zimagez.com/full/78c44bb5b184d621817defaa1cf6b7542539cb58dbbce966d539fe042db697e...
RAM: http://es.zimagez.com/full/9369cc49e6a646bb817defaa1cf6b75433797024f0af9d6a0269dd2909610bd...
SWAP: http://es.zimagez.com/full/b8d1969b4b106e5b817defaa1cf6b7549bef1806a6b53c22d057b2321de7461...

The crashes were on 31 May, 1 June, and 5 June.

Yes, we are running multiple large join queries. After the last crash we increased the "SharedGlobalMemory" variable from 20M to 2G.

We have also decreased DataMemory from 180G to 150G and IndexMemory from 40G to 30G, although we have never used swap.
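
For clarity, this is roughly how the changed parameters now look in our config.ini (sketch of the [ndbd default] section only, everything else omitted):

[ndbd default]
SharedGlobalMemory=2G    # increased from 20M after the last crash
DataMemory=150G          # decreased from 180G
IndexMemory=30G          # decreased from 40G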

Thanks for your comment.
[12 Jun 2017 12:10] MySQL Verification Team
Hi,

> graphs of statistics.

The URLs you sent don't work :(

> Yes, we are running multiple large join queries

That is your problem!

> increased the "SharedGlobalMemory" variable from 20M to 2G.

That should help.

I don't see this as a bug; it looks like improper configuration/sizing of the cluster. To help with that, you should contact the MySQL Support team; they can help you properly size and configure your system and optimize your queries to use your MySQL Cluster setup properly (rewriting your queries to better suit a cluster environment).

best regards
Bogdan
[12 Jun 2017 12:40] Global Incubator
Hi Bogdan,

Sorry, I have uploaded the files again.

CPU: http://oi63.tinypic.com/wtwp3b.jpg
RAM: http://oi65.tinypic.com/90yio7.jpg
SWAP: http://oi68.tinypic.com/ykncp.jpg

Although we have queries with large joins, the cluster should just get slower, not crash. At least it should not crash multiple nodes in cascade :S, which is why I think this is a bug. The node statistics never show congested nodes.

Also, the large queries are joined by index, and those queries take roughly 2 seconds on average.

Best regards
[12 Jun 2017 14:34] MySQL Verification Team
Hi,

The graphs look ok. 

> Although we have queries with large joins,
> the cluster should just get slower, not crash.

That's actually not how the ndbcluster storage engine is designed. MySQL Cluster is designed to be a real-time database, so instead of letting things become slow, the system will in many cases intentionally crash.

> At least it should not crash multiple nodes in cascade :S, which is why I think this is a bug.

Yes, it should not crash the whole system, but nothing in the logs shows "what happened" except that it crashed within a join.

If you can provide more info (the query that crashes it), we might look into it, but otherwise reconfiguring the cluster is the only way forward. How to best configure it is not something we should discuss in the bug system.

> Also, the large queries are joined by index

That is not really important, since with joins you always need to search multiple nodes for data. Multiple optimizations are involved, especially in 7.4+, to make this faster.
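
As a quick sketch (the table and column names below are made up), you can at least verify whether those joins are being pushed down to the data nodes:

-- Join pushdown should be enabled on the SQL nodes (it is ON by default)
SHOW VARIABLES LIKE 'ndb_join_pushdown';

-- For one of the large joins, pushed children show up in EXPLAIN's Extra
-- column as "Child of '<table>' in pushed join@1"
EXPLAIN
SELECT o.id, c.name                     -- hypothetical tables/columns
  FROM orders o
  JOIN customers c ON c.id = o.customer_id;

-- Counters for how many queries were defined/executed as pushed joins
SHOW GLOBAL STATUS LIKE 'Ndb_pushed%';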

Knowing the query that crashed might help us reproduce the problem.

all best
Bogdan