Bug #82157: error 20008 'Query aborted due to out of query memory' from NDBCLUSTER (HY000)
Submitted: 8 Jul 2016 2:20
Modified: 26 Aug 2016 10:36
Reporter: alain cocconi
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.6.29 ndb-7.4.11
OS: Ubuntu (14.04.2 LTS)
Assigned to: MySQL Verification Team
CPU Architecture: Any

[8 Jul 2016 2:20] alain cocconi
Description:
Hello

I've been running a MySQL Cluster since 7/15/2016 without any big issues.
The cluster handles two large databases (one with up to 14M records, the other with up to 13M).

The cluster is composed of:
 - 2 data nodes: DELL R730 (256G RAM, 2 x CPU E5-2630 v3 @ 2.40GHz, RAID 5 with 6x250 SAS 15k disks)
 - 2 managers: VMs with 4G RAM, 20G virtio disk, 1 vcpu (1 core @ 1GHz)
 - 2 MySQL servers: 8G RAM, 20G virtio disk, 2 vcpu (4 cores @ 1GHz)
 - 2 load balancers (ZenLB)

Since July 4 2016 the error
"error 20008 'Query aborted due to out of query memory' from NDBCLUSTER): HY000"
has been returned to my client servers (Ubuntu 14.04.2 LTS servers with standard MySQL clients and libraries).
While this is happening no queries work at all: every query returns error 20008.

Now this happens roughly every 20 or 21 hours.

When it happened the first time I was on MySQL Cluster 7.4.8, so I upgraded the whole cluster to 7.4.11, but nothing changed.

Thanks for your help.
Regards

How to repeat:
Just let the cluster run for 20 or 21 hours.

Suggested fix:
The only way to return to a stable state is to stop the first data node via the management client, reboot that server, restart the data node so it rejoins the cluster, and once it has finished starting, repeat the same steps with the second one.
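
A minimal sketch of this rolling procedure, assuming the two data nodes have node IDs 2 and 3 (example IDs) and that ndb_mgm can reach the management server:

# stop the first data node from the management client
ndb_mgm -e "2 STOP"
# reboot the data node host, then start the data node process again
sudo reboot
ndbmtd                      # or ndbd, whichever binary the node runs
# wait until the node reports "started" before touching the next one
ndb_mgm -e "ALL STATUS"
# then repeat the same steps for node 3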
[22 Jul 2016 13:28] MySQL Verification Team
Hi,

while we are looking at the provided log files, can you let us know whether you have any monitoring of your nodes? Do you perhaps have MEM installed, or do you at least monitor CPU, RAM, and I/O usage on your data and management nodes?

The "state" you get into should be resolved by a cluster restart, but I know that is not an acceptable solution for a production cluster. What I would like to know is whether you tried it, and whether you did a rolling restart or a full shutdown/start. If you did a rolling restart, how did the clients behave? It is possible that only one node is in trouble, so if you did this rolling restart a few times, was everything up and running OK after only one node had been restarted?

I would also need you to provide some results from the ndbinfo database:

select * from ndbinfo.counters;
select * from ndbinfo.diskpagebuffer;
select * from ndbinfo.memory_per_fragment;
select * from ndbinfo.memoryusage;
select * from ndbinfo.operations_per_fragment;
select * from ndbinfo.resources;

I need this information once right "after restart" and then again after ~20 hours (i.e. just before you expect those errors to start).
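
For example, the snapshots could be captured like this (assuming a mysql client on one of your SQL nodes; the file names are just examples):

mysql -e "select * from ndbinfo.memoryusage" > ndbinfo_memoryusage_after_restart.txt
mysql -e "select * from ndbinfo.resources" > ndbinfo_resources_after_restart.txt
# ...and likewise for the other four tables, then again after ~20 hours:
mysql -e "select * from ndbinfo.memoryusage" > ndbinfo_memoryusage_20h.txt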

Also, if you don't have such monitoring, can you at least extract some SAR data (CPU and RAM usage) from all nodes, once after the restart and again when you start experiencing the issue?
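
For example, assuming sysstat is installed and collecting (Ubuntu keeps its daily logs under /var/log/sysstat, one saDD file per day of the month):

# CPU usage for the 4th of the month (sa04 is an example file name)
sar -u -f /var/log/sysstat/sa04
# RAM usage for the same day
sar -r -f /var/log/sysstat/sa04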

kind regards
Bogdan Kecman
[26 Jul 2016 0:50] alain cocconi
Hi
Yes, I'm monitoring CPU, memory, network, I/O, etc. on both servers, but nothing unusual shows up when I get those errors.
Before rebooting the servers I tried stop, start, and restart via the manager, but with no success: the error was back after 1 or 2 hours of running.
To return to a stable state I stopped using one of the databases in the cluster,
and now everything is OK.
So I'm investigating what was going wrong with that database and I will get back to you.
Thanks
[26 Jul 2016 10:36] MySQL Verification Team
Hi,

Let us know when you finish your investigation, and please get us the data I requested from the ndbinfo database.

take care
Bogdan
[27 Aug 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".