Bug #75762 Table is full error on sql nodes and subsequent crash of entire cluster
Submitted: 4 Feb 2015 12:20    Modified: 3 Jun 2015 15:09
Reporter: Marco Sperandio
Status: Not a Bug    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 7.3.7    OS: Linux (RHEL 5.4 64bit)
Assigned to: MySQL Verification Team    CPU Architecture: Any

[4 Feb 2015 12:20] Marco Sperandio
Description:
Hi,

We're facing an issue after upgrading from 7.1.15a to 7.3.7. For two weeks the whole cluster seemed fine (performance was not great, but stability was good; it only needed some fine tuning).

Yesterday we started a stress test (heavy inserts on ndbcluster tables). After some hours of running, all the SQL nodes started getting ERROR 1114 (HY000): The table '[table name]' is full.

When we hit the problem, around 12% of DataMemory and 32% of IndexMemory were still free.

Additionally, when we tried to create a new table we received this error:
Error Code: 625  Message: Out of memory in Ndb Kernel, hash index part (increase IndexMemory).

But that should not be possible, because at the moment of the error there were around 130,000 free pages in IndexMemory.
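(For anyone who wants to cross-check these figures: a query along these lines against the ndbinfo.memoryusage table on a SQL node should show the same page counts; a minimal sketch:)

-- Cross-check DataMemory/IndexMemory usage per data node via ndbinfo.
SELECT node_id,
       memory_type,
       used,
       total,
       total_pages - used_pages AS free_pages
  FROM ndbinfo.memoryusage
 WHERE memory_type IN ('Data memory', 'Index memory');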

After some hours of these errors, one of the two data nodes went down with this error:

Time: Wednesday 4 February 2015 - 00:05:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 4774) 0x00000002
Program: ndbmtd
Pid: 5827 thr: 3
Version: mysql-5.6.21 ndb-7.3.7
Trace: /var/lib/mysql/ndb_4_trace.log.5 [t1..t11]
***EOM***

Immediately after the crash the SQL nodes stopped giving "table is full" errors, and for around 30 minutes the situation returned to normal. But then the "table is full" errors reappeared, and 53 minutes after the node 1 crash the remaining node went down with the same error.

The crash seems to be related to the ongoing "table is full" condition, but we are trying to work out why ndbcluster reports more free memory (data and index) than the real usage suggests.

We tried investigating the MaxAllocate parameter, but without luck: even after doubling it (32 MB -> 64 MB plus a rolling restart) the error still persists.
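For reference, the change was of this form in config.ini (a sketch; 64M is just the value we tested):

[ndbd default]
# MaxAllocate is the maximum allocation unit NDB uses when growing a
# table's memory; the default is 32M.
MaxAllocate=64M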

We're also investigating hash versus ordered indexes, but we don't use many hash indexes; most of our indexes are of the ordered type.

Additionally: "MySQL Cluster can use a maximum of 512 MB for hash indexes per partition which means in some cases it is possible to get Table is full errors in MySQL client applications even when ndb_mgm -e "ALL REPORT MEMORYUSAGE" shows significant free DataMemory".
But the error on the SQL nodes speaks about "index memory", so this is a bit confusing: are we facing a hard limit of ndbcluster, or an error in the management of memory pages in the NDB kernel?
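To make the hash/ordered distinction concrete: the hash index backing a primary or unique key is what consumes IndexMemory, while ordered (T-tree) indexes live in DataMemory. A hypothetical table (names made up) showing both:

-- The PRIMARY KEY normally gets a hidden hash index (IndexMemory)
-- plus an ordered index (DataMemory); declaring it USING HASH keeps
-- only the hash part. A plain INDEX is an ordered index and therefore
-- uses DataMemory only.
CREATE TABLE t_example (
  id BIGINT NOT NULL,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id) USING HASH,     -- hash index only: IndexMemory
  INDEX idx_created (created_at)   -- ordered index: DataMemory
) ENGINE=NDBCLUSTER;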

At the moment of the crash we had two tables with 40M and 65M records, so we are talking about 110-120 million rows across all NDB tables.

This morning, starting a single data node worked fine; index memory usage reached 74% (DataMemory 89%), and there were no errors when creating an empty table.

Any advice? I've attached the ndb_error_log report.

Best regards
Marco Sperandio

How to repeat:
Fill IndexMemory to around 70% and DataMemory to around 90% with heavy inserts; a sketch of one way to do this follows.
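One rough way to get there (hypothetical table; re-run the self-doubling insert until ndb_mgm -e "ALL REPORT MEMORYUSAGE" shows the target usage):

CREATE TABLE fill_me (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  pad VARCHAR(255) NOT NULL
) ENGINE=NDBCLUSTER;

-- Seed one row, then keep doubling the table:
INSERT INTO fill_me (pad) VALUES (REPEAT('x', 255));
INSERT INTO fill_me (pad) SELECT pad FROM fill_me;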
[4 Feb 2015 16:36] Marco Sperandio
Assuming that the error log is true:
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 4774) 0x00000002

Opening DbaccMain.cpp from the 7.3.7 source:

/* --------------------------------------------------------------------------------- */
/* ALLOC_OVERFLOW_PAGE                                                               */
/*          DESCRIPTION:                                                             */
/* --------------------------------------------------------------------------------- */
void Dbacc::allocOverflowPage(Signal* signal)
{
  tresult = 0;
  if (cfreepages.isEmpty())
  {
    jam();
    zpagesize_error("Dbacc::allocOverflowPage");
    tresult = ZPAGESIZE_ERROR;
    return;
  }//if
  seizePage(signal);
  ndbrequire(tresult <= ZLIMIT_OF_ERROR);
  {
    LocalContainerPageList sparselist(*this, fragrecptr.p->sparsepages);
    sparselist.addLast(spPageptr);
  }
  iopPageptr = spPageptr;
  initOverpage(signal);
}//Dbacc::allocOverflowPage()

==============================

Reading the source code, it seems that we reached a limit related to free pages. But again, there is no trace of IndexMemory or DataMemory overallocation in our checks. Am I missing something?

Still investigating.

regards
Marco
[4 Feb 2015 16:38] Marco Sperandio
I forgot to say that line 4774 is: ndbrequire(tresult <= ZLIMIT_OF_ERROR);

M.
[3 Jun 2015 15:08] MySQL Verification Team
Hi,

"table is full" error can happen from 2 reasons

 1. you do not have enough free allocated RAM (DataMemory or IndexMemory)
 2. you do not have enough fragments for that table (too many rows)

Solving [1] is obvious: you need more RAM. In your case I'm not sure this is the reason for these errors. It could be, though: once you go over 70% memory usage on a system that runs large and/or long transactions, the reported free space may not actually be free (yet), so you can hit a limit without being aware of it.

Solving [2] is not obvious, but it is rather simple: using ALTER TABLE .. MAX_ROWS=.. you increase the number of fragments for a table and overcome this problem, as sketched below. To check how MAX_ROWS actually works, please consult the documentation, or open a ticket with MySQL Support if you need assistance.
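For example (table name and row count are hypothetical; pick a value above the number of rows you expect the table to ever hold):

-- Hint NDB to create more fragments for this table, so the 512 MB
-- per-partition hash index limit quoted above is reached much later.
-- Depending on version this may rebuild the table, so plan accordingly.
ALTER TABLE my_big_table MAX_ROWS = 500000000;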

This behavior looks like a bug but is by design, so I'm closing this request as not a bug.

kind regards
Bogdan Kecman