Bug #64609 NDB crashes and does not recover under load
Submitted: 10 Mar 2012 10:17 Modified: 22 Aug 2016 12:58
Reporter: Avi Bluestein Email Updates:
Status: No Feedback Impact on me:
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version: mysql-5.5.19 ndb-7.2.4 OS:Linux (CentOS 6)
Assigned to: Bogdan Kecman CPU Architecture:Any
Tags: crash, MySQL Cluster, ndb

[10 Mar 2012 10:17] Avi Bluestein
Under load, and for no apparent reason, NDB crashed and the server reported error 157.
NDB did not recover.

The error file contains the following:

Current byte-offset of file-pointer is: 568

Time: Saturday 10 March 2012 - 11:43:12
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1591) 0x00000002
Program: ndbmtd
Pid: 3249 thr: 2
Version: mysql-5.5.19 ndb-7.2.4
Trace: /home/newsbridge/database/ndb_data/ndb_2_trace.log.1 [t1..t7]

How to repeat:
It happens randomly under load, so there is no reliable way to reproduce it. We can generate the load with our system, of course.
[13 Mar 2012 17:23] Jonas Oreland

Looking at the logs... it doesn't really look load-related (but I could be wrong).
It is an insert on the table with id = 10 that is failing.

Can you run "ndb_show_tables" and give us the table definition (CREATE TABLE statement) of the table with id = 10?

[13 Mar 2012 17:56] Jonas Oreland
My first guess is that this is a charset-related issue.
[14 Mar 2012 6:22] Jonas Oreland
Setting state to "Need Feedback".
[14 Mar 2012 23:25] Avi Bluestein
Thank you.

I wish it were a matter of charset, but IMHO it is not, as this crash happens in random places and is load-dependent.

We've been able to overcome the NDB crashes by increasing various parameters, such as the redo log size, the number of concurrent transactions, and various buffers.
Additionally, and most importantly, we had no choice but to implement a capping mechanism that stalls the system after every predefined number of transactions and measured TPS, to allow NDB to flush the data to disk.
This seems to have dealt with the crashes, although it is neither normal nor reasonable, and the solution is far from perfect, given that it stalls the system for 30-60 seconds every once in a while.
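For illustration, the capping workaround described above could look roughly like the sketch below: a counter that forces a stall after every N committed transactions. The class name, threshold, and stall duration are all hypothetical; the reporter's actual implementation was not shared.

```python
import time


class TransactionThrottle:
    """Stall after every `cap` transactions so NDB can flush data to disk.

    A minimal sketch of the capping mechanism described above; the
    cap and stall duration are assumed values, not the reporter's.
    """

    def __init__(self, cap=10_000, stall_seconds=30):
        self.cap = cap                  # transactions allowed between stalls
        self.stall_seconds = stall_seconds
        self.count = 0

    def record(self, sleep=time.sleep):
        """Call once per committed transaction; returns True when it stalled."""
        self.count += 1
        if self.count >= self.cap:
            self.count = 0
            sleep(self.stall_seconds)   # give the data nodes time to flush
            return True
        return False
```

The `sleep` parameter is injectable only to make the sketch easy to test; in production it would simply block the load generator for the configured interval.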

Unfortunately we do not have the data or the original table structure, as we've deleted the whole DB and reconstructed it.

But now comes the 'juicy' part.
Just when we thought it was all behind us, mysqld started crashing badly, entering a wild loop of recovery-crash-recovery-crash...
This somehow seems to be load-related and, again, random, but it always happens eventually, and when it does, it never recovers.

Here's the info from the log (the crash repeats itself with the same log):

120312  8:45:16  InnoDB: Waiting for the background threads to start
120312  8:45:17 InnoDB: 1.1.8 started; log sequence number 2390373
120312  8:45:17 [Warning] 'user' entry 'root@db-1.localdomain' ignored in --skip-name-resolve mode.
120312  8:45:17 [Warning] 'user' entry '@db-1.localdomain' ignored in --skip-name-resolve mode.
120312  8:45:17 [Warning] 'proxies_priv' entry '@ root@db-1.localdomain' ignored in --skip-name-resolve mode.
120312  8:45:17 [Note] Event Scheduler: Loaded 0 events
120312  8:45:17 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.19-ndb-7.2.4-gpl'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Cluster Community Server (GPL)
120312  8:45:18 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema
120312  8:45:18 [Note] NDB Binlog: logging ./mysql/ndb_schema (UPDATED,USE_WRITE)
120312  8:45:18 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status
120312  8:45:18 [Note] NDB Binlog: logging ./mysql/ndb_apply_status (UPDATED,USE_WRITE)
2012-03-12 08:45:18 [NdbApi] INFO     -- Flushing incomplete GCI:s < 14092/13
2012-03-12 08:45:18 [NdbApi] INFO     -- Flushing incomplete GCI:s < 14092/13
120312  8:45:18 [Note] NDB Binlog: starting log at epoch 14092/13
120312  8:45:18 [Note] NDB Binlog: ndb tables writable
120312  8:45:19 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 445676 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2cef180
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f2c486e9e68 thread_stack 0x40000

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f2c00004b80): is an invalid pointer
Connection ID (thread ID): 5

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
mysqld: /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/mysys/my_new.cc:52: int __cxa_pure_virtual(): Assertion `! "Aborted: pure virtual method called."' failed.
Fatal signal 6 while backtracing
[22 Jul 2016 12:58] Bogdan Kecman
Coming to this issue some years later, I don't see nearly enough data here to conclude what the problem is. I assume, since you are no longer updating this bug, that the issue was somehow solved, but if you still see the issue with the latest binaries, please let us know and upload full logs, both mysqld and cluster.

take care
Bogdan Kecman
[23 Aug 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".