MySQL Bugs: #68261: ndbmtd, mysqld process fails

Bug #68261: ndbmtd, mysqld process fails
Submitted: 4 Feb 2013 10:44
Modified: 18 May 2013 14:50
Reporter: Mateusz Kamola
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.5.29 ndb-7.2.10
OS: Linux (Debian GNU/Linux wheezy)
CPU Architecture: Any
Assigned to:

Description:
We are having quite a few problems with MySQL Cluster stability. We are running the cluster on two machines:
Machine 1: mgmt, ndbmtd, mysqld
Machine 2: ndbmtd, mysqld

This time the first crash occurred on Machine 2:

Time: Monday 4 February 2013 - 04:04:00
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: /pb2/build/sb_0-7932439-1355951702.81/mysql-cluster-gpl-7.2.10/storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbmtd
Pid: 27555
Version: mysql-5.5.29 ndb-7.2.10

And shortly afterwards, on Machine 1:

Time: Monday 4 February 2013 - 04:04:23
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: LocalProxy.hpp
Error object: DBTC (Line: 211) 0x00000002
Program: ndbmtd
Pid: 24739 thr: 0
Version: mysql-5.5.29 ndb-7.2.10
Trace: /opt/mysql/server-5.5/ndb_data/ndb_2_trace.log.16 [t1..t9]

The crashes usually happen while copying data from InnoDB tables to NDB tables. We run SELECT ... INTO OUTFILE, split the file into chunks of 100,000 rows, and LOAD DATA INFILE each chunk into the NDB tables.

mysqld also crashes from time to time, often during an UPDATE query, as here:
02:13:06 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=1073741824
read_buffer_size=528384
max_used_connections=501
max_threads=500
thread_count=16
connection_count=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2336310 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f87a4006a80
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f8815121e78 thread_stack 0x40000
bin/mysqld(my_print_stacktrace+0x35)[0x842f55]
bin/mysqld(handle_fatal_signal+0x4a4)[0x7119b4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f8a8561c030]
bin/mysqld(_ZN13ha_ndbcluster16exec_bulk_updateEPj+0x15f)[0x9eca0f]
bin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_ordery15enum_duplicatesbPySB_+0x1876)[0x6869a6]
bin/mysqld(_Z21mysql_execute_commandP3THD+0x13d4)[0x6157e4]
bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x188)[0x618db8]
bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x136d)[0x61a14d]
bin/mysqld(_Z24do_handle_one_connectionP3THD+0xcf)[0x6b3b8f]
bin/mysqld(handle_one_connection+0x51)[0x6b3c91]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7f8a85613b50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f8a8459070d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (122c7d80): UPDATE core_profile SET hash = NULL, last_probe_at = '2013-02-01 03:09:21' WHERE id = '40361636'
Connection ID (thread ID): 25496390
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Sometimes, however, it crashes after reaching the max_connections limit (probably because queries try to use a table into which a lot of data is being inserted, and they pile up waiting in the queue).

How to repeat:
Hard to determine. Sometimes it crashes, sometimes it doesn't. It usually happens in similar circumstances, but we haven't managed to isolate the cause.

I have uploaded the data generated by ndb_error_reporter. It's 9 MB, so it was uploaded via FTP.

Experiencing exactly the same issue. Nightmare!

@chris kyte: we have managed to significantly decrease the probability of a crash (it has been running for about 5 days since the last crash; earlier it crashed every night) by loading data in much smaller chunks. We now insert only about 30k rows per LOAD DATA INFILE and it's much more stable.
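A minimal sketch of the chunked load described above, in Python. It assumes the SELECT ... INTO OUTFILE dump has exactly one row per line (no embedded newlines in field values); the function names, the chunk file naming and the table name passed to `load_statement` are hypothetical and not taken from the reporter's setup.

```python
import itertools
from pathlib import Path


def split_into_chunks(dump_path, out_dir, rows_per_chunk=30_000):
    """Split a one-row-per-line dump into files of at most rows_per_chunk rows.

    Returns the list of chunk file paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    # Binary mode keeps the bytes exactly as the server wrote them.
    with open(dump_path, "rb") as src:
        for index in itertools.count():
            lines = list(itertools.islice(src, rows_per_chunk))
            if not lines:
                break
            chunk_path = out_dir / f"chunk_{index:05d}.txt"
            chunk_path.write_bytes(b"".join(lines))
            chunk_paths.append(chunk_path)
    return chunk_paths


def load_statement(chunk_path, table):
    """Build the LOAD DATA INFILE statement for one chunk (table name is illustrative)."""
    return f"LOAD DATA INFILE '{Path(chunk_path).as_posix()}' INTO TABLE {table};"
```

Each generated statement would then be run one at a time against the SQL node, so that no single LOAD DATA INFILE carries more than `rows_per_chunk` rows.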

Hi Mateusz,

- Were you able to reproduce the issue?

- Are you still seeing SQL nodes crash under high load? Can you upload the NDB error report, the cluster logs and configuration file, and the mysql error log and configuration files again, please?

- Do the SQL nodes crash only under high load, and otherwise run fine (100% of the time)?

Please let me know.

Thanks.

Hi Syed,

We decided to switch to Diskless mode (we were concerned about the heavy writes the cluster was making to our SSDs), after which everything worked much better and was more stable (but not perfect).

We still experienced crashes, but we managed to isolate one of the causes. Some of the data from the NDB tables was copied to InnoDB tables, where we were running some analytic queries. One of the queries used GROUP BY on a few very large tables; it created a big internal temporary table which didn't fit into RAM and was moved to disk. The results were then sorted, which (because the data was on disk) caused 100% disk usage for a couple of minutes, and after a few minutes of 100% disk usage the cluster was down. Every single time. Even though we were using Diskless mode. This was surely one of the reasons we experienced crashes earlier; however, it caused all NDB nodes to crash, not the SQL nodes.
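One way to check whether a given query spills its internal temporary table to disk is to compare the `Created_tmp_tables` and `Created_tmp_disk_tables` status counters before and after running it. The Python sketch below only does the arithmetic on two snapshots; it assumes the snapshots were already fetched (for example with SHOW SESSION STATUS) into plain dicts, and the function names are hypothetical.

```python
def tmp_table_usage(before, after):
    """Return (tmp_tables, tmp_disk_tables) created between two status snapshots.

    Each snapshot is a dict mapping status variable names to values,
    as strings or ints, e.g. built from SHOW SESSION STATUS rows.
    """
    def delta(name):
        return int(after.get(name, 0)) - int(before.get(name, 0))

    return delta("Created_tmp_tables"), delta("Created_tmp_disk_tables")


def spilled_to_disk(before, after):
    """True if at least one on-disk temporary table was created in between."""
    return tmp_table_usage(before, after)[1] > 0
```

If `spilled_to_disk` reports True for a query, that query is a candidate for the kind of rewrite described above.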

After we changed the queries so that MySQL no longer sorts huge tables on disk, and with the cluster running in Diskless mode, MySQL Cluster stopped crashing. Still, another problem came up, which I described here: http://forums.mysql.com/read.php?25,583570,583570 ... If you could point me to some info on why it's happening, I'd be grateful.

I'm not able to upload an NDB error report, since the cluster is not crashing anymore and I no longer have the one I uploaded when submitting this bug report.
Crashes happened only under high load; otherwise everything ran fine (unless we tried to run "alter table" on a temporary InnoDB table, which I believe is a different bug, causing the whole cluster to crash).

If you have any more questions, please let me know.

Thanks.

Hi Mateusz,

Can you upload the cluster log and configuration files, and the mysql error log and configuration files, please? Please make sure you upload the log files that contain the errors/warnings.

Thanks.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Are you using NIC bonding? If yes, which bonding mode are you using?