MySQL Bugs: #91865: mysqld crashes with signal 6

Bug #91865	mysqld crashes with signal 6
Submitted:	2 Aug 2018 8:17	Modified:	10 Aug 2018 13:43
Reporter:	Hendrik Woltersdorf
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.6.41 ndb-7.4.21	OS:	CentOS (6.3)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86

Description:
A test system of 4 machines, with 2 nodes each of type management, data and SQL.
After one machine hosting an SQL node died, we made a copy of the surviving SQL node at the operating system level. This copy runs inside a virtual machine.
With this cloned SQL node we often see crashes of this type:
glibc detected *** /opt/mysql/bin/mysqld: munmap_chunk(): invalid pointer: 0x00007fcd80f39fb0 ***
...

How to repeat:
The crash happens often, but not always, when I call a stored procedure (call SP_UEBERWACHUNG('');)

error as seen from the client

Attachment: error.txt (text/plain), 8.07 KiB.

I can't upload the file from ndb_error_reporter (11 MB) via SFTP because of network security limitations.

reduced ndb_error_reporter files

Attachment: mysql-bug-data-91865_v2.tar.bz2 (application/octet-stream, text), 2.26 MiB.

the stored procedure mentioned

Attachment: sp_ueberwachung.sql (application/octet-stream, text), 20.39 KiB.

Hi,

In order to see what's going on we need all log files (the easiest way to collect them is with the ndb_error_reporter tool).

Why are you "cloning" SQL node?

best regards
Bogdan

I collected the files using ndb_error_reporter. I just deleted some large old log files.
We lost one machine and had to set up a new one. Our system administrators suggested cloning the surviving SQL node at the OS level into a new virtual machine, and that's what we did. Anything wrong with that?

Hi,

I see the logs, apologies. I was thinking about one thing and writing another: ndb_error_reporter does not collect the mysqld log files, so what I wanted to ask for is the full SQL log file of the failing node. It is questionable whether I'll see anything there that's not already in error.txt, but we might find something useful, so please upload it if you can.

Now, I can't reproduce this. And no, just cloning the node is not a good way to go about it, as both the NDB filesystem and the mysqld datadir are "wrong". If you don't want to install SQL and ndbmtd on a new node you can clone the existing one, but you have to remove the NDB filesystem (start the node with --initial, or manually delete the filesystem before starting the node). mysqld should work OK with a cloned filesystem, but I personally like to clear its datadir too before connecting to the cluster.
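
As a rough illustration of the --initial start mentioned above, the following Python sketch only builds the command line and does not run it. The management host name `mgmhost` and port 1186 are placeholders (assumptions, not values from this report), and the function name is made up for the example.

```python
import shlex


def ndbmtd_initial_command(connectstring, binary="ndbmtd"):
    """Build (not run) the command for an initial start of a data node."""
    # --initial makes the data node erase the recovery files left behind by
    # earlier instances, so a cloned NDB filesystem is not reused.
    return [binary, "--initial", "--ndb-connectstring=" + connectstring]


# "mgmhost:1186" is a hypothetical management server address.
cmd = ndbmtd_initial_command("mgmhost:1186")
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the argument list separately makes it easy to review the command before anything destructive is started on the node.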

If I understand you correctly, the original SQL node never crashes, only the new cloned one? If that's correct, I'd readily assume there's a filesystem issue on the cloned node.

kind regards
Bogdan

mysqld log of the cloned node

Attachment: etq-dusv-dbcl2.zip (application/x-zip-compressed, text), 27.42 KiB.

mysqld log of the original node

Attachment: etq-wil-dbcl2.zip (application/x-zip-compressed, text), 23.34 KiB.

I added the log files of the two sql nodes.
The original node crashed too, but less often.

Hi,
Thanks for the logs and clarification. Let me analyze this and I'll get back to you.

all best
Bogdan

Yesterday I recreated the MySQL Cluster SQL node on the cloned machine.
That means:
- stop mysqld
- delete everything in 'datadir'
- start from scratch with mysql_install_db
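
The last two steps could be sketched in Python roughly as follows. This is only an illustration under assumptions: mysqld is already stopped, `/opt/mysql` is the basedir (guessed from the mysqld path in the crash message), and the helper names are invented for this example. The demonstration runs against a throwaway directory, not a real datadir.

```python
import shutil
import tempfile
from pathlib import Path


def empty_directory(directory):
    """Delete everything inside `directory` but keep the directory itself."""
    removed = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


def install_db_command(basedir, datadir, user="mysql", script="mysql_install_db"):
    """Build (not run) the mysql_install_db call that recreates the system tables."""
    return [script, "--user=" + user, "--basedir=" + basedir, "--datadir=" + datadir]


# Demonstration on a temporary directory with made-up stale files.
with tempfile.TemporaryDirectory() as demo:
    (Path(demo) / "hacom").mkdir()
    (Path(demo) / "hacom" / "SYSTEM_CACHE.frm").write_text("stale")
    (Path(demo) / "ibdata1").write_text("stale")
    print(empty_directory(demo))
    print(install_db_command("/opt/mysql", demo))
```

The install command is returned as a list instead of being executed, so it can be checked against the actual installation layout first.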

Since then (one day so far) there have been no crashes and no more messages like:
"Incorrect information in file: './hacom/SYSTEM_CACHE.frm'" on one SQL node whenever a "truncate table SYSTEM_CACHE" was issued on the other one.
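
To keep watching for a recurrence, a small scan over the mysqld error log could count the two symptoms from this report. The sketch below is only an illustration: the signature names are made up, the first two sample lines are copied from this report, and the third is an invented filler line.

```python
import re

# Patterns for the two symptoms described in this report.
SIGNATURES = {
    "glibc_invalid_pointer": re.compile(r"munmap_chunk\(\): invalid pointer"),
    "incorrect_frm": re.compile(r"Incorrect information in file: '([^']+\.frm)'"),
}


def count_signatures(lines):
    """Count how many lines match each known symptom pattern."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "glibc detected *** /opt/mysql/bin/mysqld: munmap_chunk(): invalid pointer: 0x00007fcd80f39fb0 ***",
    "Incorrect information in file: './hacom/SYSTEM_CACHE.frm'",
    "some unrelated line",  # invented filler, not from the real log
]
print(count_signatures(sample))
```

In practice the `lines` argument could be an open file handle on the error log, since the function only iterates over it.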

Hi,
Looks like cloning was the problem. You can't clone the data filesystem (neither mysqld's nor ndbmtd's) for new nodes; those need to be clean. Hopefully this solves the problem. Let me know if your experience shows otherwise, but for now I'm setting this to "Not a Bug".

all best
Bogdan