MySQL Bugs: #90866: mysqld segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4

Bug #90866	mysqld segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4
Submitted:	15 May 2018 5:50	Modified:	2 Sep 2018 0:45
Reporter:	Hendrik Woltersdorf	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.6.40 ndb-7.4.20	OS:	CentOS (6.4 and 7.4)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86

Description:
After upgrading our test systems from Mysql Cluster 7.4.19 to 7.4.20 we see sporadic (once a week or every two weeks) segfaults of some SQL nodes.

We saw this in two systems:
a) the system described in #90083
b) another cluster hosted on Centos 7.4

There are no informations about the crash in the mysql logs, but we see the segfault in /var/log/messages:

...
May 14 12:34:01 eti-dus-ha-cl2 ntpdate[12011]: adjust time server 62.192.195.137 offset -0.003584 sec
May 14 12:47:01 eti-dus-ha-cl2 kernel: mysqld[11197]: segfault at 7f24f42311d8 ip 0000000000c2206a sp 00007f1d32cf8fc0 error 4 in mysqld[400000+d8f000]
May 14 12:49:01 eti-dus-ha-cl2 ntpdate[14776]: adjust time server 62.192.195.137 offset -0.003869 sec
...

...
May  8 15:20:01 fgd-dcv-ha-cl1 systemd: Starting Session 29346 of user root.
May  8 15:29:01 fgd-dcv-ha-cl1 kernel: mysqld[8192]: segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4 in mysqld[400000+d8f000]
May  8 15:30:01 fgd-dcv-ha-cl1 systemd: Started Session 29347 of user root.
...

For details about system a) see #90083.
For system b) I will attach the mysql server log.

How to repeat:
no idea.

mysql server log of system b)

Attachment: fgd-dcv-ha-cl1_log.zip (application/x-zip-compressed, text), 11.23 KiB.

Today's crash on fgd-dcv-ha-cl1 created a few lines in the server log:

03:24:39 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=6291456
read_buffer_size=262144
max_used_connections=261
max_threads=2000
thread_count=261
connection_count=260
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1567737 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x3ddf110
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

One observation:
Early in June I changed the setting ndb-cluster-connection-pool of the affected SQL nodes from "ndb-cluster-connection-pool=1" to "ndb-cluster-connection-pool=2".
Since then, for almost one month, I see no crashes any more.

Hi,

I cannot reproduce this nor I can find enough data in the crash reports.

Since you have this "on regular intervals" were you be willing to run a debug version of the mysqld so we might collect more data?

best regards
Bogdan

Yes, it's no problem, to run a debug version in one of our test environments.

If I use mysqld-debug with the standard options, it produces a huge trace file; much more the disks of the test system can store.

Can you recommend the settings to use, to save space?
I need to run this for at least one week, to see a crash (on average).

error log from SQL node 1 from 2018-07-16

Attachment: fgd-dcv-ha-cl1_20180716.zip (application/x-zip-compressed, text), 2.60 KiB.

Today I see intersting error messages in the error log of the SQL node, running mysqld-debug. (see fgd-dcv-ha-cl1_20180716.zip). The node process did not crash and is still running. 
Sorry, there is no trace file of this time range; because of the huge size I have to recreate it every 15 minutes.

Hi,

rotating logs is the only viable solution, as for the uploaded data I did not understand, the node did not died on this TCRELEASE sent error? it worked ok and continued to work ok?

all best
Bogdan

Yes, the SQL node did not die and is working. At 14:20 I saw reconnect messages in the logs of the other cluster nodes, but that's it.

Of cause the clients, connected to this SQL node, got a few errors (Java/JDBC):

2018-07-16 12:21:09,864 null    062096229196    03018   023617  OUT     ERR     HB_1531736453519        de.xcom.ha.mb.jdbcstatement.RefreshServerStatement    x0        call [select I_OPTION, USED_FOR_GLOBAL_ID, STATE from SERVER_REPOSITORY  where ID_SERVER_REPOSITORY = 103440 ] exception [java.sql.SQLExceptio
n: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER] warnings [Got temporary error 4010 'Node failure caused abort of transa
ction' from NDB;Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER;Got error 4350 'Transaction already aborted' from NDB;Got e
rror 4350 'Transaction already aborted' from NDBCLUSTER;Got error 4350 during ROLLBACK]  - Laufzeit: 0 ms
2018-07-16 12:21:09,897 null    062096229196    03018   023617  NON     ERR     HB_1531736453519        com.mysql.jdbc.SQLError createSQLException      964    ExceptionChecker [java.sql.SQLException SQLState HY000 SQLErrorCode 1297]

Running debug versions produces massive amount of log messages, we're working on
decreasing the amount of logs produced in debug builds. But I am a bit surprised that your
crash printout doesn't contain a proper crash trace. I think production builds are usually
still keeping symbols around to be able to handle core files.

So if you could run a production version that is unstripped of symbols it would provide a
proper stack trace of crash. This should help to know how to proceed with the bug analysis.

The production version is not stripped:
fgd-dcv-ha-cl1 root# file /opt/mysql/bin/mysqld
/opt/mysql/bin/mysqld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=8a27250514c097fd60268809f511826602790078, not stripped

But you can see in my two first comments, what (little) output it produced.

Please have a look at bug #90083. I think it's related and contains more stack trace.

log snippet of the crashed sql node v2

Attachment: fgd-dcv-ha-cl1_20180718.zip (application/x-zip-compressed, text), 1.79 KiB.

Tonight we saw another incident like the one from 2018-07-16. I added the log snippet.
I switched back to the production version of mysqld to get maybe a stack trace.
The shutdown prozess of the debug-server stopped at:
2018-07-18 11:50:42 3435 [Note] Shutting down plugin 'ndbcluster'
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Wakeup
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopping...
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopped
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop completed
2018-07-18 11:50:42 3435 [Note] NDB Util: Stop
2018-07-18 11:50:42 3435 [Note] NDB Util: Wakeup
2018-07-18 11:50:42 3435 [Note] NDB Util: Stop completed
2018-07-18 11:50:42 [NdbApi] WARNING  -- Waiting for Ndb instances belonging to Ndb_cluster_connection 0x2b3fb40 to be deleted...

It had to be killed "-9".

Sorry but I can not reproduce this nor I can find reason why it is crashing. Nor I see any changes between 7.4.19 and 7.4.20 that could cause this.

bogdan