Bug #90866 mysqld segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4
Submitted: 15 May 2018 5:50 Modified: 2 Sep 2018 0:45
Reporter: Hendrik Woltersdorf Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.6.40 ndb-7.4.20 OS:CentOS (6.4 and 7.4)
Assigned to: Bogdan Kecman CPU Architecture:x86

[15 May 2018 5:50] Hendrik Woltersdorf
Description:
After upgrading our test systems from Mysql Cluster 7.4.19 to 7.4.20 we see sporadic (once a week or every two weeks) segfaults of some SQL nodes.

We saw this in two systems:
a) the system described in #90083
b) another cluster hosted on Centos 7.4

There are no informations about the crash in the mysql logs, but we see the segfault in /var/log/messages:

...
May 14 12:34:01 eti-dus-ha-cl2 ntpdate[12011]: adjust time server 62.192.195.137 offset -0.003584 sec
May 14 12:47:01 eti-dus-ha-cl2 kernel: mysqld[11197]: segfault at 7f24f42311d8 ip 0000000000c2206a sp 00007f1d32cf8fc0 error 4 in mysqld[400000+d8f000]
May 14 12:49:01 eti-dus-ha-cl2 ntpdate[14776]: adjust time server 62.192.195.137 offset -0.003869 sec
...

...
May  8 15:20:01 fgd-dcv-ha-cl1 systemd: Starting Session 29346 of user root.
May  8 15:29:01 fgd-dcv-ha-cl1 kernel: mysqld[8192]: segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4 in mysqld[400000+d8f000]
May  8 15:30:01 fgd-dcv-ha-cl1 systemd: Started Session 29347 of user root.
...

For details about system a) see #90083.
For system b) I will attach the mysql server log.

How to repeat:
no idea.
[15 May 2018 5:51] Hendrik Woltersdorf
mysql server log of system b)

Attachment: fgd-dcv-ha-cl1_log.zip (application/x-zip-compressed, text), 11.23 KiB.

[23 May 2018 12:04] Hendrik Woltersdorf
Today's crash on fgd-dcv-ha-cl1 created a few lines in the server log:

03:24:39 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=6291456
read_buffer_size=262144
max_used_connections=261
max_threads=2000
thread_count=261
connection_count=260
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1567737 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x3ddf110
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
[2 Jul 2018 6:50] Hendrik Woltersdorf
One observation:
Early in June I changed the setting ndb-cluster-connection-pool of the affected SQL nodes from "ndb-cluster-connection-pool=1" to "ndb-cluster-connection-pool=2".
Since then, for almost one month, I see no crashes any more.
[11 Jul 2018 0:59] Bogdan Kecman
Hi,

I cannot reproduce this nor I can find enough data in the crash reports.

Since you have this "on regular intervals" were you be willing to run a debug version of the mysqld so we might collect more data?

best regards
Bogdan
[11 Jul 2018 5:20] Hendrik Woltersdorf
Yes, it's no problem, to run a debug version in one of our test environments.
[11 Jul 2018 8:36] Hendrik Woltersdorf
If I use mysqld-debug with the standard options, it produces a huge trace file; much more the disks of the test system can store.

Can you recommend the settings to use, to save space?
I need to run this for at least one week, to see a crash (on average).
[16 Jul 2018 12:01] Hendrik Woltersdorf
error log from SQL node 1 from 2018-07-16

Attachment: fgd-dcv-ha-cl1_20180716.zip (application/x-zip-compressed, text), 2.60 KiB.

[16 Jul 2018 12:04] Hendrik Woltersdorf
Today I see intersting error messages in the error log of the SQL node, running mysqld-debug. (see fgd-dcv-ha-cl1_20180716.zip). The node process did not crash and is still running. 
Sorry, there is no trace file of this time range; because of the huge size I have to recreate it every 15 minutes.
[16 Jul 2018 20:02] Bogdan Kecman
Hi,

rotating logs is the only viable solution, as for the uploaded data I did not understand, the node did not died on this TCRELEASE sent error? it worked ok and continued to work ok?

all best
Bogdan
[17 Jul 2018 5:09] Hendrik Woltersdorf
Yes, the SQL node did not die and is working. At 14:20 I saw reconnect messages in the logs of the other cluster nodes, but that's it.
[17 Jul 2018 6:19] Hendrik Woltersdorf
Of cause the clients, connected to this SQL node, got a few errors (Java/JDBC):

2018-07-16 12:21:09,864 null    062096229196    03018   023617  OUT     ERR     HB_1531736453519        de.xcom.ha.mb.jdbcstatement.RefreshServerStatement    x0        call [select I_OPTION, USED_FOR_GLOBAL_ID, STATE from SERVER_REPOSITORY  where ID_SERVER_REPOSITORY = 103440 ] exception [java.sql.SQLExceptio
n: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER] warnings [Got temporary error 4010 'Node failure caused abort of transa
ction' from NDB;Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER;Got error 4350 'Transaction already aborted' from NDB;Got e
rror 4350 'Transaction already aborted' from NDBCLUSTER;Got error 4350 during ROLLBACK]  - Laufzeit: 0 ms
2018-07-16 12:21:09,897 null    062096229196    03018   023617  NON     ERR     HB_1531736453519        com.mysql.jdbc.SQLError createSQLException      964    ExceptionChecker [java.sql.SQLException SQLState HY000 SQLErrorCode 1297]
[17 Jul 2018 10:14] Mikael Ronström
Running debug versions produces massive amount of log messages, we're working on
decreasing the amount of logs produced in debug builds. But I am a bit surprised that your
crash printout doesn't contain a proper crash trace. I think production builds are usually
still keeping symbols around to be able to handle core files.

So if you could run a production version that is unstripped of symbols it would provide a
proper stack trace of crash. This should help to know how to proceed with the bug analysis.
[17 Jul 2018 10:46] Hendrik Woltersdorf
The production version is not stripped:
fgd-dcv-ha-cl1 root# file /opt/mysql/bin/mysqld
/opt/mysql/bin/mysqld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=8a27250514c097fd60268809f511826602790078, not stripped

But you can see in my two first comments, what (little) output it produced.

Please have a look at bug #90083. I think it's related and contains more stack trace.
[18 Jul 2018 9:56] Hendrik Woltersdorf
log snippet of the crashed sql node v2

Attachment: fgd-dcv-ha-cl1_20180718.zip (application/x-zip-compressed, text), 1.79 KiB.

[18 Jul 2018 9:59] Hendrik Woltersdorf
Tonight we saw another incident like the one from 2018-07-16. I added the log snippet.
I switched back to the production version of mysqld to get maybe a stack trace.
The shutdown prozess of the debug-server stopped at:
2018-07-18 11:50:42 3435 [Note] Shutting down plugin 'ndbcluster'
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Wakeup
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopping...
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopped
2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop completed
2018-07-18 11:50:42 3435 [Note] NDB Util: Stop
2018-07-18 11:50:42 3435 [Note] NDB Util: Wakeup
2018-07-18 11:50:42 3435 [Note] NDB Util: Stop completed
2018-07-18 11:50:42 [NdbApi] WARNING  -- Waiting for Ndb instances belonging to Ndb_cluster_connection 0x2b3fb40 to be deleted...

It had to be killed "-9".
[2 Sep 2018 0:45] Bogdan Kecman
Sorry but I can not reproduce this nor I can find reason why it is crashing. Nor I see any changes between 7.4.19 and 7.4.20 that could cause this.

bogdan