Bug #90866 | mysqld segfault at 7f153814da98 ip 0000000000c2206a sp 00007f0d742dd040 error 4 | ||
---|---|---|---|
Submitted: | 15 May 2018 5:50 | Modified: | 2 Sep 2018 0:45 |
Reporter: | Hendrik Woltersdorf | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 5.6.40 ndb-7.4.20 | OS: | CentOS (6.4 and 7.4) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 |
[15 May 2018 5:50]
Hendrik Woltersdorf
[15 May 2018 5:51]
Hendrik Woltersdorf
mysql server log of system b)
Attachment: fgd-dcv-ha-cl1_log.zip (application/x-zip-compressed, text), 11.23 KiB.
[23 May 2018 12:04]
Hendrik Woltersdorf
Today's crash on fgd-dcv-ha-cl1 created a few lines in the server log: 03:24:39 UTC - mysqld got signal 11 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. key_buffer_size=6291456 read_buffer_size=262144 max_used_connections=261 max_threads=2000 thread_count=261 connection_count=260 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1567737 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Thread pointer: 0x3ddf110 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong...
[2 Jul 2018 6:50]
Hendrik Woltersdorf
One observation: Early in June I changed the setting ndb-cluster-connection-pool of the affected SQL nodes from "ndb-cluster-connection-pool=1" to "ndb-cluster-connection-pool=2". Since then, for almost one month, I see no crashes any more.
[11 Jul 2018 0:59]
MySQL Verification Team
Hi, I cannot reproduce this nor I can find enough data in the crash reports. Since you have this "on regular intervals" were you be willing to run a debug version of the mysqld so we might collect more data? best regards Bogdan
[11 Jul 2018 5:20]
Hendrik Woltersdorf
Yes, it's no problem, to run a debug version in one of our test environments.
[11 Jul 2018 8:36]
Hendrik Woltersdorf
If I use mysqld-debug with the standard options, it produces a huge trace file; much more the disks of the test system can store. Can you recommend the settings to use, to save space? I need to run this for at least one week, to see a crash (on average).
[16 Jul 2018 12:01]
Hendrik Woltersdorf
error log from SQL node 1 from 2018-07-16
Attachment: fgd-dcv-ha-cl1_20180716.zip (application/x-zip-compressed, text), 2.60 KiB.
[16 Jul 2018 12:04]
Hendrik Woltersdorf
Today I see intersting error messages in the error log of the SQL node, running mysqld-debug. (see fgd-dcv-ha-cl1_20180716.zip). The node process did not crash and is still running. Sorry, there is no trace file of this time range; because of the huge size I have to recreate it every 15 minutes.
[16 Jul 2018 20:02]
MySQL Verification Team
Hi, rotating logs is the only viable solution, as for the uploaded data I did not understand, the node did not died on this TCRELEASE sent error? it worked ok and continued to work ok? all best Bogdan
[17 Jul 2018 5:09]
Hendrik Woltersdorf
Yes, the SQL node did not die and is working. At 14:20 I saw reconnect messages in the logs of the other cluster nodes, but that's it.
[17 Jul 2018 6:19]
Hendrik Woltersdorf
Of cause the clients, connected to this SQL node, got a few errors (Java/JDBC): 2018-07-16 12:21:09,864 null 062096229196 03018 023617 OUT ERR HB_1531736453519 de.xcom.ha.mb.jdbcstatement.RefreshServerStatement x0 call [select I_OPTION, USED_FOR_GLOBAL_ID, STATE from SERVER_REPOSITORY where ID_SERVER_REPOSITORY = 103440 ] exception [java.sql.SQLExceptio n: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER] warnings [Got temporary error 4010 'Node failure caused abort of transa ction' from NDB;Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER;Got error 4350 'Transaction already aborted' from NDB;Got e rror 4350 'Transaction already aborted' from NDBCLUSTER;Got error 4350 during ROLLBACK] - Laufzeit: 0 ms 2018-07-16 12:21:09,897 null 062096229196 03018 023617 NON ERR HB_1531736453519 com.mysql.jdbc.SQLError createSQLException 964 ExceptionChecker [java.sql.SQLException SQLState HY000 SQLErrorCode 1297]
[17 Jul 2018 10:14]
Mikael Ronström
Running debug versions produces massive amount of log messages, we're working on decreasing the amount of logs produced in debug builds. But I am a bit surprised that your crash printout doesn't contain a proper crash trace. I think production builds are usually still keeping symbols around to be able to handle core files. So if you could run a production version that is unstripped of symbols it would provide a proper stack trace of crash. This should help to know how to proceed with the bug analysis.
[17 Jul 2018 10:46]
Hendrik Woltersdorf
The production version is not stripped: fgd-dcv-ha-cl1 root# file /opt/mysql/bin/mysqld /opt/mysql/bin/mysqld: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=8a27250514c097fd60268809f511826602790078, not stripped But you can see in my two first comments, what (little) output it produced. Please have a look at bug #90083. I think it's related and contains more stack trace.
[18 Jul 2018 9:56]
Hendrik Woltersdorf
log snippet of the crashed sql node v2
Attachment: fgd-dcv-ha-cl1_20180718.zip (application/x-zip-compressed, text), 1.79 KiB.
[18 Jul 2018 9:59]
Hendrik Woltersdorf
Tonight we saw another incident like the one from 2018-07-16. I added the log snippet. I switched back to the production version of mysqld to get maybe a stack trace. The shutdown prozess of the debug-server stopped at: 2018-07-18 11:50:42 3435 [Note] Shutting down plugin 'ndbcluster' 2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop 2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Wakeup 2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopping... 2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stopped 2018-07-18 11:50:42 3435 [Note] NDB Index Stat: Stop completed 2018-07-18 11:50:42 3435 [Note] NDB Util: Stop 2018-07-18 11:50:42 3435 [Note] NDB Util: Wakeup 2018-07-18 11:50:42 3435 [Note] NDB Util: Stop completed 2018-07-18 11:50:42 [NdbApi] WARNING -- Waiting for Ndb instances belonging to Ndb_cluster_connection 0x2b3fb40 to be deleted... It had to be killed "-9".
[2 Sep 2018 0:45]
MySQL Verification Team
Sorry but I can not reproduce this nor I can find reason why it is crashing. Nor I see any changes between 7.4.19 and 7.4.20 that could cause this. bogdan