MySQL Bugs: #73392: Server crashes (segfaul) at shutdown or slave start

Bug #73392	Server crashes (segfaul) at shutdown or slave start - NO MERGE TABLES
Submitted:	25 Jul 2014 13:13	Modified:	5 Sep 2014 19:12
Reporter:	Mike Diehn	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	5.6.19	OS:	Linux (CentOS 6.5 with mysql-community repos)
Assigned to:		CPU Architecture:	Any

Description:
I've dumped a 100 GB database from a 5.1.73 server and imported it into a 5.6.19 server.  I'll attach the my.cnf from the new server and a mysqld.log.  I'm using InnoDB in file-per-table mode.

The problem is that when I try to stop mysqld with "service mysqld stop" or "mysqladmin shutdown", it segfaults and then mysqld_safe restarts it.  I can get it down by kill -KILLing the mysqld_safe process and then the mysqld process.

The biggest problem, though, is that from then on, it also segfaults when I try to "START SLAVE."

Bug reports I checked:
  69574: I do NOT have any merge tables anywhere in the database
  71546: does not appear to be like the tmp table bug.
  70194: my stack traces are nothing like these and the behavior is different.

Other info:

  This is a slave in a replication chain:
    first master  is 5.1.69, for redhat-linux-gnu (x86_64)
    master/slave  is 5.1.73, for redhat-linux-gnu (x86_64)
    last slave    is 5.6.19, for Linux (x86_64)

  When I first restore the mysqldump into the new server, I can start
  and stop the server and I can even start replication and it will
  catch up.  At some point, and I think it's anytime after I've restored
  the database, I try to stop the server and it crashes.  From then on
  there's no starting the replication anymore.

As soon as I move the data directory "out of the way" and let mysql create a fresh one, we go back to normal operations.

I tried upgrading to mysql 5.7 and the problem got worse, if anything.  Or at least the log files were noisier.

I'm about to try version 5.5.

Here are the log entries and stack trace from a crash at start slave:

2014-07-24 18:33:04 3877 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.19'  socket: '(null)'  port: 3306  MySQL Community Server (GPL)
22:33:34 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=1
max_threads=500
thread_count=2
connection_count=1
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 207043 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f89c0000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2014-07-24 18:33:34 3877 [Warning] Slave SQL: If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2014-07-24 18:33:34 3877 [Note] Slave SQL thread initialized, starting replication in log 'test-db02-mysqld-bin.001503' at position 42834278, relay log './tdx-db03-relay-bin.000057' position: 4
stack_bottom = 7f89f7e1ed80 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8d6885]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x663234]
/lib64/libpthread.so.0(+0xf710)[0x7f8b7b5d4710]
/lib64/libc.so.6(+0x13385f)[0x7f8b7a38085f]
/usr/sbin/mysqld(my_strdup+0x1e)[0x8d21ce]
/usr/sbin/mysqld(mysql_options+0x513)[0x77daa3]
/usr/sbin/mysqld[0x8a40e2]
/usr/sbin/mysqld(handle_slave_io+0x812)[0x8ae242]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xafc8ea]
/lib64/libpthread.so.0(+0x79d1)[0x7f8b7b5cc9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f8b7a335b5d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 2
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
140724 18:33:34 mysqld_safe Number of processes running now: 0
140724 18:33:34 mysqld_safe mysqld restarted

From then on, any attempts to stop the mysqld end with:
  01:22:17 UTC - mysqld got signal 11 ;
  [boilerplate for segfaults]
  terribly wrong...

and nothing else.

How to repeat:
Build fresh CentOS 6.5 Minimal VM.
Add mysql-community repo
Install mysql-community-server 5.6.19 and friends.
service start mysqld
create a database named 'cdna'
then "mysql cdna < cdna-mysqldump-opt.out"
once it's down, which takes about five hours, configure replication
change master to master_host='tdx-db02', master_log_file='tdx-db02-mysqld-bin.001499', master_log_pos='4', master_log_user='username', master_log_user='password'.
start slave;
show slave status;
observe replication running and that after a few hours it's all caught up.
service mysqld stop
ps faux | grep mysqld
observe msyqld_safe and mysqld both running. mysqld_save still has many hours old start time but mysqld has moments old start time.
observe /var/log/mysqld has "something went terribly wrong" crash message as the last entry before mysqld_safe messages about starting another mysqld process.
try start slave;
note lost connection and check the logs.
discover that it's crashed again, this time leaving a small stack trace.
repeat ad nauseam.

Suggested fix:
I have no clue.

my.cnf for mysqld 5.6.19 that crashed

Attachment: my.cnf.2014072403 (application/octet-stream, text), 987 bytes.

log file with crashes and stack traces

Attachment: mysqld.log.borked.4 (application/octet-stream, text), 42.60 KiB.

Fixed typo in synopsis

Thank you for the report.

Does your backup include mysql database? If yes, have you run mysql_upgrade?

I've done this four times now, I think.

The first time, I'd actually shutdown the master/slave (in the middle) and used rsync to copy it's data directory to the new machine.  Then I ran mysql_upgrade.  Then started repl.  Then it started crashing.

Thinking I'd done something stupid, I tried again with the same result.

So next I made a dump with --all-databases, restored that into a pristine data directory. The dump did have the mysql database in it.  And yes, I did run mysql_upgrade on it.  Same crashes.

Then I tried restoring a dump of just the 'cdna' database into yet another 5.6.19 pristine installation.  After the restoration, the replication started, caught up and then the shutdown command crashed the server. Since then, the start slave command will crash the server.

That was the most recent attempt and it's where I'm stuck at the moment.

Hey - I just tried this with MariaDB 10 and had the same result.  As I recall, MariaDB 10 is based on MySQL 5.5 and has backports of some 5.6 features.

Just an FYI.

Thank you for the feedback.

Try to identify if this is single table which crashes the server: try single-table backup. Interesting that in the error log you provided crash happened, most likely, when server tried to set a variable. Please also send us error log file from the server where you restored database cdna only: I want to check if backtrace is same.

The log I attached *is* for the cdna only restoration.  I have 73 tables in this database.  Were you thinking I'd try each one alone? 

This is the most frequent and consistent stack trace I get.  It happens when I attempt "START SLAVE" on the 5.6.19 slave server.  You can see the call to handle_slave_io in there and so I think it's got something to do with that.

/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8d6885]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x663234]
/lib64/libpthread.so.0(+0xf710)[0x7f8b7b5d4710]
/lib64/libc.so.6(+0x13385f)[0x7f8b7a38085f]
/usr/sbin/mysqld(my_strdup+0x1e)[0x8d21ce]
/usr/sbin/mysqld(mysql_options+0x513)[0x77daa3]
/usr/sbin/mysqld[0x8a40e2]
/usr/sbin/mysqld(handle_slave_io+0x812)[0x8ae242]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xafc8ea]
/lib64/libpthread.so.0(+0x79d1)[0x7f8b7b5cc9d1]
/lib64/libc.so.6(clone+0x6d)[0x7f8b7a335b5d]

Thank you for the feedback.

Well, I used to debug this way even with more than 73 tables.

Anyway it is still strange that your dump affects START SLAVE command and backtrace complains about options, not about table corruption or some other affect on your database.

Please send us output of SHOW TABLES FROM cdna and try to comment out option "relay_log_info_repository = TABLE" in your configuration file (or change it to relay_log_info_repository = FILE) and inform us if this solves the problem.

Hi Sveta,  I did that - commented out the "crash safe slave" options - and it made no difference either.  Here's the output of SHOW TABLES:

mysql(root@localhost:6) cdna> show tables;
+-------------------------------------+
| Tables_in_cdna                      |
+-------------------------------------+
| AnnotationRuleValue                 |
| AnnotationSet                       |
| AnnotationTableName                 |
| Architecture                        |
| Attribute                           |
| AutoGUIConfig                       |
| AutoGuiMode                         |
| BackwardMode                        |
| Boolean                             |
| Compiler                            |
| Component                           |
| Config                              |
| DATABASECHANGELOG                   |
| DATABASECHANGELOGLOCK               |
| DependentMethod                     |
| DeveloperSpecialties                |
| Employee                            |
| ExpectedTestArchitectureNp          |
| ExpectedTestManagementGroup         |
| ExpectedTestTypeMode                |
| Group                               |
| GuiMode                             |
| IterationMaximumMemoryMetrics       |
| JiraUser                            |
| Machine                             |
| Message                             |
| Module                              |
| ModuleError                         |
| ModuleResult                        |
| Office                              |
| Owner                               |
| PerformanceData                     |
| PreConfigModule                     |
| PreConfigSuite                      |
| Product                             |
| ProductVersion                      |
| ProductVersionToTest                |
| Project                             |
| ResolveEmployeeSpecialty            |
| Server                              |
| Setting                             |
| SolverIterationData                 |
| SolverMetrics                       |
| SolverTimeStepData                  |
| Specialty                           |
| Suite                               |
| SuiteAnnotationRule                 |
| TCF_MISSING                         |
| Test                                |
| TestCase                            |
| TestCaseFile                        |
| TestFramework                       |
| TestManagementGroup                 |
| TestMode                            |
| TestStatus                          |
| TestSuiteResult                     |
| TestSuiteResultFile                 |
| TestType                            |
| UnitMode                            |
| ViewConfig                          |
| ViewModuleResult                    |
| ViewPreConfigModule                 |
| ViewPreConfigSuite                  |
| ViewProductVersionSuiteAnnotation   |
| ViewProductVersionSuiteAttribute    |
| ViewProductVersionSuiteBackwardMode |
| ViewProductVersionSuiteExcludeOSes  |
| ViewProductVersionSuiteGroup        |
| ViewProductVersionSuiteOwner        |
| ViewProductVersionSuiteSandbox      |
| ViewProductVersionSuiteServer       |
| ViewProductVersionSuiteTestType     |
| ViewProductVersionSuiteUnitMode     |
| ViewTestCase                        |
| ViewTestSuiteResult                 |
| WhoSpecializesIn                    |
| membership_grouppermissions         |
| membership_groups                   |
| membership_userrecords              |
| membership_users                    |
| tempDuplicateModuleErrors           |
+-------------------------------------+
81 rows in set (0.00 sec)

Thank you for the feedback.

I did few tests and could not repeat described behavior. I assume your master server is 5.1? Please send us its option file.

Hi Sveta, the master is:

[root@tdx-db02 ~]# /usr/libexec/mysqld --verbose --help | more
/usr/libexec/mysqld  Ver 5.1.73-log for redhat-linux-gnu on x86_64 (Source distribution)

I'll attach the my.cnf in a moment

my.cnf for master mysqld at version 5.1.73

Attachment: my.cnf.tdx-db02.txt (text/plain), 4.12 KiB.

Thank you for the feedback.

I still cannot repeat described behavior. Please check your machine's memory for errors and if it is fine, enable core file and send us core file after crash together with mysqld binary.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".