Bug #41733 Start of replication after start of MySQL (InnoDB) often results in crash.
Submitted: 24 Dec 2008 13:03 Modified: 23 Mar 2009 15:19
Reporter: Ben Clewett
Status: Closed
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 5.0.67  OS: Linux (SUSE 11.0 + Linux 2.6.25.18)
Assigned to:  CPU Architecture: Any
Tags: crash, innodb, replication

[24 Dec 2008 13:03] Ben Clewett
Description:
With the setup described, the MySQL InnoDB engine crashes after resumption of replication following a MySQL start roughly 45% of the time (see the tests below).

Symptoms
========

Start MySQL on a server replicating at full capacity.
After about 1 second, or sometimes at about the 10-second mark, InnoDB crashes without a usable stack trace.
Where observed, the second start always succeeded.
Partial replication was used in these examples; however, full replication was observed to give the same results.
This was never observed on the previously installed version (5.0.41) with the same setup, although significant triggers were added in the same period.
Observed on 12 servers with heavy replication and on version 5.0.67.  These are all our servers running this setup.
No crash ever resulted in a corrupt system, nor did replication consistently fail to start.
Once successfully started, no crash was ever observed again until the next restart.

Versions
========

MySQL:     mysql-5.0.67-linux-x86_64-glibc23  (binary image)
Linux:     Linux hp-tm-12 2.6.25.18-0.2-default #1 SMP 2008-10-21 16:30:26 +0200 x86_64 x86_64 x86_64 GNU/Linux
SUSE:      11.0
Server:    HP ProLiant DL380 G5 with RAID 5 & 6 through HP P800 & P400 RAID controllers, 12GB memory.

Sample Log File
===============

081224  8:49:01  InnoDB: Started; log sequence number 21 380453471
081224  8:49:01 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.0.67-log'  socket: '/dbms-07-12/mysql.sock'  port: 3306  MySQL Community Server (GPL)
081224  8:49:01 [Note] Slave SQL thread initialized, starting replication in log 'dbms-07-02.000684' at position 219137025, relay log '/dbms-07-12/logs/mysqld-relay-bin.000178' position: 847044
081224  8:49:01 [Note] Slave I/O thread: connected to master 'slave@dbms-07-02:3306',  replication started in log 'dbms-07-02.000684' at position 219137025
081224  8:49:03 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=268435456
read_buffer_size=20971520
max_used_connections=2
max_connections=500
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 20742144 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0xf00c9c0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x44dfe0a8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
(nil)
New value of fp=0xf00c9c0 failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0xf05c5bf = BEGIN
thd->thread_id=2
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

This will be followed by:

InnoDB: Log scan progressed past the checkpoint lsn 21 380453471
081224  8:51:09  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 21 380466616
081224  8:51:09  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 219147076, file name dbms-07-02.000684
InnoDB: Last MySQL binlog file position 0 10171, file name ../logs/dbms-07-12.000052
081224  8:51:11  InnoDB: Started; log sequence number 21 380466616
081224  8:51:11 [Note] Recovering after a crash using ../logs/dbms-07-12
081224  8:51:11 [Note] Starting crash recovery...
081224  8:51:11 [Note] Crash recovery finished.
081224  8:51:11 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.0.67-log'  socket: '/dbms-07-12/mysql.sock'  port: 3306  MySQL Community Server (GPL)
081224  8:51:11 [Note] Slave SQL thread initialized, starting replication in log 'dbms-07-02.000684' at position 219147103, relay log '/dbms-07-12/logs/mysqld-relay-bin.000180' position: 10314
081224  8:51:11 [Note] Slave I/O thread: connected to master 'slave@dbms-07-02:3306',  replication started in log 'dbms-07-02.000684' at position 220515873

Defaults File
=============

[mysqld]
port                            = 3306
socket                          = /dbms-07-12/mysql.sock
datadir                         = /dbms-07-12/var
tmpdir                          = /dbms-07-12/tmp
default-storage-engine          = 'InnoDB'
server-id                       = 102
skip-external-locking

# Replication Directives
replicate-wild-do-table         = Web_Members.%
replicate-wild-do-table         = par_4.%
replicate-wild-do-table         = par_5.%
# skip-slave-start
relay-log                       = /dbms-07-12/logs/mysqld-relay-bin

# MyIsam and MySql buffers
key_buffer                      = 256M
max_allowed_packet              = 10M
table_cache                     = 256
sort_buffer_size                = 20M
read_buffer_size                = 20M
read_rnd_buffer_size            = 4M
myisam_sort_buffer_size         = 8M
thread_cache                    = 8
tmp_table_size                  = 128M
thread-cache                    = 8

# Query Cache
query_cache_size                = 256M
query_cache_limit               = 1M
query_cache_type                = 1

# Try number of CPU's*2 for thread_concurrency
thread_concurrency              = 8
max_connections                 = 500

# Logs
log-bin                         = ../logs/dbms-07-12
log-error                       = /dbms-07-12/logs/error.log
log-slow-queries                = /dbms-07-12/logs/log.slow
log-queries-not-using-indexes
long-query-time                 = 1
# log                           = /dbms-09/logs/general.log
log-slave-updates

# MySql function
default-time-zone               = 'UTC'
sql-mode                        = 'TRADITIONAL'
skip-external-locking

# InnoDB
innodb_data_home_dir            = /dbms-07-12/var
innodb_data_file_path           = ibdata1:10M:autoextend
innodb_log_group_home_dir       = /dbms-07-12/var
innodb_log_arch_dir             = /dbms-07-12/var
innodb_buffer_pool_size         = 5300M
innodb_additional_mem_pool_size = 40M
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size            = 128M
innodb_log_buffer_size          = 8M
innodb_flush_log_at_trx_commit  = 1
innodb_lock_wait_timeout        = 50
innodb_log_files_in_group       = 2
innodb_status_file
innodb_thread_concurrency       = 8

How to repeat:
Test to Reproduce
=================

1. Kill MySQL.
2. Restart MySQL.
3. Monitor MySQL to see whether it crashes, counting the number of restarts required (1 = normal start).
4. A crash occurs on the first start about 45% of the time.  (A sketch of this test loop follows below.)

- Replication at approx 700 DML/sec.
- MySQL restarted by LinuxHA between 1 and 60 seconds after the kill.
- Size of ibdata1: 58G; ib_logfile[12] = 128M.
- Kill at least 120 seconds after the previous successful start, where 'Seconds_Behind_Master' = 0.

(Although in these tests MySQL was killed after only a few minutes, the crash has also been observed after uptimes of many days.)

The following 56 tests show the number of MySQL restarts required to reach stable running without a crash:

Test #1: Using 'kill' and without 'skip-slave-start' directive:

	1,2,2,2,1,2,2,1,1,2,2,1,1,1
	=> 50% failed to start first time.
	=> 0% failed to start second time.

Test #2: Using 'kill -9' and without 'skip-slave-start' directive:

	2,1,2,1,1,1,1,2,1,2,1,1,2,1
	=> 36% failed to start first time.
	=> 0% failed to start second time.

For the next tests, 'skip-slave-start' was enabled in the defaults file; the slave was started manually 60 s after the MySQL start.

Test #3: Using 'kill' and with 'skip-slave-start' directive:

	1,2,1,1,1,1,1,2,1,1,2,2,2,1
	=> 36% failed to start first time.
	=> 0% failed to start second time.

Test #4: Using 'kill -9' and with 'skip-slave-start' directive:

	2,1,1,1,1,1,2,2,2,2,2,2,1,2
	=> 57% failed to start first time.
	=> 0% failed to start second time.

Summary:

	=> 45% failed to start first time (25 of 56 tests).
	=> 0% failed to start second time.

Conclusion
==========

It can be seen that starting MySQL with the described set-up results in a crash about 45% of the time, but only after the slave starts replicating.  However, MySQL never fails after the crash recovery from this initial error.  So a start following a 'kill -9' or a clean exit leaves MySQL unsafe, whereas a crash recovery following this problem is always safe.

Although the probability of this is high, the risk is low, since the impact is minimal provided MySQL is restarted.  No data loss or inconsistency in replication was ever observed; however, the volume of replicated data is such that individual row checking was not possible.

This is probably due to a change between versions 5.0.41 and 5.0.67, although significant triggers were added to the test database at about the same time.
[24 Dec 2008 20:09] MySQL Verification Team
Hi Ben!

o) did you run mysql_upgrade after doing the upgrade?
o) do you replicate many temporary tables?
o) please try to get a core file so we can see a stack trace
[29 Dec 2008 11:16] Ben Clewett
> Hi Ben!
>
> 1) did you run mysql_upgrade after doing the upgrade?
> 2) do you replicate many temporary tables?
> 3) please try to get a core file so we can see a stack trace

1. The server was created from an SQL dump file, therefore it was never at any other version.  Other servers which were upgraded from 5.0.n show the same problem.

2. You mean temporary (CREATE TEMPORARY TABLE) tables?  By default these are replicated, and about 10,000 are created/destroyed a day.  I do not know of a method of accurately recording this without programming (a crude counting method is sketched below).  If there is a simple option to stop replicating these tables, I would like to know.

3. I would expect the file 'core' in the process CWD?   I cannot find either a 'core' file or anything unusual in the CWD, sorry!

Regards, Ben.
[12 Jan 2009 10:47] MySQL Verification Team
Hi Ben!
To get a core file edit my.cnf and restart the mysqld process:

[mysqld]
core-file

[mysqld_safe]
core-file-size=unlimited

Perhaps some OS-specific configuration needs to be done also.  The core file will usually be in the datadir of MySQL.  When the core file is created, can you get a stack trace:

gdb /path/to/mysqld --core /path/to/core
bt

thanks!
[12 Jan 2009 17:26] Ben Clewett
I am very sorry but I still cannot get you a core file.  

The directives you requested were honoured by MySQL, although I cannot see them in 'SHOW GLOBAL VARIABLES'.  MySQL still crashes after replication starts, but no core file is found. (find / -name "core*" -ls)

I also modified my start script to use: "ulimit -c unlimited"

I have tested core file generation from other broken applications, which works.

I can even see at the bottom of the error.log:

> Trying to get some variables.
> Some pointers may be invalid and cause the dump to abort...
> thd->query at 0x7f3e2000200f  is invalid pointer
> thd->thread_id=2
> The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
> information that should help you find out what is causing the crash.
> Writing a core file

However I can't find a core file.  Please let me know if there is anything else I can try...
[12 Jan 2009 18:04] MySQL Verification Team
Hi Ben!  Since you use SUSE, can you check this page about creating core files:
http://www.novell.com/coolsolutions/feature/16257.html

It appears that mysqld is configured properly already.
[13 Jan 2009 8:46] Ben Clewett
Thanks for the SUSE URL.  I worked through it, including the test it gives to make sure a core file is generated from another application, which works.  Still no luck with MySQL.

I found a bug in earlier versions of MySQL, where it cannot generate a core because of the 'setuid' on startup, but this seems to be fixed in my version.

I'll keep trying...

Ben
[26 Jan 2009 14:40] Susanne Ebrecht
Ben,

without a core file we are not really able to help you.  Because you said that you will keep trying, I will set this back to "need feedback".

Please take your time.  There is an automatic process: if nothing happens within one month, this will be set to "no feedback".  Please feel free to re-open the bug report then.
[26 Jan 2009 15:09] Ben Clewett
Susanne,

I believe 5.0.67 may be affected by a bug where I must start MySQL as user 'mysql' in order to generate core files.

However, I am now completely unable to make MySQL crash.  Whatever pattern of load/data in my replication tree triggered it must have changed, which is surprising, as nothing I am aware of has changed.

I will keep monitoring and trying to get this core file.

Regards,  Ben.
[27 Jan 2009 10:15] Susanne Ebrecht
Ben,

many thanks for your feedback.  I will check your hint that the core file is only written when mysqld runs as user mysql.
[28 Jan 2009 8:17] Ben Clewett
Finally I have got the stack trace:

#0  0x00007fac906f5ce6 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00000000005a6729 in handle_segfault (sig=11) at mysqld.cc:2375
#2  <signal handler called>
#3  0x00000000005e375a in lock_tables (thd=0x369c5f0, tables=0x412cc4a0, count=1, need_reopen=0x36a3950) at table.h:731
#4  0x00000000005e31d7 in simple_open_n_lock_tables (thd=0x369c5f0, tables=0x412cc4a0) at sql_base.cc:3022
#5  0x00000000006c228a in my_tz_find_with_opening_tz_tables (thd=0x369c5f0, name=0x412ccfd0) at tztime.cc:2367
#6  0x000000000062df76 in Query_log_event::exec_event (this=0x36ad190, rli=0x368a300, query_arg=0x36951af "BEGIN",
    q_len_arg=57263600) at log_event.cc:2003
#7  0x00000000006a9b6c in exec_relay_log_event (thd=0x369c5f0, rli=0x368a300) at slave.cc:3420
#8  0x00000000006a7994 in handle_slave_sql (arg=0x368af10) at slave.cc:4030
#9  0x00007fac906f1040 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fac8fdb90cd in clone () from /lib64/libc.so.6

Anything more I get, I'll pass your way.

With great hope this is useful to you,

Ben
[28 Jan 2009 8:33] MySQL Verification Team
looks like bug #21651, which was closed as "can't repeat"
[29 Jan 2009 8:08] Ben Clewett
I have a second core file today, which is much the same as the first:

#0  0x00007fe1b63c1ce6 in pthread_kill () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fe1b63c1ce6 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00000000005a6729 in handle_segfault (sig=11) at mysqld.cc:2375
#2  <signal handler called>
#3  0x00000000005e375a in lock_tables (thd=0x369c600, tables=0x408d94a0, count=1, need_reopen=0x36a3978) at table.h:731
#4  0x00000000005e31d7 in simple_open_n_lock_tables (thd=0x369c600, tables=0x408d94a0) at sql_base.cc:3022
#5  0x00000000006c228a in my_tz_find_with_opening_tz_tables (thd=0x369c600, name=0x408d9fd0) at tztime.cc:2367
#6  0x000000000062df76 in Query_log_event::exec_event (this=0x36b2a10, rli=0x368a310, query_arg=0x36959bf "BEGIN",
    q_len_arg=57263616) at log_event.cc:2003
#7  0x00000000006a9b6c in exec_relay_log_event (thd=0x369c600, rli=0x368a310) at slave.cc:3420
#8  0x00000000006a7994 in handle_slave_sql (arg=0x368af20) at slave.cc:4030
#9  0x00007fe1b63bd040 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fe1b5a850cd in clone () from /lib64/libc.so.6

Regards,
[29 Jan 2009 12:26] Susanne Ebrecht
Ben,

sorry that I have to ask you for additional tests.

We had some threading problems in 5.0 that were fixed in 5.1.

Would you please try MySQL 5.1 here?  The current version is 5.1.30.
[29 Jan 2009 13:22] Ben Clewett
Not easy: my replication tree contains 22 MySQL processes, all on 5.0.67, and an upgrade is not planned.  Much beta testing of our code is needed before any rollout.

But I can replicate from 5.0.67 to 5.1.30 and see whether the error is still present.  Best I can do.  It will take a day or two...
[30 Jan 2009 11:56] Susanne Ebrecht
Ben,

the slave is crashing, so it would be enough if the slave is 5.1.

Many thanks in advance.
[26 Feb 2009 12:47] Ben Clewett
As per request:

The bug appears when we replicate from 5.0.67 to 5.0.67.

However, when we replicate from 5.0.67 to 5.1.30, this bug does not appear; the replication appears completely stable.  Therefore we will upgrade.

Regards,  Ben.
[23 Mar 2009 15:19] Susanne Ebrecht
Ben,

then I am sure that your problem is fixed in 5.1 by the fixes for some threading-related bugs.

I will close this bug because it is fixed in version 5.1.

If you get problems again using 5.1, feel free to reopen this bug report.
[12 Oct 2010 8:52] MySQL Verification Team
may have been fixed along with bug #9953