Bug #47134 Crash on startup when XA support functions activated by a second engine
Submitted: 4 Sep 2009 12:32 Modified: 4 Mar 2013 18:52
Reporter: Paul McCullagh (Basic Quality Contributor) (OCA) Email Updates:
Status: Closed Impact on me:
Category:MySQL Server: Storage Engine API Severity:S1 (Critical)
Version:5.1.35+ OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: crash, pbxt, qc, startup, tc.log, xa

[4 Sep 2009 12:32] Paul McCullagh
When the XA support functions are enabled by a second engine (i.e. an engine in addition to InnoDB) MySQL crashes on startup:

#0	0x001d7607 in TC_LOG_MMAP::recover at log.cc:5282
#1	0x001d7d0e in TC_LOG_MMAP::open at log.cc:4965
#2	0x000fb658 in init_server_components at mysqld.cc:4020
#3	0x00100178 in main at mysqld.cc:4354

line:5282 if (*x && my_hash_insert(&xids, (uchar *)x))

Before this crash, the following assertion fails:

Line 1579, handler.cc:

DBUG_ASSERT(total_ha_2pc == (ulong) opt_bin_log+1); // only InnoDB and binlog

After disabling the assertion, the server crashes in startup as described above.

After removing the "tc.log" file MySQL will startup again without a crash.

How to repeat:
1. Install and build MySQL 5.1.35 or later.

$ cd /home/build/mysql-5.1.35
$ ./configure --prefix=/home/build/test --with-debug=full --with-plugins=max-no-ndb
$ make install

2. Checkout the following PBXT branch:

$ cd ~/repositories/pbxt
$ bzr branch lp:~paul-mccullagh/pbxt/xa-crash

3. Build PBXT for the MySQL version

$ cd ~/repositories/pbxt/xa-crash
$ ./configure --with-mysql=/home/build/mysql-5.1.35
$ make install

4. Run MySQL

At this point you should hit the failed assertion.

After commenting this out, and rebuilding the server should crash on startup as described.

Suggested fix:
I have not investigated the problem enough to be able to tell if this bug means that XA support for engines other than InnoDB has not been implemented, or if it simply has not been tested.

Some feedback in this regard would be welcome.
[4 Sep 2009 13:33] Paul McCullagh
I have found the immediate source of the crash. In the function TC_LOG_MMAP::open() we have:

Line: 4957, file: log.cc:

    pg->end=(my_xid *)(pg->start + tc_log_page_size);

The type of pg->start is (ulonglong *). The code is assuming the type is (char *). So it is adding 8*tc_log_page_size to pg->start instead of 1*tc_log_page_size.

This makes me doubt the reliability of this code in general.
[4 Sep 2009 18:22] MySQL Verification Team
miguel@luz:~/dbs$ 5.1/libexec/mysqld
090904 15:20:27 [Note] Plugin 'FEDERATED' is disabled.
090904 15:20:27  InnoDB: Started; log sequence number 0 44233
090904 15:20:27 [Note] PrimeBase XT (PBXT) Engine 1.0.08d RC loaded...
090904 15:20:27 [Note] Paul McCullagh, PrimeBase Technologies GmbH, http://www.primebase.org
mysqld: handler.cc:1586: int ha_recover(HASH*): Assertion `total_ha_2pc == (ulong) opt_bin_log+1' failed.
090904 15:20:27 - mysqld got signal 6 ;
[4 Sep 2009 18:39] MySQL Verification Team
Thank you for the bug report. Verified against the last tree source 5.1.
[7 Oct 2009 2:14] Elena Stepanova
I am also getting a crash on 6.0.9; it disappears if i switch off pbxt_support_xa, so I assume it is a related problem but it happens not on startup but after some queries, but has somewhat different stack trace, so I thought it might be worth mentioning here as well.

Stack trace:

#0  0x00002b492993eea3 in pthread_kill () from /lib64/libpthread.so.0
#1  0x000000000065ceec in handle_segfault (sig=11) at mysqld.cc:2671
#2  <signal handler called>
#3  0x00002b492993e51a in pthread_cond_signal@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#4  0x000000000070679c in TC_LOG_MMAP::sync (this=0xf11aa0) at log.cc:7062
#5  0x000000000070d080 in TC_LOG_MMAP::log_xid (this=0xf11aa0,
    thd=<value optimized out>, xid=<value optimized out>) at log.cc:7031
#6  0x000000000075d911 in ha_commit_trans (thd=0x1fa03e0,
    all=<value optimized out>) at handler.cc:1172
#7  0x00000000007f7f5c in trans_commit (thd=0x1fa03e0) at transaction.cc:132
#8  0x000000000066e000 in mysql_execute_command (thd=0x1fa03e0)
    at sql_parse.cc:4026
#9  0x0000000000671d07 in mysql_parse (thd=0x1fa03e0,
    inBuf=0x20201f8 "commit", length=6, found_semicolon=0x4c89a100)
    at sql_parse.cc:5735
#10 0x0000000000672606 in dispatch_command (command=COM_QUERY, thd=0x1fa03e0,
    packet=0x2010bd1 "commit", packet_length=<value optimized out>)
    at sql_parse.cc:1007
#11 0x00000000006730d2 in do_command (thd=0x1fa03e0) at sql_parse.cc:690
#12 0x0000000000666284 in handle_one_connection (arg=0x1fa03e0)
    at sql_connect.cc:1145
#13 0x00002b492993a143 in start_thread () from /lib64/libpthread.so.0
#14 0x00002b492a0738cd in clone () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()

# Test case:
# (MySQL server 6.0.9, PBXT 1.0.09)

drop database if exists pbxt_innodb;
create database pbxt_innodb;
use pbxt_innodb;
create table t_pbxt ( i int not null auto_increment, primary key (i), f varchar (128) ) engine = pbxt;
create table t_inno ( i int not null auto_increment, primary key (i), f varchar (128) ) engine = innodb;
start transaction;
insert into t_pbxt (f) values ( 'abc'), ('def' );
insert into t_inno (f) values ( 'abc'), ('def' );
[17 May 2010 16:25] Mark Callaghan
I have no problem reproducing this:
1) start mysqld with InnoDB and PBXT enabled
2) kill -9 mysqld

On recovery the server crashes again and again until I remove var/tc.log. In many cases it is probably a bad idea to remove that file.

100517 09:18:26 [Note] The server was not shutdown correctly, recovery required
100517  9:18:26 [Note] Recovering after a crash using tc.log
100517  9:18:26 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 6409238 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x40000
/data/51s/libexec/mysqld(my_print_stacktrace+0x32) [0x9ad62a]
/data/51s/libexec/mysqld(handle_segfault+0x28a) [0x5e9db8]
/lib64/libpthread.so.0 [0x328f80de70]
/data/51s/libexec/mysqld(TC_LOG_MMAP::recover()+0x180) [0x6a7030]
/data/51s/libexec/mysqld(TC_LOG_MMAP::open(char const*)+0x4a6) [0x6a7592]
/data/51s/libexec/mysqld [0x5ebe08]
/data/51s/libexec/mysqld(main+0x23d) [0x5ec62d]
/lib64/libc.so.6(__libc_start_main+0xf4) [0x328ec1d8b4]
/data/51s/libexec/mysqld [0x510089]
[26 Jun 2011 17:38] Valeriy Kravchuk
See bug #32926 also (looks like the same problem but reported earlier for 6.0.x).
[21 Aug 2012 17:35] Mark Callaghan
Rich Prohaska from Tokutek posted this:

TokuDB is a MySQL storage engine that implements XA.  Unfortunately,
MySQL 5.6.6-m9 (and prior versions) do not support multiple XA storage
engines, as described in MySQL bug #32926.  This bug report described
how to crash mysqld with the falcon and innodb storage engines.
Another way to reproduce this problem is to add a fake prepare method
to the example storage engine, install the example plugin, and run
MySQL with the binlog disabled.

It would be nice (for us) if this problem was fixed sometime prior to
the 5.6.6 GA.

We made some changes to the 5.6.6-m9 code base to support TokuDB as
another XA storage engine.

diff -r mysql-5.6.6-m9/sql/handler.cc mysql-5.6.6-m9-patched/sql/handler.cc
<   DBUG_ASSERT(total_ha_2pc == (ulong) opt_bin_log+1); // only InnoDB
and binlog
>   // DBUG_ASSERT(total_ha_2pc == (ulong) opt_bin_log+1); // only InnoDB and binlog
diff -r mysql-5.6.6-m9/sql/log.cc mysql-5.6.6-m9-patched/sql/log.cc
<     pg->end=(my_xid *)(pg->start + tc_log_page_size);
>     pg->end=(my_xid *)(data + (i+1)*tc_log_page_size);
>     pg->ptr=pg->start;
<   mysql_cond_signal(&active->cond);        // wake up a new syncer
>   if (active)
>       mysql_cond_signal(&active->cond);        // wake up a new syncer
diff -r mysql-5.6.6-m9/storage/example/ha_example.cc
> static int example_prepare(handlerton *hton, THD *thd, bool all)
> {
>   DBUG_ENTER("example_prepare");
> }
>   example_hton->prepare= example_prepare;
[29 Dec 2012 21:26] Erlend Dahl
Marked Bug#32926 "lack of suppport for multiple two-phase-commit storage engines" as a duplicate, since the work is being carried out in the context of Bug#47134
[4 Mar 2013 18:52] Paul DuBois
Noted in 5.7.1 changelog.

If XA support was activated by multiple storage engines, the server would exit.
[6 Nov 2013 12:22] Laurynas Biveinis
mysql-server$ bzr log -r 5201.2.1
revno: 5201.2.1
committer: Dmitry Shulga <Dmitry.Shulga@oracle.com>
branch nick: mysql-trunk-bug11755370
timestamp: Fri 2013-03-01 19:02:41 +0600
  This patch fixes bug#11755370 - CRASH ON STARTUP WHEN XA SUPPORT FUNCTIONS
                                  ACTIVATED BY A SECOND ENGINE.
    The reason for server crash was that there was a bug in pointer arithmetics
  when server calculated begin/end values of the memory pages mmaped to the file.
  Beside this bug reason there were also a bug related to multi-threaded
  programming. There were a race condition when several threads concurrently
  calls TC_LOG_MMAP::commit. Method commit() calls methods log()/unlog() that
  writes xid values to the mmaped file. This file is divided by pages and those
  pages is used as IO units during file synchronization. Access to those pages
  is controlled by a few mutexes. There was a bug in working with those mutexes.
  To fix the race condition bug the work with mutexes was simplified.
  Access to shared data now protected by one dedicated mutex. Formerly to protect
  access to shared data it were used several mutexes. Such approach added
  finer grained concurrency effect but introduced possible race condition.
  After analysis was done we decided to rewrite a logic regarding to
  synchronization of access to shared data in the way that increases robustness
  and decreases fine graining locking. Also there was a bug in code manipulating
  with list of pages in the pool.
  MTR test isn't added since there isn't currently any other engine with
  XA feature enabled aside from INNODB. I tested my patch using the fake second
  engine with XA feature enabled that I created from ha_example.cc. I think
  it won't be correct to modify ha_example.cc just for the sake of patch test
  coverage. Nevertheless I added unittest that runs 10 threads and calls
  TC_LOG_MMAP::commit for distributed transactions. Such test should crash
  the server without changes included in this patch. Also another two new unit
  tests were added to test TC_LOG_MMAP behavior in situations when its
  pages are close to be fully filled are fully filled.

mysql-server$ bzr log -r 5208
revno: 5208
committer: Dmitry Shulga <Dmitry.Shulga@oracle.com>
branch nick: mysql-trunk-new
timestamp: Mon 2013-03-04 22:56:49 +0600
  This is a follow-up for patch for bug#11755370 - CRASH ON STARTUP WHEN XA
  This patch fixes the compiler warnings.

mysql-server$ bzr log -r 5283
revno: 5283
committer: Dmitry Shulga <Dmitry.Shulga@oracle.com>
branch nick: mysql-trunk
timestamp: Wed 2013-03-13 22:30:51 +0600
  This is a follow-up patch for bug#11755370. It fixes compilation warnings
  reported by Sun Studio compiler.

mysql-server$ bzr log -r 5381
revno: 5381
committer: Tor Didriksen <tor.didriksen@oracle.com>
branch nick: trunk-merge
timestamp: Fri 2013-03-22 14:07:07 +0100
  Post push fix: disable msync/fsync in unit test.
  (reduces execution time from seconds to milliseconds)

mysql-server$ bzr log -r 5406
revno: 5406
committer: Tor Didriksen <tor.didriksen@oracle.com>
branch nick: trunk-review
timestamp: Tue 2013-03-26 12:43:43 +0100
  Post-push fix: remove compile warnings for sun studio.
[13 Aug 2015 7:10] Zhenye Xie
This bug still exists in mysql 5.6 branch. But already fixed in 5.7 branch.