MySQL Bugs: #33184: Falcon regression tests fail on Mac OS X

Bug #33184	Falcon regression tests fail on Mac OS X
Submitted:	12 Dec 2007 19:49	Modified:	22 Feb 2008 20:53
Reporter:	Giuseppe Maxia	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0.5	OS:	MacOS (Intel)
Assigned to:		CPU Architecture:	Any
Tags:	F_TEST, falcon, regression, suite

Description:
ChangeSet@1.2748, 2007-12-07 15:11:49-07:00

Tests failing on Mac OS X 10.4

falcon.falcon_bug_22167 
falcon.falcon_bug_26802 
falcon.falcon_bug_28046 
falcon.falcon_bug_28049 
falcon.falcon_bug_28158 
falcon.falcon_bug_28165 
falcon.falcon_bugs

Tests failing on Mac OS X 10.5
falcon.falcon_bug_22167 
falcon.falcon_bug_26802 
falcon.falcon_bug_28048 
falcon.falcon_bug_28049 
falcon.falcon_bug_28158 
falcon.falcon_bug_28500 
falcon.falcon_bugs

How to repeat:
Run the test suite
./mtr --suite=falcon --force

Verified failures with latest mysql-6.0 from bk. Change set is 1.2754, 2007-12-11 17:44:21+01:00, df@pippilotta.erinye.com +1 -0. Failing tests are:

  falcon_bug_22167
  falcon_bug_26802
  falcon_bug_28046
  falcon_bug_28049
  falcon_bug_28158
  falcon_bug_28165
  falcon_bugs

I used BUILD/compile-mac for building, which looks like:
#! /bin/sh

path=`dirname $0`
. "$path/SETUP.sh"

extra_flags=" $fast_cflags $max_cflags -g"
extra_configs=" $max_configs"

. "$path/FINISH.sh"

# based on BUILD/compile-pentium-max

Running the Falcon test suite with latest mysql-6.0-falcon-team tree shows that following tests are failing with BUILD/compile-pentium-max build script:

  falcon_bug_22167
  falcon_bug_26802
  falcon_bug_28046
  falcon_bug_28049
  falcon_bug_28158
  falcon_bug_28165
  falcon_bugs

Most tests are failing when trying to drop a table:
DROP TABLE t1' failed: 2013: Lost connection to MySQL server during query

Running the Falcon test suite with a debug build shows no tests failing:

falcon.rowid_order_falcon      [ pass ]            177
-------------------------------------------------------
Stopping All Servers
All 145 tests were successful.
The servers were restarted 5 times
Spent 810.606 of 851 seconds executing testcases

Tested again on MySQL 6.0.5 (myql-6.0 tree, ChangeSet@1.2497.7.21)

Still failing tests on a server compiled without debugging options:

falcon.falcon_bug_22167 
falcon.falcon_bug_26802 
falcon.falcon_bug_28046 
falcon.falcon_bug_28049 
falcon.falcon_bug_28158 
falcon.falcon_bug_28165 
falcon.falcon_bugs

Verified with latest mysql-6.0-falcon-team tree on Mac OS X/Intel. I build with
   BUILD/compile-pentium-max

and ran the test suite like
    ./mysql-test-run.pl --force --suite=falcon 

Failing tests are:

falcon.falcon_bug_22167
falcon.falcon_bug_26802
falcon.falcon_bug_28046
falcon.falcon_bug_28049
falcon.falcon_bug_28158
falcon.falcon_bug_28165
falcon.falcon_bugs

I see crashes around DROP TABLE.

The assertion we get is:

(gdb) f 4
#4  0x002a1df1 in SyncObject::~SyncObject (this=0x70408bc) at SyncObject.cpp:125
125             ASSERT(lockState == 0);
(gdb) p lockState
$1 = -1341785548

(gdb) bt
#0  0x9003d66c in kill ()
#1  0x9010e8cf in raise ()
#2  0x002e6eb1 in Error::error (string=0x62b358 "assertion failed at line %d in 
file %s\n") at Error.cpp:92
#3  0x002e6f38 in Error::assertionFailed (fileName=0x623d64 "SyncObject.cpp", li
ne=125) at Error.cpp:76
#4  0x002a1df1 in SyncObject::~SyncObject (this=0x70408bc) at SyncObject.cpp:125
#5  0x002a7021 in Table::~Table (this=0x7040760) at Table.cpp:150
#6  0x002d47c8 in Database::dropTable (this=0x6505648, table=0x7040760, transact
ion=0x0) at Database.cpp:1420
#7  0x0035c7ed in Nfs::Statement::executeDDL (this=0x700a740) at Statement.cpp:1
135
#8  0x0035cd68 in Nfs::Statement::execute (this=0x700a740, sqlString=0xb005fc80 
"drop table TEST.\"T1\"", isQuery=false) at Statement.cpp:2696
#9  0x0035cdfe in Nfs::Statement::execute (this=0x700a740, sqlString=0xb005fc80 
"drop table TEST.\"T1\"") at Statement.cpp:161
#10 0x0029736c in StorageDatabase::deleteTable (this=0x7008108, storageConnectio
n=0x703f460, tableShare=0x703ac60) at StorageDatabase.cpp:488
#11 0x0029cf95 in StorageTableShare::deleteTable (this=0x703ac60, storageConnect
ion=0x703f460) at StorageTableShare.cpp:138
#12 0x0029bd55 in StorageTable::deleteTable (this=0x6666978) at StorageTable.cpp
:81
#13 0x0028b16a in StorageInterface::delete_table (this=0x7871298, tableName=0xb0
060c60 "./test/t1") at ha_falcon.cpp:949
#14 0x001b064d in ha_delete_table (thd=0x781c400, table_type=0x6402380, path=0xb
0060c60 "./test/t1", db=0x7871290 "test", alias=0x7871078 "t1", generate_warning
=true) at handler.cc:1484
#15 0x001c32e0 in mysql_rm_table_part2 (thd=0x781c400, tables=0x78710a0, if_exis
ts=false, drop_temporary=false, drop_view=false, dont_log_query=false) at sql_ta
ble.cc:1662
#16 0x001c36d9 in mysql_rm_table (thd=0x781c400, tables=0x78710a0, if_exists=0 '
\0', drop_temporary=0 '\0') at sql_table.cc:1457
#17 0x000c593f in mysql_execute_command (thd=0x781c400) at sql_parse.cc:2876
#18 0x000cc49a in mysql_parse (thd=0x781c400, inBuf=0x7871010 "DROP TABLE t1", l
ength=13, found_semicolon=0xb0061e9c) at sql_parse.cc:5410
#19 0x000ccb20 in dispatch_command (command=COM_QUERY, thd=0x781c400, packet=0x6
1f7001 "DROP TABLE t1", packet_length=13) at sql_parse.cc:921
#20 0x000cde6e in do_command (thd=0x781c400) at sql_parse.cc:697
#21 0x000bc28b in handle_one_connection (arg=0x781c400) at sql_connect.cc:1146
#22 0x90024227 in _pthread_body ()

Bug#33184 occurs regularly on the MAC OS with non-debug code (with the current build).  So it is probably timing dependent.  During a DropTable, Table::~Table is called with a shared lock still active on Table::syncUpdate.  This is the Sync that is sent into Table::checkUniqueIndexes() with the expectation that it would be unlocked before returning.  I think this is a risky approach because the syncObject is not released in the same routine that it was locked.

I was looking for a pathway that could lead to one of these sync objects being left unlocked (unsuccessfully) when I realized that none of them are ever locked exclusively.  What good is a syncObject that is only locked shared?  I think we should just delete it. 

This SyncObject does not do anything now, so if it is deleted, this bug should go away.  But it was originally added to ensure that the search for a duplicate value is done atomically.  This was needed to avoid the theoretical problem where two transactions insert the same unique index value at the same time. Currently, updates and inserts operate in this order;
1)check for duplicates.
2)insert indexes
3)insert record

If two concurrent transactions with the same key value do step 1 before each has reached step 2, then they will not see each other's duplicate key value.  The result would be duplicate values even in a unique index.  The syncUpdate lock was originally designed to prevent this, by holding the lock until steps 2 and 3 were done.  But when Jim implemented it, the exclusive lock caused a major performance degradation.  So it was backed up to a shared lock.  

This may be the source of the duplicate primary keys we have seen periodically. (see Bug#33498).  Or there may be some other synchronization mechanism preventing it under most circumstances.

Here is a possible solution suggested by Jim from an email on Tue 10/2/2007;
>I don't think it is necessary to make record insertion and index update
>atomic.  I'm also quite sure that making record insertion, index update,
>and unique checking atomic is difficult and expensive because any lock
>must be released before waiting for a pending transaction, necessitating
>a restart of the full check operation.  Note that this, in fact, makes
>the insert and check operations non-atomic.
>
>But think about it a little more.  If we do the check after the
>insertion and index update (either order), it works and it works without
>locks. The analysis is simple:  Consider two parallel transactions
>trying to store duplicate values.  Transaction A does its insert and
>index update then checks for duplicates.  If transaction B hasn't
>finished either his insert or index update, transaction A won't see the
>update.  But for A not to see B, A must have completed its insert and
>index update, so A will always be visible to B when B executes it unique
>checks.  QED.  No locks, no atomicity, and it works.

I suggest that this bug be fixed by deleting syncUpdate and the possible duplicate unique key issue be handled by another bug.  Possibly 33498, even though that one seems hard to reproduce.

After removing the SyncObject in Table.cpp around line 3042 I don't see any regression anymore.

        //Sync sync(&syncUpdate, "Table::update(4)");

        if (indexes)
            {
            //do
            //  sync.lock(ATOMIC_UPDATE);
            //while (!checkUniqueIndexes(transaction, record, &sync));
            while (!checkUniqueIndexes(transaction, record, NULL));

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon.falcon_bug_33184        [ pass ]             84
falcon.falcon_bugs             [ pass ]          17448
falcon.falcon_bug_28165        [ pass ]             88
falcon.falcon_bug_28158        [ pass ]             81
falcon.falcon_bug_28049        [ pass ]            512
falcon.falcon_bug_28046        [ pass ]             45
falcon.falcon_bug_26802        [ pass ]             40
falcon.falcon_bug_22167        [ pass ]             40
-------------------------------------------------------
Stopping All Servers
All 8 tests were successful.
The servers were restarted 1 times
Spent 18.338 of 21 seconds executing testcases

The whole falcon test suite passes now on Mac OS X/Intel

This regression turned out to be compiler bug in the exception handling of Apple's gcc.

hakan@au0012:~$ gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367)
Copyright (C) 2005 Free Software Foundation, Inc.

After installing gcc 4.2.2 from fink, the Falcon test suite runs fine.

falcon.falcon_ts               [ pass ]            152
falcon.index_merge_falcon      [ disabled ]  Bug#32696 index_merge_falcon.test fails
falcon.ps_8falcon              [ disabled ]  Bug#32683  Test ps_8falcon fails in 6.0
falcon.read_many_rows_falcon   [ disabled ]  Bug#23783 2006-10-30 ML Get pushbuild green
falcon.rowid_order_falcon      [ pass ]            227
falcon.rpl_bug_33688           [ pass ]            218
-------------------------------------------------------
Stopping All Servers
All 165 tests were successful.
The servers were restarted 10 times
Spent 359.264 of 411 seconds executing testcases