MySQL Bugs: #22150: Falcon: both client and server hang in case of update conflict

Bug #22150	Falcon: both client and server hang in case of update conflict
Submitted:	9 Sep 2006 5:35	Modified:	19 Apr 2007 17:09
Reporter:	Georg Richter	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	5.2	OS:	Any
Assigned to:	Jim Starkey	CPU Architecture:	Any

Description:
 If I have two connections, both running the procedure
described in the "How to repeat" below,
both started at approximately the same time, then
one of the connections ends with the message
ERROR 1032 (HY000): Can't find record in 't27'
The failure always happens with "insert".

How to repeat:
[ on connection 1 ]

delimiter //

CREATE TABLE `t27` (`s1` int(11), KEY `s1` (`s1`)) ENGINE=JStar//

CREATE PROCEDURE `p27`()
begin declare v1 int default 0;
declare v2 int;
while v1 < 1000000 do
  select 'insert', v1;
  insert into t27 values (v1);
  set v2 = rand() * 10000;
  update t27 set s1 = v2 where s1 = v1;
  select ' update', v2;
  set v1 = v1 + 1;
  if v1 mod 5000 = 0 then
    select ' delete';
    delete from t27;
    end if;
  end while;
end
//

[ on both connection 1 and connection 0000002 ]

call p27()//
or
call p27();

It runs with two sessions for sometime until Falcon detects an update conflict, at which point one of the clients reports:

+---------+
| delete  |
+---------+
|  delete |
+---------+
1 row in set (2 min 51.99 sec)

ERROR 1022 (23000): Can't write; duplicate key in table 't27'

Is this the correct behavior?

This bug report shows only the first part of the original report. There where significant changes to the report, which are reflected in falcon_bugs_138.test.

Please test falcon_bugs_138.test

  ./mysql-test-run.pl --enable-disabled falcon_bug_138

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_138                 [ fail ]

Errors are (from /home/hakan/work/mysql/mysql-5.1-falcon/mysql-test/var/log/mysqltest-time) :
mysqltest: At line 46: query 'call p27()' failed: 1020: Record has changed since last read in table 't27'
(the last lines may be the most important ones)

Regards, Hakan

falcon_bug_138.test was missing continue handler for 1020 and 1213.

After updating falcon_bug_138.test it fails now with:
1022: Can't write; duplicate key in table 't1'

Please note that table t1 is
CREATE TABLE t1 (a int(11), KEY a (a));

So there is no unique key where a duplicate key error could happen.

Regards, Hakan

Change pushed to mysql-5.1-falcon to return Update Conflict instead of Duplicate Conflict.

I tried this test several times with
last ChangeSet@1.2324, 2006-10-17 .
The result was in all cases that the test hangs
a long time and gets killed by mysql-test-run.pl
because of testcase timeout.
snip of general_log.CSV:
"2006-10-18 12:51:25","root[root] @ localhost []","1","1","Connect","root@localhost on test"
"2006-10-18 12:51:25",...,"1","1","Query","SET STORAGE_ENGINE = Falcon"
"2006-10-18 12:51:25",...,"1","1","Query","DROP TABLE IF EXISTS t1"
"2006-10-18 12:51:25",...,"1","1","Query","DROP PROCEDURE IF EXISTS p1"
"2006-10-18 12:51:25",...,"2","1","Connect","root@localhost on test"
"2006-10-18 12:51:25",...,"2","1","Query","SET @@autocommit = 1"
"2006-10-18 12:51:25",...,"1","1","Query","SET @@autocommit = 1"
"2006-10-18 12:51:25",...,"1","1","Query","CREATE TABLE t1 (a int(11), KEY a (a))"
"2006-10-18 12:51:25",...,"1","1","Query","CREATE PROCEDURE p1()\nbegin declare v1 int default 0;\ndeclare v2 int;\ndeclare continue handler for 1020 begin end;\ndeclare continue handler for 1213 begin end;\nwhile v1 < 10000 do\n/* SELECT 'insert', v1; */\nINSERT INTO t1 VALUES (v1);\nSET v2 = rand() * 10000;\nUPDATE t1 SET a = v2 WHERE a = v1;\n/* SELECT ' update', v1; */\nSET v1 = v1 + 1;\nif v1 mod 50 = 0 then\n/* SELECT ' delete'; */\nDELETE FROM t1;\nend if;\nend while;\nend"
"2006-10-18 12:51:25",...,"1","1","Query","call p1()"
"2006-10-18 12:51:25",...,"2","1","Query","call p1()"
      The following lines are caused by mysql-test-run.pl
      aborting the test.
"2006-10-18 13:06:25","root[root] @ localhost []","3","1","Connect","root@localhost on "
"2006-10-18 13:06:25",...,"3","1","Query","SHOW VARIABLES LIKE 'pid_file'"
"2006-10-18 13:06:25",...,"3","1","Shutdown",""

So it looks like that the wrong error message/number
1022: Can't write; duplicate key in table 't1'
does not occur any more, because the test would abort if
we get error 1022.
I think this test is nice but unusable for automatic testing
of Falcon as long as we do not have a wait for lock timeout feature.

I am not sure how to debug this hang any further.

I have run the stored procedure from multiple MYSQL clients and the result is always the same.  If I run only one at a time it runs to completion.  If I run two or more at a time, only one of them will finish.  The others are hung, but not in Falcon.  The thread that was handling that connection is left open in vio_read(), waitng for a signal.  The hang seems to be a loss of communication between the client and the MYSQLD.  

Here is a typical call stack.

 	ntdll.dll!7c90eb94() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	ntdll.dll!7c90e9c0() 	
 	mswsock.dll!71a54033() 	
 	mswsock.dll!71a557c9() 	
 	ws2_32.dll!71ab4379() 	
 	wsock32.dll!71ad2ea3() 	
>	mysqld.exe!vio_read(st_vio * vio=0x09268130, char * buf=0x09272f50, int size=4)  Line 42 + 0x15 bytes	C
 	mysqld.exe!my_real_read(st_net * net=0x01d32200, unsigned long * complen=0x0a30feb4)  Line 808 + 0x1a bytes	C++
 	mysqld.exe!my_net_read(st_net * net=0x01d32200)  Line 982 + 0xd bytes	C++
 	mysqld.exe!do_command(THD * thd=0x01d318c0)  Line 1577 + 0x9 bytes	C++
 	mysqld.exe!handle_one_connection(void * arg=0x01d318c0)  Line 1227 + 0x9 bytes	C++
 	mysqld.exe!pthread_start(void * param=0x01cfb4f0)  Line 63 + 0x9 bytes	C
 	mysqld.exe!_callthreadstart()  Line 293 + 0xf bytes	C
 	mysqld.exe!_threadstart(void * ptd=0x09276f98)  Line 277	C
 	kernel32.dll!7c80b683() 	
================================================================================

The same is true when running falcon_bug_138.test.  One of the threads will hang, and when the other stored procedure finishes, the test case waits for the one that hung, causing the whole test to hang.  It leaves the same threads with the same call stacks.

I have set falcon_log_mask = 4095 and use --console for mysqld.exe, but no log messages are there to indicate a problem with falcon.

mysql-5.1-falcon last ChangeSet@1.2342, 2006-10-24
on Pentium M(x86-32Bit), SuSE Linux 10.1
----------------------------------------
I tried that test with slightly modified testscript
(attached t/falcon_bug_138x.test) many times in loop.
In most cases I get the already known effect that the test
hangs but there in rare cases I get a server crash during
  call p1('default').
Backtrace generated from core file:
thd=0x90ddd98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xb3400528, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x8286c67 handle_segfault + 491
0x844dec5
_ZN11sp_rcontext12find_handlerEjN11MYSQL_ERROR18enum_warning_levelE +
161
0x844e0b8
_ZN11sp_rcontext12handle_errorEjN11MYSQL_ERROR18enum_warning_levelEP3THD
+ 90
0x82862ff _Z14my_message_sqljPKci + 183
0x870f16c my_error + 408
0x838a4af _ZN7handler11print_errorEii + 1717
0x8328577 _Z12mysql_deleteP3THDP13st_table_listP4ItemP11st_sql_listyyb +
3859
0x82a4aaf _Z21mysql_execute_commandP3THD + 13349
0x8445b57 _ZN13sp_instr_stmt9exec_coreEP3THDPj + 17
0x84464ab _ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr
+ 449
0x8446b82 _ZN13sp_instr_stmt7executeEP3THDPj + 412
0x8448d57 _ZN7sp_head7executeEP3THD + 1177
0x84498de _ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE + 1686
0x82a8b62 _Z21mysql_execute_commandP3THD + 29912
0x82aaec2 _Z11mysql_parseP3THDPcj + 536
0x82ab9aa _Z16dispatch_command19enum_server_commandP3THDPcj + 2434
0x82acee4 _Z10do_commandP3THD + 598
0x82ad2dd handle_one_connection + 783
0xb7c9734b _end + -1356549285
0xb7d6765e _end + -1355696530
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and
follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x918a480 = DELETE FROM t1
thd->thread_id=1
Attempt to get the problem with running GDB:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1287869536 (LWP 24912)]
0x0844df15 in sp_rcontext::find_handler (this=0x9193c50, sql_errno=1020,
    level=WARN_LEVEL_ERROR) at sp_rcontext.cc:217
217         switch (cond->type)
(gdb) bt
#0  0x0844df15 in sp_rcontext::find_handler (this=0x9193c50, sql_errno=1020,
    level=WARN_LEVEL_ERROR) at sp_rcontext.cc:217
#1  0x0844e108 in sp_rcontext::handle_error (this=0x9193c50, sql_errno=1020,
    level=WARN_LEVEL_ERROR, thd=0x90ded98) at sp_rcontext.cc:302
#2  0x0828634f in my_message_sql (error=1020,
    str=0xb33c7948 "Record has changed since last read in table 't1'",
    MyFlags=0) at mysqld.cc:2456
#3  0x0870fcd0 in my_error (nr=1020, MyFlags=0) at my_error.c:103
#4  0x0838a4ff in handler::print_error (this=0x918bac0, error=123, errflag=0)
    at handler.cc:2237
#5  0x083285c7 in mysql_delete (thd=0x90ded98, table_list=0x9165140,
    conds=0x0, order=0x916b7e8, limit=18446744073709551587, options=0,
    reset_auto_increment=false) at sql_delete.cc:258
#6  0x082a4aff in mysql_execute_command (thd=0x90ded98) at sql_parse.cc:3504
#7  0x08445ba7 in sp_instr_stmt::exec_core (this=0x91652f0, thd=0x90ded98,
    nextp=0xb33c93d4) at sp_head.cc:2551
#8  0x084464fb in sp_lex_keeper::reset_lex_and_exec_core (this=0x9165318,
    thd=0x90ded98, nextp=0xb33c93d4, open_tables=false, instr=0x91652f0)
    at sp_head.cc:2422
#9  0x08446bd2 in sp_instr_stmt::execute (this=0x91652f0, thd=0x90ded98,
    nextp=0xb33c93d4) at sp_head.cc:2502
#10 0x08448da7 in sp_head::execute (this=0x9120f38, thd=0x90ded98)
    at sp_head.cc:1096
---Type <return> to continue, or q <return> to quit---
#11 0x0844992e in sp_head::execute_procedure (this=0x9120f38, thd=0x90ded98,
    args=0x90df2e8) at sp_head.cc:1725
#12 0x082a8bb2 in mysql_execute_command (thd=0x90ded98) at sql_parse.cc:4619
#13 0x082aaf12 in mysql_parse (thd=0x90ded98,
    inBuf=0x9192568 "call p1('default')", length=18) at sql_parse.cc:6080
#14 0x082ab9fa in dispatch_command (command=COM_QUERY, thd=0x90ded98,
    packet=0x90fbec9 " call p1('default')", packet_length=20)
    at sql_parse.cc:1828
#15 0x082acf34 in do_command (thd=0x90ded98) at sql_parse.cc:1612
#16 0x082ad32d in handle_one_connection (arg=0x90ded98) at sql_parse.cc:1227
#17 0xb7c8f34b in start_thread () from /lib/libpthread.so.0
#18 0xb7d5f65e in clone () from /lib/libc.so.6
(gdb)
I do not know if both traces above and the often hanging tests
are caused by the same problem. There are also some open Bug reports
for stored procedures which probably suffer from the same problem
like
Bug#23032 Handlers declared in a SP do not handle warnings generated
          in sub-SP  (non Falcon problem)
Bug#22852 Falcon: Packets out of order after handled error

Also probably the fix for
Bug#22185 Incorrect Error message in two concurrent transactions
(Please look at Hakans last comment:
 "Now we get
 [00:44] root@sbtest>update sbtest set pad='eee' where id=10000;
 ERROR 1020 (HY000): Record has changed since last read in table 'sbtest'"
)
might be the cause for the crash.
InnoDB runs this test without any problems.
But I do not think that our current informations allow to draw
the conclusion that the crashes are a clear Falcon bug.
The behaviour of InnoDB and Falcon around two connections trying
to insert/update/delete partially the same records differ though
both try to follow the SQL standard. That means it is thinkable
that some standard compliant but Falcon specific properties
cause that some faulty stored procedure related code is executed.

Therefore I would propose the following activities:
1. Check, if the backtraces above help to locate the reason
   for the crashes
2. If 1. fails, wait till Bug#22852 is fixed and try to reproduce the 
   crash again.

Attached files:
falcon_bug_138x.test  Slightly extended original testscript
Y.sh and Z.sh         Shell scripts which help to reproduce
                      the random crash

Shellscript for reproduction of crash

Attachment: Y.sh (application/x-shellscript, text), 1.52 KiB.

Shellscript for reproduction of crash with GDB

Attachment: Z.sh (application/x-shellscript, text), 1.57 KiB.

test script

Attachment: falcon_bug_138x.test (application/octet-stream, text), 4.12 KiB.

Expected results (in most but not in all cases valid)

Attachment: falcon_bug_138x.result (application/octet-stream, text), 3.12 KiB.

I was asked to set a breakpoint in SqlError::error() and see if there are any unhandled exceptions originating from the Falcon code.   I ran this several times using the test script as well as multiple MYSQL clients running the stored procedure.  The only SqlError::error() call that is made is the UPDATE_CONFLICT from Table::ValidateAndInsert().  This error is caught and handled correctly.  Sometimes I would get multiples of these errors before the hang would occur, sometimes none at all, but usually about once or twice before the hang.
The hang has all threads in the engine in a wait state.  Calvin reported that in his testing the client threads were waiting on the engine.

Experiments with falcon_bugs_138x, InnoDB and ~10000 loops show often the 
following effect:
Connection "default" runs '--send call p1('default')' and connection
"conn1" runs call p1('conn1').
The execution aborts with the message 
query 'call p1('conn1')' failed: 
   1213: Deadlock found when trying to get lock; try restarting transaction
I assume it is possible that a run with InnoDB might get a deadlock,
but the surprising thing is that the execution aborts.
We have a continue handler for error 1213 within the procedure.
Maybe this is caused by the fact that InnoDB does an implicit ROLLBACK of
the complete transaction and probably any further execution of the procedure
is impossible or intentionally disabled.
At least we have here a case where the continue handler does not prevent
that the test aborts.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14703

ChangeSet@1.2321, 2006-11-01 19:41:09+02:00, monty@mysql.com +28 -0
  Fixed a lot of compiler warnings (Mainly in mysqld and instance manager)
  Fixed some possible fatal wrong arguments to printf() style functions
  Initialized some not initialized variables
  Fixed bug in stored procedure and continue handlers
  (Fixes Bug#22150)

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_138                 [ pass ]         347645
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 347.645 seconds actually executing testcases

Test case falcon_bug_138.test has loop count of 1 million. We reduced the loop count to 10000 for usability with Pushbuild.

Running falcon_bug_138.test with with 1 million loops results in an assertion in Falcon:

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 1103469488 (LWP 20167)]
0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x40182541 in raise () from /lib/tls/libc.so.6
#2  0x08448ac6 in Error::debugBreak () at Error.cpp:93
#3  0x08448b23 in Error::error (
    string=0x8744678 "assertion failed at line %d in file %s\n")
    at Error.cpp:70
#4  0x08448bb3 in Error::assertionFailed (fileName=0x874a3ee "Section.cpp", 
    line=444) at Error.cpp:77
#5  0x0848b34f in Section::updateRecord (this=0x40306420, recordNumber=520100, 
    stream=0x0, transId=1050847, earlyWrite=false) at Section.cpp:444
#6  0x08441f81 in Dbb::updateRecord (this=0x402dc928, sectionId=43, 
    recordId=520100, stream=0x0, transId=1050847, earlyWrite=false)
    at Dbb.cpp:285
#7  0x084a70f7 in SRLUpdateRecords::redo (this=0x41c59330)
    at SRLUpdateRecords.cpp:120
#8  0x084a6b54 in SRLUpdateRecords::commit (this=0x41c59330)
    at SRLUpdateRecords.cpp:42
#9  0x08492103 in SerialLogTransaction::commit (this=0x40313c68)
    at SerialLogTransaction.cpp:61
#10 0x084923d0 in SerialLogTransaction::doAction (this=0x40313c68)
    at SerialLogTransaction.cpp:116
#11 0x0848d51b in SerialLog::gopherThread (this=0x402dd98c)
    at SerialLog.cpp:145
#12 0x0848d5d6 in SerialLog::gopherThread (arg=0x402dd98c) at SerialLog.cpp:115
#13 0x08417883 in Thread::thread (this=0x402e3ef4) at Thread.cpp:162
#14 0x08417d43 in Thread::thread (parameter=0x402e3ef4) at Thread.cpp:139
#15 0x40284297 in start_thread () from /lib/tls/libpthread.so.0
#16 0x4021937e in clone () from /lib/tls/libc.so.6
#17 0x41c59bb0 in ?? ()

(gdb) p pageNumber
$1 = 0
(gdb) l
439             int sequence = slot / dbb->pagesPerSection;
440             Bdb *bdb = getSectionPage (sequence, Shared, transId);
441             SectionPage *sectionPage = (SectionPage*) bdb->buffer;
442             int flags = sectionPage->flags;
443             int32 pageNumber = sectionPage->pages [slot %dbb->pagesPerSection];

Regards, Hakan

Another backtrace. This time segmentation fault.

Program received signal SIGSEGV, Segmentation fault.
0x0841815b in interlockedIncrement (ptr=0xc) at Interlock.h:74
74                      );
(gdb) bt
#0  0x0841815b in interlockedIncrement (ptr=0xc) at Interlock.h:74
#1  0x08475b91 in Record::addRef (this=0x0) at Record.cpp:485
#2  0x08405752 in StorageDatabase::deleteRow (this=0x402dc3d8, 
    storageConnection=0x40306ee8, table=0x403062b0, recordNumber=373650)
    at StorageDatabase.cpp:497
#3  0x08408b98 in StorageTable::deleteRow (this=0x407ed700, 
    recordNumber=373650) at StorageTable.cpp:90
#4  0x0840017c in NfsStorageTable::delete_row (this=0x89b90e8, 
    buf=0x89b9250 "�;%") at ha_falcon.cpp:705
#5  0x08352387 in handler::ha_delete_row (this=0x89b90e8, buf=0x89b9250 "�;%")
    at handler.cc:3706
#6  0x082f22d0 in mysql_delete (thd=0x89135b8, table_list=0x89c3068, 
    conds=0x0, order=0x89cc0e0, limit=18446744073709551615, options=0, 
    reset_auto_increment=false) at sql_delete.cc:240
#7  0x08277bbc in mysql_execute_command (thd=0x89135b8) at sql_parse.cc:3524
#8  0x083d1365 in sp_instr_stmt::exec_core (this=0x89c3218, thd=0x89135b8, 
    nextp=0x40620484) at sp_head.cc:2556
#9  0x083d111f in sp_lex_keeper::reset_lex_and_exec_core (this=0x89c3240, 
    thd=0x89135b8, nextp=0x40620484, open_tables=false, instr=0x89c3218)
    at sp_head.cc:2425
#10 0x083d4ead in sp_instr_stmt::execute (this=0x89c3218, thd=0x89135b8, 
    nextp=0x40620484) at sp_head.cc:2507
#11 0x083ceccf in sp_head::execute (this=0x89c14f0, thd=0x89135b8)
    at sp_head.cc:1095
#12 0x083d361b in sp_head::execute_procedure (this=0x89c14f0, thd=0x89135b8, 
    args=0x8913b0c) at sp_head.cc:1726
#13 0x0827b32d in mysql_execute_command (thd=0x89135b8) at sql_parse.cc:4623
#14 0x0827cee4 in mysql_parse (thd=0x89135b8, inBuf=0x89894d8 "call p1()", 
    length=9) at sql_parse.cc:6036
#15 0x0827d88e in dispatch_command (command=COM_QUERY, thd=0x89135b8, 
    packet=0x8981479 "call p1()", packet_length=10) at sql_parse.cc:1835
#16 0x0827ebbd in do_command (thd=0x89135b8) at sql_parse.cc:1619
#17 0x0827f06d in handle_one_connection (arg=0x89135b8) at sql_parse.cc:1234
#18 0x40284297 in start_thread () from /lib/tls/libpthread.so.0
#19 0x4021937e in clone () from /lib/tls/libc.so.6
#20 0x40621bb0 in ?? ()

(gdb) f 1
#1  0x08475b91 in Record::addRef (this=0x0) at Record.cpp:485
485             INTERLOCKED_INCREMENT (useCount);
(gdb) p useCount
Cannot access memory at address 0x0

Regards, Hakan

It seems to run fine on Win32 or Linux64 until one or the other
sessions get an update conflict.  Falcon is returning reasonable
text plus the error code HA_ERR_RECORD_CHANGED, which is the closest
thing in the standard MySQL errors codes to the condition.

Terminal 1 gives this for me:

+---------+------+
| update  | v2   |
+---------+------+
|  update | 4908 |
+---------+------+
1 row in set (1 min 46.27 sec)

+---------+
| delete  |
+---------+
|  delete |
+---------+
1 row in set (1 min 46.27 sec)

ERROR 1020 (HY000): Record has changed since last read in table 't27'

Please use the test case falcon_bug_138.test.

Note: Test case falcon_bug_138.test has loop count of 1 million. We reduced the loop count to 10000 for usability with Pushbuild.

Before running falcon_bug_138.test the loop count should be increased to 1 million.

The errors with error codes 1020 and 1213 are handled by following continue handlers:
  declare continue handler for 1020 begin end;
  declare continue handler for 1213 begin end;

Regards, Hakan

Changed the test to 1M :

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

falcon_bug_138                 [ fail ]

Errors are (from /home/patg/mysql-build/mysql-5.1-falcon-clean/mysql-test/var/log/mysqltest-time) :
mysqltest: At line 52: query 'call p1()' failed: 2013: Lost connection to MySQL server during query
(the last lines may be the most important ones)
Result from queries before failure can be found in r/falcon_bug_138.log

I'll run with gdb next.

Recently broken by "innodb compatibility mode."

Works again.