Bug #22150 | Falcon: both client and server hang in case of update conflict | ||
---|---|---|---|
Submitted: | 9 Sep 2006 5:35 | Modified: | 19 Apr 2007 17:09 |
Reporter: | Georg Richter | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S3 (Non-critical) |
Version: | 5.2 | OS: | Any |
Assigned to: | Jim Starkey | CPU Architecture: | Any |
[9 Sep 2006 5:35]
Georg Richter
[24 Sep 2006 16:30]
Jim Starkey
It runs with two sessions for sometime until Falcon detects an update conflict, at which point one of the clients reports: +---------+ | delete | +---------+ | delete | +---------+ 1 row in set (2 min 51.99 sec) ERROR 1022 (23000): Can't write; duplicate key in table 't27' Is this the correct behavior?
[26 Sep 2006 15:59]
Hakan Küçükyılmaz
This bug report shows only the first part of the original report. There where significant changes to the report, which are reflected in falcon_bugs_138.test. Please test falcon_bugs_138.test ./mysql-test-run.pl --enable-disabled falcon_bug_138 TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_138 [ fail ] Errors are (from /home/hakan/work/mysql/mysql-5.1-falcon/mysql-test/var/log/mysqltest-time) : mysqltest: At line 46: query 'call p27()' failed: 1020: Record has changed since last read in table 't27' (the last lines may be the most important ones) Regards, Hakan
[28 Sep 2006 11:17]
Hakan Küçükyılmaz
falcon_bug_138.test was missing continue handler for 1020 and 1213. After updating falcon_bug_138.test it fails now with: 1022: Can't write; duplicate key in table 't1' Please note that table t1 is CREATE TABLE t1 (a int(11), KEY a (a)); So there is no unique key where a duplicate key error could happen. Regards, Hakan
[17 Oct 2006 20:32]
Kevin Lewis
Change pushed to mysql-5.1-falcon to return Update Conflict instead of Duplicate Conflict.
[18 Oct 2006 12:44]
Matthias Leich
I tried this test several times with last ChangeSet@1.2324, 2006-10-17 . The result was in all cases that the test hangs a long time and gets killed by mysql-test-run.pl because of testcase timeout. snip of general_log.CSV: "2006-10-18 12:51:25","root[root] @ localhost []","1","1","Connect","root@localhost on test" "2006-10-18 12:51:25",...,"1","1","Query","SET STORAGE_ENGINE = Falcon" "2006-10-18 12:51:25",...,"1","1","Query","DROP TABLE IF EXISTS t1" "2006-10-18 12:51:25",...,"1","1","Query","DROP PROCEDURE IF EXISTS p1" "2006-10-18 12:51:25",...,"2","1","Connect","root@localhost on test" "2006-10-18 12:51:25",...,"2","1","Query","SET @@autocommit = 1" "2006-10-18 12:51:25",...,"1","1","Query","SET @@autocommit = 1" "2006-10-18 12:51:25",...,"1","1","Query","CREATE TABLE t1 (a int(11), KEY a (a))" "2006-10-18 12:51:25",...,"1","1","Query","CREATE PROCEDURE p1()\nbegin declare v1 int default 0;\ndeclare v2 int;\ndeclare continue handler for 1020 begin end;\ndeclare continue handler for 1213 begin end;\nwhile v1 < 10000 do\n/* SELECT 'insert', v1; */\nINSERT INTO t1 VALUES (v1);\nSET v2 = rand() * 10000;\nUPDATE t1 SET a = v2 WHERE a = v1;\n/* SELECT ' update', v1; */\nSET v1 = v1 + 1;\nif v1 mod 50 = 0 then\n/* SELECT ' delete'; */\nDELETE FROM t1;\nend if;\nend while;\nend" "2006-10-18 12:51:25",...,"1","1","Query","call p1()" "2006-10-18 12:51:25",...,"2","1","Query","call p1()" The following lines are caused by mysql-test-run.pl aborting the test. "2006-10-18 13:06:25","root[root] @ localhost []","3","1","Connect","root@localhost on " "2006-10-18 13:06:25",...,"3","1","Query","SHOW VARIABLES LIKE 'pid_file'" "2006-10-18 13:06:25",...,"3","1","Shutdown","" So it looks like that the wrong error message/number 1022: Can't write; duplicate key in table 't1' does not occur any more, because the test would abort if we get error 1022. I think this test is nice but unusable for automatic testing of Falcon as long as we do not have a wait for lock timeout feature.
[18 Oct 2006 18:21]
Kevin Lewis
I am not sure how to debug this hang any further. I have run the stored procedure from multiple MYSQL clients and the result is always the same. If I run only one at a time it runs to completion. If I run two or more at a time, only one of them will finish. The others are hung, but not in Falcon. The thread that was handling that connection is left open in vio_read(), waitng for a signal. The hang seems to be a loss of communication between the client and the MYSQLD. Here is a typical call stack. ntdll.dll!7c90eb94() [Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll] ntdll.dll!7c90e9c0() mswsock.dll!71a54033() mswsock.dll!71a557c9() ws2_32.dll!71ab4379() wsock32.dll!71ad2ea3() > mysqld.exe!vio_read(st_vio * vio=0x09268130, char * buf=0x09272f50, int size=4) Line 42 + 0x15 bytes C mysqld.exe!my_real_read(st_net * net=0x01d32200, unsigned long * complen=0x0a30feb4) Line 808 + 0x1a bytes C++ mysqld.exe!my_net_read(st_net * net=0x01d32200) Line 982 + 0xd bytes C++ mysqld.exe!do_command(THD * thd=0x01d318c0) Line 1577 + 0x9 bytes C++ mysqld.exe!handle_one_connection(void * arg=0x01d318c0) Line 1227 + 0x9 bytes C++ mysqld.exe!pthread_start(void * param=0x01cfb4f0) Line 63 + 0x9 bytes C mysqld.exe!_callthreadstart() Line 293 + 0xf bytes C mysqld.exe!_threadstart(void * ptd=0x09276f98) Line 277 C kernel32.dll!7c80b683() ================================================================================ The same is true when running falcon_bug_138.test. One of the threads will hang, and when the other stored procedure finishes, the test case waits for the one that hung, causing the whole test to hang. It leaves the same threads with the same call stacks. I have set falcon_log_mask = 4095 and use --console for mysqld.exe, but no log messages are there to indicate a problem with falcon.
[24 Oct 2006 11:24]
Matthias Leich
mysql-5.1-falcon last ChangeSet@1.2342, 2006-10-24 on Pentium M(x86-32Bit), SuSE Linux 10.1 ---------------------------------------- I tried that test with slightly modified testscript (attached t/falcon_bug_138x.test) many times in loop. In most cases I get the already known effect that the test hangs but there in rare cases I get a server crash during call p1('default'). Backtrace generated from core file: thd=0x90ddd98 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... Cannot determine thread, fp=0xb3400528, backtrace may not be correct. Stack range sanity check OK, backtrace follows: 0x8286c67 handle_segfault + 491 0x844dec5 _ZN11sp_rcontext12find_handlerEjN11MYSQL_ERROR18enum_warning_levelE + 161 0x844e0b8 _ZN11sp_rcontext12handle_errorEjN11MYSQL_ERROR18enum_warning_levelEP3THD + 90 0x82862ff _Z14my_message_sqljPKci + 183 0x870f16c my_error + 408 0x838a4af _ZN7handler11print_errorEii + 1717 0x8328577 _Z12mysql_deleteP3THDP13st_table_listP4ItemP11st_sql_listyyb + 3859 0x82a4aaf _Z21mysql_execute_commandP3THD + 13349 0x8445b57 _ZN13sp_instr_stmt9exec_coreEP3THDPj + 17 0x84464ab _ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr + 449 0x8446b82 _ZN13sp_instr_stmt7executeEP3THDPj + 412 0x8448d57 _ZN7sp_head7executeEP3THD + 1177 0x84498de _ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE + 1686 0x82a8b62 _Z21mysql_execute_commandP3THD + 29912 0x82aaec2 _Z11mysql_parseP3THDPcj + 536 0x82ab9aa _Z16dispatch_command19enum_server_commandP3THDPcj + 2434 0x82acee4 _Z10do_commandP3THD + 598 0x82ad2dd handle_one_connection + 783 0xb7c9734b _end + -1356549285 0xb7d6765e _end + -1355696530 New value of fp=(nil) failed sanity check, terminating stack trace! Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved stack trace is much more helpful in diagnosing the problem, so please do resolve it Trying to get some variables. Some pointers may be invalid and cause the dump to abort... thd->query at 0x918a480 = DELETE FROM t1 thd->thread_id=1 Attempt to get the problem with running GDB: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread -1287869536 (LWP 24912)] 0x0844df15 in sp_rcontext::find_handler (this=0x9193c50, sql_errno=1020, level=WARN_LEVEL_ERROR) at sp_rcontext.cc:217 217 switch (cond->type) (gdb) bt #0 0x0844df15 in sp_rcontext::find_handler (this=0x9193c50, sql_errno=1020, level=WARN_LEVEL_ERROR) at sp_rcontext.cc:217 #1 0x0844e108 in sp_rcontext::handle_error (this=0x9193c50, sql_errno=1020, level=WARN_LEVEL_ERROR, thd=0x90ded98) at sp_rcontext.cc:302 #2 0x0828634f in my_message_sql (error=1020, str=0xb33c7948 "Record has changed since last read in table 't1'", MyFlags=0) at mysqld.cc:2456 #3 0x0870fcd0 in my_error (nr=1020, MyFlags=0) at my_error.c:103 #4 0x0838a4ff in handler::print_error (this=0x918bac0, error=123, errflag=0) at handler.cc:2237 #5 0x083285c7 in mysql_delete (thd=0x90ded98, table_list=0x9165140, conds=0x0, order=0x916b7e8, limit=18446744073709551587, options=0, reset_auto_increment=false) at sql_delete.cc:258 #6 0x082a4aff in mysql_execute_command (thd=0x90ded98) at sql_parse.cc:3504 #7 0x08445ba7 in sp_instr_stmt::exec_core (this=0x91652f0, thd=0x90ded98, nextp=0xb33c93d4) at sp_head.cc:2551 #8 0x084464fb in sp_lex_keeper::reset_lex_and_exec_core (this=0x9165318, thd=0x90ded98, nextp=0xb33c93d4, open_tables=false, instr=0x91652f0) at sp_head.cc:2422 #9 0x08446bd2 in sp_instr_stmt::execute (this=0x91652f0, thd=0x90ded98, nextp=0xb33c93d4) at sp_head.cc:2502 #10 0x08448da7 in sp_head::execute (this=0x9120f38, thd=0x90ded98) at sp_head.cc:1096 ---Type <return> to continue, or q <return> to quit--- #11 0x0844992e in sp_head::execute_procedure (this=0x9120f38, thd=0x90ded98, args=0x90df2e8) at sp_head.cc:1725 #12 0x082a8bb2 in mysql_execute_command (thd=0x90ded98) at sql_parse.cc:4619 #13 0x082aaf12 in mysql_parse (thd=0x90ded98, inBuf=0x9192568 "call p1('default')", length=18) at sql_parse.cc:6080 #14 0x082ab9fa in dispatch_command (command=COM_QUERY, thd=0x90ded98, packet=0x90fbec9 " call p1('default')", packet_length=20) at sql_parse.cc:1828 #15 0x082acf34 in do_command (thd=0x90ded98) at sql_parse.cc:1612 #16 0x082ad32d in handle_one_connection (arg=0x90ded98) at sql_parse.cc:1227 #17 0xb7c8f34b in start_thread () from /lib/libpthread.so.0 #18 0xb7d5f65e in clone () from /lib/libc.so.6 (gdb) I do not know if both traces above and the often hanging tests are caused by the same problem. There are also some open Bug reports for stored procedures which probably suffer from the same problem like Bug#23032 Handlers declared in a SP do not handle warnings generated in sub-SP (non Falcon problem) Bug#22852 Falcon: Packets out of order after handled error Also probably the fix for Bug#22185 Incorrect Error message in two concurrent transactions (Please look at Hakans last comment: "Now we get [00:44] root@sbtest>update sbtest set pad='eee' where id=10000; ERROR 1020 (HY000): Record has changed since last read in table 'sbtest'" ) might be the cause for the crash. InnoDB runs this test without any problems. But I do not think that our current informations allow to draw the conclusion that the crashes are a clear Falcon bug. The behaviour of InnoDB and Falcon around two connections trying to insert/update/delete partially the same records differ though both try to follow the SQL standard. That means it is thinkable that some standard compliant but Falcon specific properties cause that some faulty stored procedure related code is executed. Therefore I would propose the following activities: 1. Check, if the backtraces above help to locate the reason for the crashes 2. If 1. fails, wait till Bug#22852 is fixed and try to reproduce the crash again. Attached files: falcon_bug_138x.test Slightly extended original testscript Y.sh and Z.sh Shell scripts which help to reproduce the random crash
[24 Oct 2006 11:27]
Matthias Leich
Shellscript for reproduction of crash
Attachment: Y.sh (application/x-shellscript, text), 1.52 KiB.
[24 Oct 2006 11:27]
Matthias Leich
Shellscript for reproduction of crash with GDB
Attachment: Z.sh (application/x-shellscript, text), 1.57 KiB.
[24 Oct 2006 11:29]
Matthias Leich
test script
Attachment: falcon_bug_138x.test (application/octet-stream, text), 4.12 KiB.
[24 Oct 2006 11:30]
Matthias Leich
Expected results (in most but not in all cases valid)
Attachment: falcon_bug_138x.result (application/octet-stream, text), 3.12 KiB.
[27 Oct 2006 2:56]
Kevin Lewis
I was asked to set a breakpoint in SqlError::error() and see if there are any unhandled exceptions originating from the Falcon code. I ran this several times using the test script as well as multiple MYSQL clients running the stored procedure. The only SqlError::error() call that is made is the UPDATE_CONFLICT from Table::ValidateAndInsert(). This error is caught and handled correctly. Sometimes I would get multiples of these errors before the hang would occur, sometimes none at all, but usually about once or twice before the hang. The hang has all threads in the engine in a wait state. Calvin reported that in his testing the client threads were waiting on the engine.
[31 Oct 2006 14:10]
Matthias Leich
Experiments with falcon_bugs_138x, InnoDB and ~10000 loops show often the following effect: Connection "default" runs '--send call p1('default')' and connection "conn1" runs call p1('conn1'). The execution aborts with the message query 'call p1('conn1')' failed: 1213: Deadlock found when trying to get lock; try restarting transaction I assume it is possible that a run with InnoDB might get a deadlock, but the surprising thing is that the execution aborts. We have a continue handler for error 1213 within the procedure. Maybe this is caused by the fact that InnoDB does an implicit ROLLBACK of the complete transaction and probably any further execution of the procedure is impossible or intentionally disabled. At least we have here a case where the continue handler does not prevent that the test aborts.
[1 Nov 2006 17:41]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/14703 ChangeSet@1.2321, 2006-11-01 19:41:09+02:00, monty@mysql.com +28 -0 Fixed a lot of compiler warnings (Mainly in mysqld and instance manager) Fixed some possible fatal wrong arguments to printf() style functions Initialized some not initialized variables Fixed bug in stored procedure and continue handlers (Fixes Bug#22150)
[10 Nov 2006 18:57]
Patrick Galbraith
TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_138 [ pass ] 347645 ------------------------------------------------------- Stopping All Servers All 1 tests were successful. The servers were restarted 1 times Spent 347.645 seconds actually executing testcases
[11 Nov 2006 23:41]
Hakan Küçükyılmaz
Test case falcon_bug_138.test has loop count of 1 million. We reduced the loop count to 10000 for usability with Pushbuild. Running falcon_bug_138.test with with 1 million loops results in an assertion in Falcon: Program received signal SIGILL, Illegal instruction. [Switching to Thread 1103469488 (LWP 20167)] 0xffffe410 in __kernel_vsyscall () (gdb) bt #0 0xffffe410 in __kernel_vsyscall () #1 0x40182541 in raise () from /lib/tls/libc.so.6 #2 0x08448ac6 in Error::debugBreak () at Error.cpp:93 #3 0x08448b23 in Error::error ( string=0x8744678 "assertion failed at line %d in file %s\n") at Error.cpp:70 #4 0x08448bb3 in Error::assertionFailed (fileName=0x874a3ee "Section.cpp", line=444) at Error.cpp:77 #5 0x0848b34f in Section::updateRecord (this=0x40306420, recordNumber=520100, stream=0x0, transId=1050847, earlyWrite=false) at Section.cpp:444 #6 0x08441f81 in Dbb::updateRecord (this=0x402dc928, sectionId=43, recordId=520100, stream=0x0, transId=1050847, earlyWrite=false) at Dbb.cpp:285 #7 0x084a70f7 in SRLUpdateRecords::redo (this=0x41c59330) at SRLUpdateRecords.cpp:120 #8 0x084a6b54 in SRLUpdateRecords::commit (this=0x41c59330) at SRLUpdateRecords.cpp:42 #9 0x08492103 in SerialLogTransaction::commit (this=0x40313c68) at SerialLogTransaction.cpp:61 #10 0x084923d0 in SerialLogTransaction::doAction (this=0x40313c68) at SerialLogTransaction.cpp:116 #11 0x0848d51b in SerialLog::gopherThread (this=0x402dd98c) at SerialLog.cpp:145 #12 0x0848d5d6 in SerialLog::gopherThread (arg=0x402dd98c) at SerialLog.cpp:115 #13 0x08417883 in Thread::thread (this=0x402e3ef4) at Thread.cpp:162 #14 0x08417d43 in Thread::thread (parameter=0x402e3ef4) at Thread.cpp:139 #15 0x40284297 in start_thread () from /lib/tls/libpthread.so.0 #16 0x4021937e in clone () from /lib/tls/libc.so.6 #17 0x41c59bb0 in ?? () (gdb) p pageNumber $1 = 0 (gdb) l 439 int sequence = slot / dbb->pagesPerSection; 440 Bdb *bdb = getSectionPage (sequence, Shared, transId); 441 SectionPage *sectionPage = (SectionPage*) bdb->buffer; 442 int flags = sectionPage->flags; 443 int32 pageNumber = sectionPage->pages [slot %dbb->pagesPerSection]; Regards, Hakan
[12 Nov 2006 0:38]
Hakan Küçükyılmaz
Another backtrace. This time segmentation fault. Program received signal SIGSEGV, Segmentation fault. 0x0841815b in interlockedIncrement (ptr=0xc) at Interlock.h:74 74 ); (gdb) bt #0 0x0841815b in interlockedIncrement (ptr=0xc) at Interlock.h:74 #1 0x08475b91 in Record::addRef (this=0x0) at Record.cpp:485 #2 0x08405752 in StorageDatabase::deleteRow (this=0x402dc3d8, storageConnection=0x40306ee8, table=0x403062b0, recordNumber=373650) at StorageDatabase.cpp:497 #3 0x08408b98 in StorageTable::deleteRow (this=0x407ed700, recordNumber=373650) at StorageTable.cpp:90 #4 0x0840017c in NfsStorageTable::delete_row (this=0x89b90e8, buf=0x89b9250 "�;%") at ha_falcon.cpp:705 #5 0x08352387 in handler::ha_delete_row (this=0x89b90e8, buf=0x89b9250 "�;%") at handler.cc:3706 #6 0x082f22d0 in mysql_delete (thd=0x89135b8, table_list=0x89c3068, conds=0x0, order=0x89cc0e0, limit=18446744073709551615, options=0, reset_auto_increment=false) at sql_delete.cc:240 #7 0x08277bbc in mysql_execute_command (thd=0x89135b8) at sql_parse.cc:3524 #8 0x083d1365 in sp_instr_stmt::exec_core (this=0x89c3218, thd=0x89135b8, nextp=0x40620484) at sp_head.cc:2556 #9 0x083d111f in sp_lex_keeper::reset_lex_and_exec_core (this=0x89c3240, thd=0x89135b8, nextp=0x40620484, open_tables=false, instr=0x89c3218) at sp_head.cc:2425 #10 0x083d4ead in sp_instr_stmt::execute (this=0x89c3218, thd=0x89135b8, nextp=0x40620484) at sp_head.cc:2507 #11 0x083ceccf in sp_head::execute (this=0x89c14f0, thd=0x89135b8) at sp_head.cc:1095 #12 0x083d361b in sp_head::execute_procedure (this=0x89c14f0, thd=0x89135b8, args=0x8913b0c) at sp_head.cc:1726 #13 0x0827b32d in mysql_execute_command (thd=0x89135b8) at sql_parse.cc:4623 #14 0x0827cee4 in mysql_parse (thd=0x89135b8, inBuf=0x89894d8 "call p1()", length=9) at sql_parse.cc:6036 #15 0x0827d88e in dispatch_command (command=COM_QUERY, thd=0x89135b8, packet=0x8981479 "call p1()", packet_length=10) at sql_parse.cc:1835 #16 0x0827ebbd in do_command (thd=0x89135b8) at sql_parse.cc:1619 #17 0x0827f06d in handle_one_connection (arg=0x89135b8) at sql_parse.cc:1234 #18 0x40284297 in start_thread () from /lib/tls/libpthread.so.0 #19 0x4021937e in clone () from /lib/tls/libc.so.6 #20 0x40621bb0 in ?? () (gdb) f 1 #1 0x08475b91 in Record::addRef (this=0x0) at Record.cpp:485 485 INTERLOCKED_INCREMENT (useCount); (gdb) p useCount Cannot access memory at address 0x0 Regards, Hakan
[20 Nov 2006 22:28]
Jim Starkey
It seems to run fine on Win32 or Linux64 until one or the other sessions get an update conflict. Falcon is returning reasonable text plus the error code HA_ERR_RECORD_CHANGED, which is the closest thing in the standard MySQL errors codes to the condition.
[21 Nov 2006 1:02]
Patrick Galbraith
Terminal 1 gives this for me: +---------+------+ | update | v2 | +---------+------+ | update | 4908 | +---------+------+ 1 row in set (1 min 46.27 sec) +---------+ | delete | +---------+ | delete | +---------+ 1 row in set (1 min 46.27 sec) ERROR 1020 (HY000): Record has changed since last read in table 't27'
[21 Nov 2006 1:43]
Hakan Küçükyılmaz
Please use the test case falcon_bug_138.test. Note: Test case falcon_bug_138.test has loop count of 1 million. We reduced the loop count to 10000 for usability with Pushbuild. Before running falcon_bug_138.test the loop count should be increased to 1 million. The errors with error codes 1020 and 1213 are handled by following continue handlers: declare continue handler for 1020 begin end; declare continue handler for 1213 begin end; Regards, Hakan
[21 Nov 2006 3:55]
Patrick Galbraith
Changed the test to 1M : TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_138 [ fail ] Errors are (from /home/patg/mysql-build/mysql-5.1-falcon-clean/mysql-test/var/log/mysqltest-time) : mysqltest: At line 52: query 'call p1()' failed: 2013: Lost connection to MySQL server during query (the last lines may be the most important ones) Result from queries before failure can be found in r/falcon_bug_138.log I'll run with gdb next.
[19 Apr 2007 16:49]
Jim Starkey
Recently broken by "innodb compatibility mode."
[19 Apr 2007 17:09]
Hakan Küçükyılmaz
Works again.