Bug #30032 | Falcon: running falcon_concurrent_blob_updates leads to possible memory leak | ||
---|---|---|---|
Submitted: | 25 Jul 2007 11:04 | Modified: | 10 Aug 2007 19:07 |
Reporter: | Hakan Küçükyılmaz | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0 | OS: | Any |
Assigned to: | Kevin Lewis | CPU Architecture: | Any |
Tags: | falcon, falcon_concurrent_blob_updates |
[25 Jul 2007 11:04]
Hakan Küçükyılmaz
[25 Jul 2007 11:06]
Hakan Küçükyılmaz
falcon_concurrent_blob_updates
Attachment: falcon_concurrent_blob_updates.tar.gz (application/x-tar, text), 3.62 KiB.
[25 Jul 2007 12:04]
Hakan Küçükyılmaz
Another assertion is: Program received signal SIGILL, Illegal instruction. [Switching to Thread 1085045088 (LWP 4096)] 0x00002aaaab16643a in raise () from /lib64/tls/libc.so.6 (gdb) f 4 #4 0x000000000094b300 in Cache::fakePage (this=0x2aaaabad7f50, dbb=0x2aaaabb287c8, pageNumber=3123, type=PAGE_data_overflow, transId=1721) at Cache.cpp:275 275 ASSERT(!bdb->syncObject.isLocked()); (gdb) bt #0 0x00002aaaab16643a in raise () from /lib64/tls/libc.so.6 #1 0x00000000008b7ef4 in Error::debugBreak () at Error.cpp:92 #2 0x00000000008b8008 in Error::error (string=0xc47288 "assertion failed at line %d in file %s\n") at Error.cpp:69 #3 0x00000000008b809f in Error::assertionFailed (fileName=0xc55380 "Cache.cpp", line=275) at Error.cpp:76 #4 0x000000000094b300 in Cache::fakePage (this=0x2aaaabad7f50, dbb=0x2aaaabb287c8, pageNumber=3123, type=PAGE_data_overflow, transId=1721) at Cache.cpp:275 #5 0x00000000008ac84c in Dbb::fakePage (this=0x2aaaabb287c8, pageNumber=3123, pageType=PAGE_data_overflow, transId=1721) at Dbb.cpp:197 #6 0x00000000008e94de in PageInventoryPage::allocPage (dbb=0x2aaaabb287c8, pageType=PAGE_data_overflow, transId=1721) at PageInventoryPage.cpp:99 #7 0x00000000008ac967 in Dbb::allocPage (this=0x2aaaabb287c8, pageType=PAGE_data_overflow, transId=1721) at Dbb.cpp:230 #8 0x0000000000904a86 in Section::storeTail (this=0x2aaaabb2eba8, stream=0x40ac4d00, maxRecord=4084, pLength=0x40ac4ab4, transId=1721, earlyWrite=true) at Section.cpp:676 #9 0x0000000000904cdf in Section::storeRecord (this=0x2aaaabb2eba8, recordLocatorPage=0x2aaaf0120000, indexPageNumber=2389, index=0x2aaaf0120f30, stream=0x40ac4d00, transId=1721, earlyWrite=true) at Section.cpp:589 #10 0x0000000000908134 in Section::updateRecord (this=0x2aaaabb2eba8, recordNumber=2014, stream=0x40ac4d00, transId=1721, earlyWrite=true) at Section.cpp:517 #11 0x00000000008af602 in Dbb::updateRecord (this=0x2aaaabb287c8, sectionId=1, recordId=2014, stream=0x40ac4d00, transId=1721, earlyWrite=true) at Dbb.cpp:316 #12 0x00000000008af67b in Dbb::updateBlob (this=0x2aaaabb287c8, sectionId=1, recordId=2014, stream=0x40ac4d00, transId=1721) at Dbb.cpp:310 #13 0x00000000008af6c9 in Dbb::updateBlob (this=0x2aaaabb287c8, blobSectionId=1, recordNumber=2014, stream=0x40ac4d00, transaction=0x2aaaabbfbd70) at Dbb.cpp:1322 #14 0x0000000000875244 in Table::storeBlob (this=0x2aaaabb305c0, transaction=0x2aaaabbfbd70, length=57200, data=0x10ca6a8 "oqsasfrtcmjcrjdfmkzghqfxfhkavdmjjodqezvuebmzdwkqrrvsyvfpfqluuzlzohdjsaedvzbdhluiwuwikifkzdyixzrbrwuyamrryhqelztqguevdrclgdyjqagweegivrswdfambmszcbrrvjbkfpdchxfqijtuloaovewsrjphfxfhtolsgruudokmnrplljty"...) at Table.cpp:2764 #15 0x000000000086509d in StorageDatabase::storeBlob (this=0x2aaaabad71c8, connection=0x2aaaabb80460, table=0x2aaaabb305c0, blob=0x40ac5210) at StorageDatabase.cpp:731 #16 0x000000000086aecd in StorageTable::storeBlob (this=0x2aaaaba10ec8, blob=0x40ac5210) at StorageTable.cpp:274 #17 0x000000000085f2c2 in StorageInterface::encodeRecord (this=0x1095808, buf=0x1095a48 "�\021nQP", updateFlag=false) at ha_falcon.cpp:2133
[25 Jul 2007 16:29]
MySQL Verification Team
Thank you for the bug report. Verified on Windows, the server prints the below error messages and the server memory usage increments until left the machine without resources. :\dev\6.0>bin\mysqld-nt --standalone --console 070725 13:00:39 InnoDB: Started; log sequence number 0 46409 070725 13:00:49 [Note] Event Scheduler: Loaded 0 events 070725 13:00:49 [Note] bin\mysqld-nt: ready for connections. Version: '6.0.1-alpha-nt' socket: '' port: 3306 Source distribution 070725 13:09:56 [ERROR] bin\mysqld-nt: Record has changed since last read in table 't1' 070725 13:09:56 [ERROR] bin\mysqld-nt: Sort aborted 070725 13:09:57 [ERROR] bin\mysqld-nt: Record has changed since last read in table 't1' 070725 13:09:57 [ERROR] bin\mysqld-nt: Sort aborted 070725 13:10:00 [ERROR] bin\mysqld-nt: Record has changed since last read in table 't1' 070725 13:10:00 [ERROR] bin\mysqld-nt: Sort aborted 070725 13:10:00 [ERROR] bin\mysqld-nt: Record has changed since last read in table 't1' 070725 13:10:00 [ERROR] bin\mysqld-nt: Sort aborted 070725 13:10:01 [ERROR] bin\mysqld-nt: Record has changed since last read in table 't1'
[26 Jul 2007 2:23]
Kevin Lewis
I am going to take a stab at a possible fix for this memory leak. From the information schema output that Hakan sent earlier today, we know that what is leaking is a RecordVersion allocated during selectForUpdate, a blobObject, and an object used by the pageWriter to track dirty pages. I am not at all familiar with the pageWriter, but I have a theory about the other two; Last Friday, Jim checked in a change to StorageDatabase to fix a memory leak he was seeing running falcon_concurrent_blob_updates. I don't think there was a bug associated with it, but I remember him describing the problem. The description says; "Fix record reference problem during record deadlock processes.". The change was that a recordVersion object that got a refCount inside the scope of a try{} would not get released in the catch {} scope if an exception was thrown. Jim fixed this by declaring the RecordVersion *candidate outside the try{} scope and adding a release() in the catch{} section. This was done in StorageDatabase::nextRow and StorageDatabase::fetch(). When I reviewed the code for him, I noticed that the same thing should be done in StorageDatabase::deleteRow(). When I talked to him about it on Sunday, he did not think that an exception there would likely happen, and suggested holding this fix for after the Beta, since it did not seem to be a problem at the time. But he also mentioned that the candidate returned in the try{} section of StorageDatabase::deleteRow() would have been allocated in fetchForUpdate! (This is where our current leaking memory is originating). So I am going to check in this change, which has been approved by Jim, but was on hold. Hakan, please let me know if it makes a difference in the current slow memory leak you see running falcon_concurrent_blob_updates. It is possible, in my limited understanding, that this leak explains the Cache::fakePage() problem as well. Jim wrote in a recent email; "Fake page is used to create a buffer for a newly allocated (or reallocated page). If somebody else has it lock, there has been a terrible misunderstanding." Maybe this unreleased recordVersion is keeping the blob page buffer around and locked after it should have been deleted. Anyway, this change is worth a try.
[26 Jul 2007 10:25]
Hakan Küçükyılmaz
Memory usage is still growing steady when running falcon_concurrent_blob_update in a loop.
[26 Jul 2007 17:44]
Hakan Küçükyılmaz
Running falcon_concurrent_blob_update in a loop with constant seed value of 1 leeds to segmentations fault. (Memory leak is still present). while ( [ $? -eq 0 ] ) do ./falcon_concurrent_blob_updates -r540 -s1 done Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1076373856 (LWP 17319)] 0x00000000008f159e in RecordVersion::fetchVersion (this=0x0, trans=0x2aaaf1a332c8) at RecordVersion.cpp:89 89 Transaction *recTransaction = transaction; (gdb) bt #0 0x00000000008f159e in RecordVersion::fetchVersion (this=0x0, trans=0x2aaaf1a332c8) at RecordVersion.cpp:89 #1 0x00000000008f16db in RecordVersion::fetchVersion (this=0x2aab1ecbde50, trans=0x2aaaf1a332c8) at RecordVersion.cpp:111 #2 0x00000000008f16db in RecordVersion::fetchVersion (this=0x2aab1ecbbd20, trans=0x2aaaf1a332c8) at RecordVersion.cpp:111 #3 0x0000000000863ff6 in StorageDatabase::nextRow (this=0x2aaaabad71c8, storageTable=0x2aab1e56e2d8, recordNumber=7903, lockForUpdate=false) at StorageDatabase.cpp:250 #4 0x000000000086aa11 in StorageTable::next (this=0x2aab1e56e2d8, recordNumber=0, lockForUpdate=false) at StorageTable.cpp:126 #5 0x0000000000860448 in StorageInterface::rnd_next (this=0x2aaaf1b37168, buf=0x2aaaf1b373a8 "�C:��") at ha_falcon.cpp:481 #6 0x000000000078620d in rr_sequential (info=0x2aaaf1b03cb0) at records.cc:363 #7 0x00000000006efdc7 in join_init_read_record (tab=0x2aaaf1b03c30) at sql_select.cc:11480 #8 0x00000000006ec7ae in sub_select (join=0x2aaaf1b28b68, join_tab=0x2aaaf1b03c30, end_of_records=false) at sql_select.cc:10830 #9 0x00000000006ecb6d in do_select (join=0x2aaaf1b28b68, fields=0x10bfe30, table=0x0, procedure=0x0) at sql_select.cc:10594 #10 0x0000000000702d69 in JOIN::exec (this=0x2aaaf1b28b68) at sql_select.cc:2099 #11 0x00000000007031c5 in mysql_select (thd=0x10be328, rref_pointer_array=0x10bff10, tables=0x2aaaf1b02cc8, wild_num=0, fields=@0x10bfe30, conds=0x2aaaf1b032b8, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2148813312, result=0x2aaaf1b034f0, unit=0x10bf960, select_lex=0x10bfd28) at sql_select.cc:2275 #12 0x00000000007034f7 in handle_select (thd=0x10be328, lex=0x10bf8c0, result=0x2aaaf1b034f0, setup_tables_done_option=0) at sql_select.cc:258 #13 0x0000000000678cfe in execute_sqlcom_select (thd=0x10be328, all_tables=0x2aaaf1b02cc8) at sql_parse.cc:4485 #14 0x000000000067ede9 in mysql_execute_command (thd=0x10be328) at sql_parse.cc:1835 #15 0x000000000068720b in mysql_parse (thd=0x10be328, inBuf=0x2aaaf1b02a08 "SELECT c1 FROM t1 WHERE c2 = 'qkgdmkinlsyocwqvnfvcguqwbitvlkisfmswiwyqzgenxngyreyidxhebouuzqenjgofwhmykunzvkbjbfixdnzjnqknabyrvjsshmcfaopfqhbxjhbvzksqvlwqwcducafdhubmybwtckchgsvqlrpkwheuaztnkkysy'", length=196, found_semicolon=0x40281fa0) at sql_parse.cc:5381 #16 0x0000000000687e0d in dispatch_command (command=COM_QUERY, thd=0x10be328, packet=0x2aaaf1b17b19 "", packet_length=197) at sql_parse.cc:907 #17 0x0000000000689192 in do_command (thd=0x10be328) at sql_parse.cc:669 #18 0x0000000000677522 in handle_one_connection (arg=0x103aba8) at sql_connect.cc:1091 #19 0x00002aaaab46cfa5 in start_thread () from /lib64/tls/libpthread.so.0 #20 0x00002aaaab1f4cb2 in clone () from /lib64/tls/libc.so.6 #21 0x0000000000000000 in ?? () #22 0x0000000000000000 in ?? () Or slightly different backtrace: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1076373856 (LWP 21049)] 0x00000000008f159e in RecordVersion::fetchVersion (this=0x0, trans=0x2aaaf1eee4d0) at RecordVersion.cpp:89 89 Transaction *recTransaction = transaction; (gdb) bt #0 0x00000000008f159e in RecordVersion::fetchVersion (this=0x0, trans=0x2aaaf1eee4d0) at RecordVersion.cpp:89 #1 0x00000000008f16db in RecordVersion::fetchVersion (this=0x2aab4457c698, trans=0x2aaaf1eee4d0) at RecordVersion.cpp:111 #2 0x0000000000863ff6 in StorageDatabase::nextRow (this=0x2aaaabad71c8, storageTable=0x2aab43bfbb48, recordNumber=4250, lockForUpdate=false) at StorageDatabase.cpp:250 #3 0x000000000086aa11 in StorageTable::next (this=0x2aab43bfbb48, recordNumber=4250, lockForUpdate=false) at StorageTable.cpp:126 #4 0x0000000000860448 in StorageInterface::rnd_next (this=0x2aaaf251a218, buf=0x2aaaf251a458 "�?\206\237�") at ha_falcon.cpp:481 #5 0x000000000078620d in rr_sequential (info=0x10968f0) at records.cc:363 #6 0x00000000006ec7d8 in sub_select (join=0x10917d8, join_tab=0x1096870, end_of_records=false) at sql_select.cc:10836 #7 0x00000000006ecb6d in do_select (join=0x10917d8, fields=0x108edf0, table=0x0, procedure=0x0) at sql_select.cc:10594 #8 0x0000000000702d69 in JOIN::exec (this=0x10917d8) at sql_select.cc:2099 #9 0x00000000007031c5 in mysql_select (thd=0x108d2e8, rref_pointer_array=0x108eed0, tables=0x1095a78, wild_num=0, fields=@0x108edf0, conds=0x1095ef8, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2147764736, result=0x1096130, unit=0x108e920, select_lex=0x108ece8) at sql_select.cc:2275 #10 0x00000000007034f7 in handle_select (thd=0x108d2e8, lex=0x108e880, result=0x1096130, setup_tables_done_option=0) at sql_select.cc:258 #11 0x0000000000678cfe in execute_sqlcom_select (thd=0x108d2e8, all_tables=0x1095a78) at sql_parse.cc:4485 #12 0x000000000067ede9 in mysql_execute_command (thd=0x108d2e8) at sql_parse.cc:1835 #13 0x000000000068720b in mysql_parse (thd=0x108d2e8, inBuf=0x1095858 "SELECT c1 FROM t1 WHERE c2 = 'uimfslitzlo'", length=42, found_semicolon=0x40281fa0) at sql_parse.cc:5381 #14 0x0000000000687e0d in dispatch_command (command=COM_QUERY, thd=0x108d2e8, packet=0x10998d9 "", packet_length=43) at sql_parse.cc:907 #15 0x0000000000689192 in do_command (thd=0x108d2e8) at sql_parse.cc:669 #16 0x0000000000677522 in handle_one_connection (arg=0x103aba8) at sql_connect.cc:1091 #17 0x00002aaaab46cfa5 in start_thread () from /lib64/tls/libpthread.so.0 #18 0x00002aaaab1f4cb2 in clone () from /lib64/tls/libc.so.6
[30 Jul 2007 15:28]
Kevin Lewis
I tried running falcon_concurrent_blob_update using only one thread to see if concurrency was the problem. I found that even with only one thread, there are several memory leaks. When I run the following sveral times, the memory increases fairly consistently by the following amounts falcon_concurrent_blob_update -r300 -s1 -n1 File,Line Function Call Bytes Percent PageWriter.cpp,126 PageWriter::getElement new DirtyLinen 971 1.10% Table.cpp,2511 Table::getBinaryBlob new BinaryBlob 73030 82.47% Table.cpp,2550 Table::insert new RecordVersion 2048 2.31% Table.cpp,2652 Table::update new RecordVersion 862 0.97% Table.cpp,3007 Table::fetchForUpdate new RecordVersion 11646 13.15% Most of the memory lost is due to the buildup of BinaryBlob objects. It turns out that many of the calls to Value::getBlob(), which increments the refCount by calling BinaryBlob::addRef(), do not have equivalent calls to BinaryBlob::release(). The same situation exists for calls to Value::getClob. After adding about 20 calls to release(), I was able to eliminate the memory leak of BinaryBlob objects. This accounts for about 82% of the memory leakage in this bug.
[3 Aug 2007 20:50]
Kevin Lewis
There were three memory leaks identified by INFORMATION_SCHEMA FALCON_SYSTEM_MEMORY_DETAIL when running falcon_concurrent_blob_updates. The first one was fixed last week with changes to the BinaryBlob and AsciiBlob code to make sure the refCount on those objects were always released. The second problem is a leak of RecordVersion objects. The problem with debugging this code is that there are hundreds of places where the RecordVersion objects are referenced. I had to add a bunch of temporary code that tracks and displays the history of these objects during the test. The problem turned out to be in Transaction::releaseRecordLocks. It was unlinking the RecordVersion object without releasing the refCount. It needs to call Transaction::removeRecord() except that this function contains a 'for' loop to find the object. So I split removeRecord into two functions, one to find the object, the other to remove it from the transaction. Now releaseRecordLocks will call the second removeRecord directly. This solves the second memory leak. The third is of DirtyLinen objects allocated at PageWriter.cpp, line 126. That is next.
[8 Aug 2007 15:07]
Kevin Lewis
Fixed the third and final memory leak from this test. DirtyPage elements were not being added back into the freePages list within the PageWriter. This happened in PageWriter::writer(). It would call PageWriter::removeElement and then only call PageWriter.::release() if the Bdb was null. If bdb was not null, it would call cache->writePage() which would have recycled the element in PageWriter::pageWritten() if removeElement had not already been called. The fix involved always calling release() after removeElement() and adjusting the Sync objects a little to add better protection to the hash table and other lists of DirtyPage elements.
[10 Aug 2007 8:06]
Hakan Küçükyılmaz
I don't see any memory leak with this test anymore. But it crashes with segmentation fault. I ran the test in a loop with increasing SEED value like SEED=1 while ( [ $? -eq 0 ] ); do ./falcon_concurrent_blob_updates -r540 -s$SEED; SEED=$(($SEED + 3)); done Completed spawning new database worker threads. Runtime is 540. Seed value is 25. Isolation level is READ-COMMITTED. Testcase is now running, so watch for error output. Waiting for worker threads to finish ... Running initializations ... Pre-generating 16777216 bytes of random data. About to spawn 2 threads 1 2 Completed spawning new database worker threads. Runtime is 540. Seed value is 28. Isolation level is READ-COMMITTED. Testcase is now running, so watch for error output. Waiting for worker threads to finish ... --> crash Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 1074272592 (LWP 935)] 0x0000000000924010 in RecordVersion::fetchVersion (this=0x0, trans=0x2aaab2758a80) at RecordVersion.cpp:89 89 Transaction *recTransaction = transaction; (gdb) bt #0 0x0000000000924010 in RecordVersion::fetchVersion (this=0x0, trans=0x2aaab2758a80) at RecordVersion.cpp:89 #1 0x000000000092414d in RecordVersion::fetchVersion (this=0x2aaab2749cc8, trans=0x2aaab2758a80) at RecordVersion.cpp:111 #2 0x000000000092414d in RecordVersion::fetchVersion (this=0x2aaab23b1e08, trans=0x2aaab2758a80) at RecordVersion.cpp:111 #3 0x000000000089e06c in StorageDatabase::nextRow (this=0x2b051364c1c8, storageTable=0x2b051349c750, recordNumber=8099, lockForUpdate=false) at StorageDatabase.cpp:250 #4 0x00000000008a3939 in StorageTable::next (this=0x2b051349c750, recordNumber=0, lockForUpdate=false) at StorageTable.cpp:126 #5 0x000000000088c64a in StorageInterface::rnd_next (this=0x11bd118, buf=0x11bd358 "�\005N\206X") at ha_falcon.cpp:480 #6 0x00000000007b16d9 in rr_sequential (info=0x11c2a58) at records.cc:363 #7 0x000000000070dcd5 in join_init_read_record (tab=0x11c29d8) at sql_select.cc:11480 #8 0x000000000070d3b3 in sub_select (join=0x11bd6f8, join_tab=0x11c29d8, end_of_records=false) at sql_select.cc:10830 #9 0x0000000000715ecd in do_select (join=0x11bd6f8, fields=0x11bc770, table=0x0, procedure=0x0) at sql_select.cc:10594 #10 0x000000000072ebd6 in JOIN::exec (this=0x11bd6f8) at sql_select.cc:2099 #11 0x000000000072a3d8 in mysql_select (thd=0x11bac68, rref_pointer_array=0x11bc850, tables=0x11c1a30, wild_num=0, fields=@0x11bc770, conds=0x11c2060, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2148813312, result=0x11c2298, unit=0x11bc2a0, select_lex=0x11bc668) at sql_select.cc:2275 #12 0x000000000072eeef in handle_select (thd=0x11bac68, lex=0x11bc200, result=0x11c2298, setup_tables_done_option=0) at sql_select.cc:258 #13 0x00000000006a950a in execute_sqlcom_select (thd=0x11bac68, all_tables=0x11c1a30) at sql_parse.cc:4485 #14 0x00000000006aad7c in mysql_execute_command (thd=0x11bac68) at sql_parse.cc:1835 #15 0x00000000006b3103 in mysql_parse (thd=0x11bac68, inBuf=0x11c1738 "SELECT c1 FROM t1 WHERE c2 = 'pucuufjgxbdxdyxbcymbtcelwtxboqfrnfqefjlhrgddslbgsbbpjjyeylnntylxanfaobrnnczrsqtnjiaqitzfqxjgspejyzigkqjusdbrthvenfbtirymqaomyrxzinssowylbngxjtgjnqvojpquyaceapsawhrsusiivz"..., length=256, found_semicolon=0x40080f00) at sql_parse.cc:5381 #16 0x00000000006b3e3d in dispatch_command (command=COM_QUERY, thd=0x11bac68, packet=0x11d8849 "", packet_length=257) at sql_parse.cc:907 #17 0x00000000006b51e4 in do_command (thd=0x11bac68) at sql_parse.cc:669 #18 0x00000000006a3392 in handle_one_connection (arg=0x118ea68) at sql_connect.cc:1091 #19 0x00002b0511dd9225 in start_thread () from /lib/libpthread.so.0 #20 0x00002b0512ba733d in clone () from /lib/libc.so.6
[10 Aug 2007 19:07]
Hakan Küçükyılmaz
Memory leak is fixed. Crash and assertion will be reported as new bugs.