Bug #33362 | Query cache invalidation may hang/crash if cached query uses many tables | ||
---|---|---|---|
Submitted: | 19 Dec 2007 10:42 | Modified: | 12 Aug 2008 19:22 |
Reporter: | Bjorn Julander | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Query Cache | Severity: | S2 (Serious) |
Version: | all | OS: | Any |
Assigned to: | Davi Arnaut | CPU Architecture: | Any |
[19 Dec 2007 10:42]
Bjorn Julander
[19 Dec 2007 10:44]
Bjorn Julander
Script to prepare db + comment on usage
Attachment: hang1.zip (application/zip, text), 54.89 KiB.
[19 Dec 2007 20:26]
Susanne Ebrecht
Hello Bjorn, I took your test and did exactly what you described. Of course it needed time, to create all ... mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); Empty set (5 min 43.49 sec) mysql> truncate table isupquarterly_2008_06_02_09; ERROR 1105 (HY000): MyISAM table 'isupquarterly_2008_06_02_09' is in use (most likely by a MERGE table). Try FLUSH TABLES. mysql> flush tables; Query OK, 0 rows affected (4 min 29.14 sec) mysql> truncate table isupquarterly_2008_06_02_09; Query OK, 0 rows affected (0.00 sec) mysql> select version()\G *************************** 1. row *************************** version(): 5.0.54-debug 1 row in set (0.00 sec) select, truncate and flush needed time, but they worked. Also I connected to another database with: $ mysql mydatabase No problem, I could work on this database during the select, truncate and flush were running. Of course, it was a little bit slower, but it worked. In made the same tests again with: mysql> select version()\G *************************** 1. row *************************** version(): 5.0.51-debug 1 row in set (0.02 sec) That's our latest released community server. The tests have had the same results. All worked fine. My testsystem: $ uname -a Linux swedishchef 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux $ mount /dev/sda2 on / type ext3 (rw,errors=remount-ro) My my.cnf: [mysqld] port=5054 socket=/tmp/mysql50bk.sock user=miracee [mysqld_safe] open-files-limit=65535 [client] port=5054 socket=/tmp/mysql50bk.sock Please, try one of our newer versions like 5.0.51 or 5.0.52. It also could be an error of your file system. As you can see, there was curious behaviour with ext3. The version 5.0.49 confused me too. We never have had such a version. Maybe you meant 5.0.48. Also you didn't tell us, how you installed MySQL. Please, consider, we only support our own packages. We don't support packages from distributors. Many thanks for choosing MySQL.
[20 Dec 2007 8:02]
Bjorn Julander
Susanne, thanks for taking your time to check this! Too bad you cannot reproduce the error. I have tested some more and the error seems to be OS specific to RedHat RHEL4, but not very sensitive to kernel version. Perhaps some library that is included creates the malfunction? To verify this, I would like to try a statically built mysql, but am unable to find a prebuilt statically linked such for 64-bit/AMD. Does a such exist or do I have to create it manually? For what it is worth, I can confirm that of the four testboxes we have in our lab, all four has the identical hanging. We also have a number of customer installation with a similar behaviour. We have not tested exactly this test script on them, but we get a very similar hanging on many installations (30+ instances). You asked about mysql-installations. We tried binary releases and RPM version, all downloaded from mysql.com. And yes, I wrote wrong on version, it should have been 5.0.48. Some info about the machines that have the hanging: * single and dual CPU, all have dualCore (HP DL 365) * All running RedHat enterprise 4, 'RHEL4', update 4 * File system is ext3. You suggested some oddities about this, can you elaborate? Any other info I can provide that would be helpful? Again, thansk for taking your time about this!
[3 Jan 2008 13:23]
Susanne Ebrecht
Hi Björn, we don't provide "statically linked" for 64bit. We provided once one for Linux 32bit, but this only occurs problems. For the newest community version (5.0.51), we only provide the source code. Please try one of our newest versions and let us know, if the error still occurs.
[28 Jan 2008 9:57]
Bjorn Julander
A bit slow on updates, but here we go: We have this problem on all 5.0 versions tested. We have tried to switch kernel, compiling MySQL and installing through RPMs. We have the same issue on all 5.0 versions. The scenario I reported was a way for us to hang the system every time in a systematic fashion. What I'm trying to say is that we do have this issue never the less on all our servers, the above description is the only way we've been able to trigger it systematically. It seem to be dependent on HW and linux release, but not too much on kernel versions. The actual issue might be related to or even identical to this bug report: http://bugs.mysql.com/bug.php?id=18137 I'm especially thinking about the entries starting from 2:d December and after that. The behaviour looks identical to our issue.
[14 Feb 2008 13:39]
Susanne Ebrecht
Björn, I wanted to try the tests again with a RedHat machine. Therefor I need to know: 32bit or 64bit?
[14 Feb 2008 14:23]
Bjorn Julander
Susanne, we're using 64-bit only, both on operating system and on mysql release. We tried to run 32-bit mysql on the same machine, and it also hang. If you want any more machine/OS specific details, just ask.
[8 Apr 2008 15:33]
Susanne Ebrecht
This is a very weird bug. I tested it again yesterday/today by using RedHat RHEL 4 (64bit) and by using my Ubuntu. I made the same tests as in December. -bash-3.00$ uname -a Linux ndbsup-4.mysql.com 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux $ uname -a Linux swedishchef 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux I know that I used the same Ubuntu system in December for testing. The only thing I did here from December until know was: "apt-get update && apt-get upgrade" I remeber this December day of testing and that I could work as usual on the system during the "SELECT/FLUSH/TRUNCATE" command on the "hang1" database. Today I got the same on the Red Hat and on the Ubuntu system: I set the open files limit to 65536. $ ./bin/mysql -A hang1.sql need time but works fine ... 1. Thread: $ ./bin/mysql -A hang1 mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); 2. Thread: $ ./bin/mysql mydb hangs until the select is finished. 3. Thread: $ ./bin/mysql mysql> use test; hangs until the select is finished. 4. Thread: here I already have had opened the CLI with use of "mydb" mysql> select * from t; hangs until the select is finished. 5. Thread: $ ./bin/mysql -A mydb ahhh this works .... but: mysql> select * from t; hangs .... The same during "flush table" or "truncate". During this I made: $ strace ./bin/mysql test strace showed me that all worked fine until here: write(3, "\f\0\0\0\3show tables", 16) = 16 read(3, I made this test again and used gdb -p <PID> (gdb) thread apply all bt This shows me that there is a confusion in libc/libpthread. I didn't got a total hang like the bug reporter here but I also figured out: the newer the kernel the more problems will you get. Unfortunately I only could do some short test with newer RH Linux Kernel (2.6.9-34) but during these test I got a total hang too. Summary of my tests: - The newer the kernel the more problems are here - Its not only RedHat related, you will run into these problem on other Linux distributions too but RedHat is the system where you will find it surely. - There is a common with newer libc/libpthread
[8 Apr 2008 19:33]
Bjorn Julander
Thank you Susanne for taking time to test and validate this properly. It's nice to hear that you at least get a result somewhat similar to what we get, perhaps that will make it somewhat more understandable or even debuggable. I might add that we have also seen the same behaviour in MySQL 5.1.23 (rc) and also on large set of kernels. Not with the attached script, but with our running product. I'm really, really convinced that it is the same root cause. Our general feeling, after seen this hanging several hundred times by now, is that the functions max() and min() on large merge tables improve the odds for the hanging considerably.
[11 Apr 2008 17:19]
Susanne Ebrecht
I made tests again with Ubuntu: Linux mycomputer 2.6.22-14-generic #1 SMP Thu Jan 31 23:33:13 UTC 2008 x86_64 GNU/Linux using: NPTL 2.6.1 --- Test: open_files are set to 65536 Thread 1: connect to a database and create a test table t. Stay connected. Thread 2: $ ./bin/mysql -A < hang1.sql $ ./bin/mysql -A hang1 mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); mysql> truncate table isupquarterly_2008_06_02_09; mysql> flush tables; mysql> truncate table isupquarterly_2008_06_02_09; During select/truncate/flush process: Thread 3: $ ./bin/mysql test Thread 1: mysql> select * from t; or an insert/update/delete query. Thread 4: $ ./bin/mysql mysql> select * from test.t; or an insert/update/delete query. or just: mysql> use test; --- MySQL 5.0-bk-tree: During Thread 2 is doing the select/flush all other threads are waiting. If thread 2 is finishing select or flush or truncate you will get a connection to another database and the queries that you tried on other database will be executed. mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); Empty set (6 min 1.47 sec) mysql> flush tables; Query OK, 0 rows affected (2 min 58.11 sec) mysql> truncate table isupquarterly_2008_06_02_09; Query OK, 0 rows affected (0.00 sec) --- MySQL 5.1-bk-tree: Thread 2: mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); I can work fine on thread 1,3 and 4 during this select. Unfortunately the select is hanging. I aborted after 20 minutes waiting. mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); Query aborted by Ctrl+C $ ./bin/mysql -A hang1 This just hangs I tried another connection: $ ./bin/mysql test This hanged too I tried one of the still open connection: mysql> select * from t; hanged ... $ ./bin/mysqladmin shutdown hangs I had to use: $ kill -9 pid and then to restart the daemon. After restart: mysql> truncate table isupquarterly_2008_06_02_09; Query OK, 0 rows affected (0.00 sec) mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly); all hangs again ... --- MySQL 6.0 bk-tree: The same behaviour as 5.1
[23 Apr 2008 11:11]
Ingo Strüwing
Do you use debug servers for testing? On my Debian machine with debug servers I got 5 min in 5.0 (where you got 6) and 55 min in 5.1 (where you killed it after 20). Perhaps you was not patient enough. I noticed 100% CPU during the whole select. You said just "hang". So it's not clear to me if you had a different behavior. Using production servers (non-debug), I got 47 and 53 seconds respectively. The performance difference comes from the new MERGE implementation in 5.1. It is less efficient, but fixes a lot of MERGE table problems. BTW. In 5.1 the MyISAM tables are opened through the table cache. If you access the MERGE table frequently, I suggest to set table-open-cache and table-definition-cache greater than 9601 for the 9600 sub-tables. If multiple sessions access that MERGE table in parallel, multiply table-open-cache by the number of sessions.
[23 Apr 2008 11:40]
Bjorn Julander
Ingo: I'm the author of the original bug report. Some comments on your comment: On our servers, the hang is indefinite, on several occasions many days, perhaps 4-6 days, at the longest for several weeks on an non supervised box. One can also note that if you do a system trace on these queries on the mysqld process and it's childs, there is no activity after say one second. When we are experiencing this issue, there was only one single query accessed the merge table, that query works fine. It is the then the first question that accesses any of the tables that the merge consists of that will cause the hang, ie the usage of the select on the merge table is only a 'preparation' phase. Also, the CPU load is 0 and the machine seem to completely idling at this stage. Using other tools such as vmstat etc. also show no I/O activity at all, and as I mentioned earlier, system traces show no activity at all. The above seem to be different than the behaviour you have, but this is also noticed, this really changes on different hw/sw, notably kernel and/or different linux distributions. We have a couple of boxes at office where this is 100% repeatable using the original instructions (start of this thread).
[23 Apr 2008 12:52]
Ingo Strüwing
Thanks Bjorn. My comment was more directed to Susanne, who used a similar system as me. I understand that the bug is not fixed though we cannot repeat it on our local PCs. I will try to repeat the real problem on a machine with Redhat Linux. OTOH it could be helpful if you could provide stack backtraces of all threads that participate in the hang. This might be just one thread sitting on the mutex (or futex as you said) and all others waiting for SQL statements, I/O or other system events. But it could also be two threads waiting on their resources mutually. In seldom cases even more than two threads block each other. These backtraces, in conjunction with the history of SQL statements executed by each thread, would be what I would try to get at by repeating the test on our machines. In a majority of cases this information is sufficient to understand what went wrong with lockups. There are other cases though... Regards Ingo
[23 Apr 2008 15:45]
Ingo Strüwing
Hi Bjorn, the only machine with Redhat, I could get my hands at, is a 32-bit machine and has a file limit of 1024, which I cannot increase. I reduced the test file to use 250 tables only. The result worked fine. No hang. So my last hope is that you can retrieve the backtraces for us as described in my email. Regards Ingo
[29 May 2008 19:37]
Davi Arnaut
I was able to reproduce it on 5.1 head and yes, it's query cache related.
[29 May 2008 21:42]
Davi Arnaut
Hi Bjorn, I think I nailed down the problem in the query cache code and the fix seems to be working fine so far. I'm going to test it some more and if everything goes well a patch will be ready for review friday. Thanks for you detailed explanation and the provided backtrace, it was crucial to reproduce and isolate the problem.
[30 May 2008 15:18]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/47263 2658 Davi Arnaut 2008-05-30 Bug#33362: Query cache invalidation (truncate) may hang if cached query uses many tables The problem was that query cache would not properly cache queries which used 256 or more tables but yet would leave behind query cache blocks pointing to freed data. Later when invalidating the query cache (due to a truncate), query cache would attempt to grab a lock which resided in the freed data, leading to hangs or undefined behavior. This was happening due to a improper return value from the function responsible for registering the tables used in the query (so the cache can be invalidated if one of the tables is modified). The function expected a return value of type boolean (char, 8 bits) that indicating success (1) or failure (0) but the number of tables registered (unsigned int, 32 bits) was being returned instead. This caused the function to return failure for cases where it had actually succeed because when an type (unsigned int) is converted to a narrower type (char), the excess bits on the left are discarded. Thus if the 8 rightmost bit were zero, the return value would be 0 (failure). The solution is to simply return true (1) only if the number of registered table is greater than zero and false (0) otherwise.
[30 May 2008 15:46]
Kristofer Pettersson
Review over irc => patch approved.
[3 Jun 2008 13:59]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/47367 2660 Davi Arnaut 2008-06-03 Bug#33362: Query cache invalidation (truncate) may hang if cached query uses many tables The problem was that query cache would not properly cache queries which used 256 or more tables but yet would leave behind query cache blocks pointing to freed (destroyed) data. Later when invalidating (due to a truncate) query cache would attempt to grab a lock which resided in the freed data, leading to hangs or undefined behavior. This was happening due to a improper return value from the function responsible for registering the tables used in the query (so the cache can be invalidated later if one of the tables is modified). The function expected a return value of type boolean (char, 8 bits) indicating success (1) or failure (0) but the number of tables registered (unsigned int, 32 bits) was being returned instead. This caused the function to return failure for cases where it had actually succeed because when a type (unsigned int) is converted to a narrower type (char), the excess bits on the left are discarded. Thus if the 8 rightmost bits are zero, the return value will be 0 (failure). The solution is to simply return true (1) only if the number of registered table is greater than zero and false (0) otherwise.
[3 Jun 2008 14:17]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/47373 2660 Davi Arnaut 2008-06-03 Bug#33362: Query cache invalidation (truncate) may hang if cached query uses many tables The problem was that query cache would not properly cache queries which used 256 or more tables but yet would leave behind query cache blocks pointing to freed (destroyed) data. Later when invalidating (due to a truncate) query cache would attempt to grab a lock which resided in the freed data, leading to hangs or undefined behavior. This was happening due to a improper return value from the function responsible for registering the tables used in the query (so the cache can be invalidated later if one of the tables is modified). The function expected a return value of type boolean (char, 8 bits) indicating success (1) or failure (0) but the number of tables registered (unsigned int, 32 bits) was being returned instead. This caused the function to return failure for cases where it had actually succeed because when a type (unsigned int) is converted to a narrower type (char), the excess bits on the left are discarded. Thus if the 8 rightmost bits are zero, the return value will be 0 (failure). The solution is to simply return true (1) only if the number of registered table is greater than zero and false (0) otherwise.
[3 Jun 2008 14:18]
Davi Arnaut
Queued to 5.1-bugteam
[23 Jul 2008 10:54]
Georgi Kodinov
Pushed in 5.1.28 and 6.0.7-alpha
[24 Jul 2008 0:23]
Paul DuBois
Noted in 5.1.28, 6.0.7 changelogs. Cached queries that used 256 or more tables were not properly cached, so that later query invalidation due to a TRUNCATE TABLE for one of the tables caused the server to hang. Setting report to Patch queued pending push of fix into 5.0.x.
[24 Jul 2008 7:05]
Georgi Kodinov
This bug has a target version 5.1+. Currently it won't go in 5.0.
[24 Jul 2008 7:08]
Simon Mudd
Is there any plan to backport this to 5.0 which is used in production by a lot of companies?
[24 Jul 2008 7:15]
Georgi Kodinov
Not plans atm to backport it to 5.0 : as you've mentioned it's a production release and the fix for this bug is too complicated for us to be sure that it won't destabilize 5.0 without the additional testing that it will get in 5.1.
[24 Jul 2008 10:27]
Georgi Kodinov
Changing target version to 5.0 after double-checking the risk of the bug fix with key developers.
[24 Jul 2008 10:40]
Susanne Ebrecht
Simon, sorry, this was an internal missunderstanding ... as Georgi just told we will check the risk again and if risk is not too risky we will backport it to 5.0.
[24 Jul 2008 14:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50414 2648 Davi Arnaut 2008-07-24 Cherry-pick Bug#33362 from mysql-5.1
[24 Jul 2008 14:29]
Davi Arnaut
Queued to 5.0-bugteam
[28 Jul 2008 8:27]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50584 2649 Igor Babaev 2008-07-28 [merge] Merge
[28 Jul 2008 16:49]
Bugs System
Pushed into 5.1.27 (revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (version source revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (pib:3)
[28 Jul 2008 17:45]
Paul DuBois
Resetting report to Patch Queued pending push of fix into 5.0.x.
[12 Aug 2008 14:53]
Bugs System
Pushed into 6.0.7-alpha (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812141852-8e6knbqclpfd8irn) (pib:3)
[12 Aug 2008 15:16]
Bugs System
Pushed into 5.1.28 (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812142843-he05ncsggstbn57z) (pib:3)
[12 Aug 2008 18:51]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51449 2647 Davi Arnaut 2008-08-12 [merge] Merge mysql-5.0-bugteam into mysql-5.0
[12 Aug 2008 19:02]
Paul DuBois
Setting report to Patch Queued pending push of fix into 5.0.x.
[12 Aug 2008 19:13]
Bugs System
Pushed into 5.0.68 (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812185100-d47qb8mz2ye6pe6b) (pib:3)
[12 Aug 2008 19:22]
Paul DuBois
Noted in 5.0.68 changelog.
[14 Aug 2008 7:33]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51602 2684 He Zhenxing 2008-08-14 [merge] Merge 6.0 -> 6.0-rpl-testfixes
[14 Aug 2008 7:36]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51603 2684 He Zhenxing 2008-08-14 [merge] Merge 6.0 -> 6.0-rpl-testfixes
[19 Aug 2008 9:39]
MySQL Verification Team
just noting that all folks who keep 1 base table per day starting at 01/01 will hit 256 tables in the merge set on september 12.
[28 Aug 2008 20:14]
Bugs System
Pushed into 6.0.7-alpha (revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (version source revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (pib:3)
[14 Sep 2008 3:21]
Bugs System
Pushed into 6.0.7-alpha (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:sven@mysql.com-20080818175803-c1nutd5773r6b4gd) (pib:3)
[8 Oct 2008 22:23]
Patric Müller
Hello developers, Hello Bjorn (Submitter), I am using mysql 5.1.26 (the non-patched revision in focus of this bug) and I am also experiencing random server crashes with a large query cache. To check if the cause of my problem is the same which is described here, I'd be very glad if either Bjorn or one of the developers can send/post the provided backtraces. Thank you! Patric