MySQL Bugs: #33362: Query cache invalidation may hang/crash if cached query uses many tables

Bug #33362	Query cache invalidation may hang/crash if cached query uses many tables
Submitted:	19 Dec 2007 10:42	Modified:	12 Aug 2008 19:22
Reporter:	Bjorn Julander	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Query Cache	Severity:	S2 (Serious)
Version:	all	OS:	Any
Assigned to:	Davi Arnaut	CPU Architecture:	Any

Description:
mysqld hangs infinitely on simple select/truncate table
after preparation has been made.

We have a large merge table (with or without data in the underlying tables
does not matter). A select with a join on that table work as it should,
but the select / usage following that select will hang mysqld
forever. No activity is being made at all at this stage.
mysqld as a process hangs on a futex (seen by running strace).

This is observed on a set of machines and is easily repeatable
on all machines.

We are running RedHat Enterprise RHEL 4 update 4.
We tried several mysql-versions:
5.0.24, 5.0.26, 5.0.45, 5.0.49, 5.1.22
and also several kernel versions:
2.6.9-34.ELsmp, 2.6.9-55.0.12.ELsmp, 2.6.18-8, 2.6.23.9

Machines are all AMD 64-bit double dualCore:

[root@sunfire qgate_isupgen]# uname -a
Linux sunfire 2.6.23.9 #1 SMP Fri Dec 14 16:40:04 CET 2007 x86_64 x86_64 x86_64 GNU/Linux

Also, one cannot access/connect to it either. start/stop scripts
does not work andf there are no logs in the .err file.
The only way to restart is to 'kill -KILL on mysqld process.

How to repeat:
Start mysql with any of the versions mentioned above.
Our my.cnf is default 'my'huge', but with the addition of
max-open-files set to 65535

We then create a big set (9600 in our case) of tables like the following:
CREATE TABLE `template` (a int) ENGINE=MyISAM;

Then create a merge table containing all of the tables.

The attached script 'hang1.sql' will create this structure in the
database hang1.

Then do the following select:

select a from isupquarterly where a>=(select max(a) from isupquarterly);

Works fine. But commands following this that usa any of the tables above
will hang mysqld. For example:

truncate table isupquarterly_2008_06_02_09;

To sum up:
1) Prepare database using hang1.sql script
2) issue command: select a from isupquarterly where a>=(select max(a) from isupquarterly);
3) issue command: truncate table isupquarterly_2008_06_02_09;

Now the system will hang.

Also, one cannot access/connect to it either. start/stop scripts
does not work andf there are no logs in the .err file.
The only way to restart is to 'kill -KILL on mysqld process.

Script to prepare db + comment on usage

Attachment: hang1.zip (application/zip, text), 54.89 KiB.

Hello Bjorn,

I took your test and did exactly what you described.

Of course it needed time, to create all ...

mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);
Empty set (5 min 43.49 sec)

mysql> truncate table isupquarterly_2008_06_02_09;
ERROR 1105 (HY000): MyISAM table 'isupquarterly_2008_06_02_09' is in use (most likely by a MERGE table). Try FLUSH TABLES.
mysql> flush tables;
Query OK, 0 rows affected (4 min 29.14 sec)

mysql> truncate table isupquarterly_2008_06_02_09;
Query OK, 0 rows affected (0.00 sec)

mysql> select version()\G
*************************** 1. row ***************************
version(): 5.0.54-debug
1 row in set (0.00 sec)

select, truncate and flush needed time, but they worked. Also I connected to another database with:
$ mysql mydatabase

No problem, I could work on this database during the select, truncate and flush were running.

Of course, it was a little bit slower, but it worked.

In made the same tests again with:
mysql> select version()\G
*************************** 1. row ***************************
version(): 5.0.51-debug
1 row in set (0.02 sec)

That's our latest released community server.

The tests have had the same results. All worked fine.

My testsystem:
$ uname -a
Linux swedishchef 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux
$ mount
/dev/sda2 on / type ext3 (rw,errors=remount-ro)

My my.cnf:
[mysqld]
port=5054
socket=/tmp/mysql50bk.sock
user=miracee
[mysqld_safe]
open-files-limit=65535
[client]
port=5054
socket=/tmp/mysql50bk.sock

Please, try one of our newer versions like 5.0.51 or 5.0.52.

It also could be an error of your file system. As you can see, there was curious behaviour with ext3.

The version 5.0.49 confused me too. We never have had such a version. Maybe you meant 5.0.48.

Also you didn't tell us, how you installed MySQL. Please, consider, we only support our own packages. We don't support packages from distributors.

Many thanks for choosing MySQL.

Susanne, thanks for taking your time to check this!

Too bad you cannot reproduce the error. I have tested some more
and the error seems to be OS specific to RedHat RHEL4,
but not very sensitive to kernel version. Perhaps some
library that is included creates the malfunction?

To verify this, I would like to try a statically built mysql, but 
am unable to find a prebuilt statically linked such for 64-bit/AMD.
Does a such exist or do I have to create it manually?

For what it is worth, I can confirm that of the four testboxes we have 
in our lab, all four has the identical hanging. We also have a number
of customer installation with a similar behaviour. We have not tested
exactly this test script on them, but we get a very similar hanging
on many installations (30+ instances).

You asked about mysql-installations. We tried binary releases and RPM 
version, all downloaded from mysql.com. And yes, I wrote wrong on version, 
it should have been 5.0.48.

Some info about the machines that have the hanging:
* single and dual CPU, all have dualCore (HP DL 365)
* All running RedHat enterprise 4, 'RHEL4', update 4
* File system is ext3. You suggested some oddities about this,
  can you elaborate?

Any other info I can provide that would be helpful?

Again, thansk for taking your time about this!

Hi Björn,

we don't provide "statically linked" for 64bit. We provided once one for Linux 32bit, but this only occurs problems.

For the newest community version (5.0.51), we only provide the source code.

Please try one of our newest versions and let us know, if the error still occurs.

A bit slow on updates, but here we go:

We have this problem on all 5.0 versions tested. We have tried
to switch kernel, compiling MySQL and installing through RPMs.

We have the same issue on all 5.0 versions. The scenario I reported
was a way for us to hang the system every time in a systematic 
fashion. 
What I'm trying to say is that we do have this issue never the less
on all our servers, the above description is the only way we've been able
to trigger it systematically. It seem to be dependent on HW and 
linux release, but not too much on kernel versions.

The actual issue might be related to or even identical to this bug report:
http://bugs.mysql.com/bug.php?id=18137

I'm especially thinking about the entries starting from 2:d December and
after that. The behaviour looks identical to our issue.

Björn,

I wanted to try the tests again with a RedHat machine.

Therefor I need to know: 32bit or 64bit?

Susanne,

we're using 64-bit only, both on operating system and on mysql
release. We tried to run 32-bit mysql on the same machine,
and it also hang.

If you want any more machine/OS specific details, just ask.

This is a very weird bug.

I tested it again yesterday/today by using RedHat RHEL 4 (64bit) and by using my Ubuntu.

I made the same tests as in December.

-bash-3.00$ uname -a
Linux ndbsup-4.mysql.com 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux

$ uname -a
Linux swedishchef 2.6.20-16-generic #2 SMP Sun Sep 23 18:31:23 UTC 2007 x86_64 GNU/Linux

I know that I used the same Ubuntu system in December for testing. The only thing I did here from December until know was: "apt-get update && apt-get upgrade"

I remeber this December day of testing and that I could work as usual on the system during the "SELECT/FLUSH/TRUNCATE" command on the "hang1" database.

Today I got the same on the Red Hat and on the Ubuntu system:

I set the open files limit to 65536.

$ ./bin/mysql -A hang1.sql

need time but works fine ...

1. Thread:
$ ./bin/mysql -A hang1
mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);

2. Thread:
$ ./bin/mysql mydb
hangs until the select is finished.

3. Thread:
$ ./bin/mysql
mysql> use test;
hangs until the select is finished.

4. Thread:
here I already have had opened the CLI with use of "mydb"
mysql> select * from t;
hangs until the select is finished.

5. Thread:
$ ./bin/mysql -A mydb
ahhh this works .... but:
mysql> select * from t;
hangs ....

The same during "flush table" or "truncate".

During this I made:

$ strace ./bin/mysql test

strace showed me that all worked fine until here:
write(3, "\f\0\0\0\3show tables", 16) = 16
read(3,

I made this test again and used gdb -p <PID>
(gdb) thread apply all bt

This shows me that there is a confusion in libc/libpthread.

I didn't got a total hang like the bug reporter here but I also figured out: the newer the kernel the more problems will you get.

Unfortunately I only could do some short test with newer RH Linux Kernel (2.6.9-34) but during these test I got a total hang too.

Summary of my tests:
- The newer the kernel the more problems are here
- Its not only RedHat related, you will run into these problem on other Linux distributions too but RedHat is the system where you will find it surely.
- There is a common with newer libc/libpthread

Thank you Susanne for taking time to test and validate
this properly. It's nice to hear that you at least 
get a result somewhat similar to what we get,
perhaps that will make it somewhat more understandable
or even debuggable.

I might add that we have also seen the same behaviour in 
MySQL 5.1.23 (rc) and also on large set of kernels. 
Not with the attached script, but with our running 
product. I'm really, really convinced that it is the 
same root cause. Our general feeling, after seen this
hanging several hundred times by now, is that the 
functions max() and min() on large merge tables improve the odds
for the hanging considerably.

I made tests again with Ubuntu:

Linux mycomputer 2.6.22-14-generic #1 SMP Thu Jan 31 23:33:13 UTC 2008 x86_64 GNU/Linux

using: NPTL 2.6.1

---
Test:

open_files are set to 65536

Thread 1:
connect to a database and create a test table t. Stay connected.

Thread 2:
$ ./bin/mysql -A < hang1.sql
$ ./bin/mysql -A hang1
mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);
mysql> truncate table isupquarterly_2008_06_02_09;
mysql> flush tables;
mysql> truncate table isupquarterly_2008_06_02_09;

During select/truncate/flush process:

Thread 3:
$ ./bin/mysql test

Thread 1:
mysql> select * from t;

or an insert/update/delete query.

Thread 4:
$ ./bin/mysql
mysql> select * from test.t;

or an insert/update/delete query.

or just: 
mysql> use test;

---
MySQL 5.0-bk-tree:

During Thread 2 is doing the select/flush all other threads are waiting. If thread 2 is finishing select or flush or truncate you will get a connection to another database and the queries that you tried on other database will be executed.

mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);
Empty set (6 min 1.47 sec)

mysql> flush tables;
Query OK, 0 rows affected (2 min 58.11 sec)

mysql> truncate table isupquarterly_2008_06_02_09;
Query OK, 0 rows affected (0.00 sec)

---
MySQL 5.1-bk-tree:

Thread 2:
mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);

I can work fine on thread 1,3 and 4 during this select.

Unfortunately the select is hanging. I aborted after 20 minutes waiting.

mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);
Query aborted by Ctrl+C

$ ./bin/mysql -A hang1
This just hangs

I tried another connection:

$ ./bin/mysql test
This hanged too

I tried one of the still open connection:

mysql> select * from t;
hanged ...

$ ./bin/mysqladmin shutdown
hangs

I had to use:
$ kill -9 pid
and then to restart the daemon.

After restart:
mysql> truncate table isupquarterly_2008_06_02_09;
Query OK, 0 rows affected (0.00 sec)

mysql> select a from isupquarterly where a>=(select max(a) from isupquarterly);

all hangs again ...

---
MySQL 6.0 bk-tree:

The same behaviour as 5.1

Do you use debug servers for testing?

On my Debian machine with debug servers I got 5 min in 5.0 (where you got 6) and 55 min in 5.1 (where you killed it after 20). Perhaps you was not patient enough. I noticed 100% CPU during the whole select. You said just "hang". So it's not clear to me if you had a different behavior.

Using production servers (non-debug), I got 47 and 53 seconds respectively.

The performance difference comes from the new MERGE implementation in 5.1. It is less efficient, but fixes a lot of MERGE table problems.

BTW. In 5.1 the MyISAM tables are opened through the table cache. If you access the MERGE table frequently, I suggest to set table-open-cache and table-definition-cache greater than 9601 for the 9600 sub-tables. If multiple sessions access that MERGE table in parallel, multiply table-open-cache by the number of sessions.

Ingo: I'm the author of the original bug report.

Some comments on your comment:

On our servers, the hang is indefinite, on several occasions
many days, perhaps 4-6 days, at the longest for several weeks on an 
non supervised box. One can also note that if you do a system trace 
on these queries on the mysqld process and it's childs, there is no 
activity after say one second. 

When we are experiencing this issue, there was only one single
query accessed the merge table, that query works fine. 
It is the then the first question that accesses any of the tables
that the merge consists of that will cause the hang,
ie the usage of the select on the merge table is only 
a 'preparation' phase.
Also, the CPU load is 0 and the machine seem to completely idling
at this stage. Using other tools such as vmstat etc. also show
no I/O activity at all, and as I mentioned earlier, system traces 
show no activity at all.

The above seem to be different than the behaviour you have, but this is also
noticed, this really changes on different hw/sw, notably kernel
and/or different linux distributions. We have a couple of boxes at office
where this is 100% repeatable using the original instructions (start of this 
thread).

Thanks Bjorn. My comment was more directed to Susanne, who used a similar system as me. I understand that the bug is not fixed though we cannot repeat it on our local PCs. I will try to repeat the real problem on a machine with Redhat Linux.

OTOH it could be helpful if you could provide stack backtraces of all threads that participate in the hang. This might be just one thread sitting on the mutex (or futex as you said) and all others waiting for SQL statements, I/O or other system events. But it could also be two threads waiting on their resources mutually. In seldom cases even more than two threads block each other.

These backtraces, in conjunction with the history of SQL statements executed by each thread, would be what I would try to get at by repeating the test on our machines. In a majority of cases this information is sufficient to understand what went wrong with lockups. There are other cases though...

Regards Ingo

Hi Bjorn,

the only machine with Redhat, I could get my hands at, is a 32-bit machine and has a file limit of 1024, which I cannot increase.

I reduced the test file to use 250 tables only. The result worked fine. No hang.

So my last hope is that you can retrieve the backtraces for us as described in my email.

Regards Ingo

I was able to reproduce it on 5.1 head and yes, it's query cache related.

Hi Bjorn,

I think I nailed down the problem in the query cache code and the fix seems to be working fine so far. I'm going to test it some more and if everything goes well a patch will be ready for review friday.

Thanks for you detailed explanation and the provided backtrace, it was crucial to reproduce and isolate the problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47263

2658 Davi Arnaut	2008-05-30
      Bug#33362: Query cache invalidation (truncate) may hang
                 if cached query uses many tables
      
      The problem was that query cache would not properly cache
      queries which used 256 or more tables but yet would leave
      behind query cache blocks pointing to freed data. Later when
      invalidating the query cache (due to a truncate), query cache
      would attempt to grab a lock which resided in the freed data,
      leading to hangs or undefined behavior.
      
      This was happening due to a improper return value from the
      function responsible for registering the tables used in the
      query (so the cache can be invalidated if one of the tables
      is modified). The function expected a return value of type
      boolean (char, 8 bits) that indicating success (1) or failure
      (0) but the number of tables registered (unsigned int, 32 bits)
      was being returned instead. This caused the function to return
      failure for cases where it had actually succeed because when an
      type (unsigned int) is converted to a narrower type (char),
      the excess bits on the left are discarded. Thus if the 8
      rightmost bit were zero, the return value would be 0 (failure).
      
      The solution is to simply return true (1) only if the number of
      registered table is greater than zero and false (0) otherwise.

Review over irc => patch approved.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47367

2660 Davi Arnaut	2008-06-03
      Bug#33362: Query cache invalidation (truncate) may hang
                 if cached query uses many tables
      
      The problem was that query cache would not properly cache
      queries which used 256 or more tables but yet would leave
      behind query cache blocks pointing to freed (destroyed)
      data. Later when invalidating (due to a truncate) query cache
      would attempt to grab a lock which resided in the freed data,
      leading to hangs or undefined behavior.
      
      This was happening due to a improper return value from the
      function responsible for registering the tables used in the
      query (so the cache can be invalidated later if one of the
      tables is modified). The function expected a return value of
      type boolean (char, 8 bits) indicating success (1) or failure
      (0) but the number of tables registered (unsigned int, 32 bits)
      was being returned instead. This caused the function to return
      failure for cases where it had actually succeed because when
      a type (unsigned int) is converted to a narrower type (char),
      the excess bits on the left are discarded. Thus if the 8
      rightmost bits are zero, the return value will be 0 (failure).
      
      The solution is to simply return true (1) only if the number of
      registered table is greater than zero and false (0) otherwise.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47373

2660 Davi Arnaut	2008-06-03
      Bug#33362: Query cache invalidation (truncate) may hang
                 if cached query uses many tables
      
      The problem was that query cache would not properly cache
      queries which used 256 or more tables but yet would leave
      behind query cache blocks pointing to freed (destroyed)
      data. Later when invalidating (due to a truncate) query cache
      would attempt to grab a lock which resided in the freed data,
      leading to hangs or undefined behavior.
      
      This was happening due to a improper return value from the
      function responsible for registering the tables used in the
      query (so the cache can be invalidated later if one of the
      tables is modified). The function expected a return value of
      type boolean (char, 8 bits) indicating success (1) or failure
      (0) but the number of tables registered (unsigned int, 32 bits)
      was being returned instead. This caused the function to return
      failure for cases where it had actually succeed because when
      a type (unsigned int) is converted to a narrower type (char),
      the excess bits on the left are discarded. Thus if the 8
      rightmost bits are zero, the return value will be 0 (failure).
      
      The solution is to simply return true (1) only if the number of
      registered table is greater than zero and false (0) otherwise.

Queued to 5.1-bugteam

Pushed in 5.1.28 and 6.0.7-alpha

Noted in 5.1.28, 6.0.7 changelogs.

Cached queries that used 256 or more tables were not properly cached,
so that later query invalidation due to a TRUNCATE TABLE for one of
the tables caused the server to hang.

Setting report to Patch queued pending push of fix into 5.0.x.

This bug has a target version 5.1+. Currently it won't go in 5.0.

Is there any plan to backport this to 5.0 which is used in production by a lot of companies?

Not plans atm to backport it to 5.0 : as you've mentioned it's a production release and the fix for this bug is too complicated for us to be sure that it won't destabilize 5.0 without the additional testing that it will get in 5.1.

Changing target version to 5.0 after double-checking the risk of the bug fix with key developers.

Simon,

sorry, this was an internal missunderstanding ... as Georgi just told we will check the risk again and if risk is not too risky we will backport it to 5.0.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/50414

2648 Davi Arnaut	2008-07-24
      Cherry-pick Bug#33362 from mysql-5.1

Queued to 5.0-bugteam

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/50584

2649 Igor Babaev	2008-07-28 [merge]
      Merge

Pushed into 5.1.27  (revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (version source revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (pib:3)

Resetting report to Patch Queued pending push of fix into 5.0.x.

Pushed into 6.0.7-alpha  (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812141852-8e6knbqclpfd8irn) (pib:3)

Pushed into 5.1.28  (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812142843-he05ncsggstbn57z) (pib:3)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/51449

2647 Davi Arnaut	2008-08-12 [merge]
      Merge mysql-5.0-bugteam into mysql-5.0

Setting report to Patch Queued pending push of fix into 5.0.x.

Pushed into 5.0.68  (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:davi.arnaut@sun.com-20080812185100-d47qb8mz2ye6pe6b) (pib:3)

Noted in 5.0.68 changelog.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/51602

2684 He Zhenxing	2008-08-14 [merge]
      Merge 6.0 -> 6.0-rpl-testfixes

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/51603

2684 He Zhenxing	2008-08-14 [merge]
      Merge 6.0 -> 6.0-rpl-testfixes

just noting that all folks who keep 1 base table per day starting at 01/01 will hit 256 tables in the merge set on september 12.

Pushed into 6.0.7-alpha  (revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (version source revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (pib:3)

Pushed into 6.0.7-alpha  (revid:davi.arnaut@sun.com-20080724141434-e1s0xs64y1ik29zl) (version source revid:sven@mysql.com-20080818175803-c1nutd5773r6b4gd) (pib:3)

Hello developers, Hello Bjorn (Submitter),

I am using mysql 5.1.26 (the non-patched revision in focus of this bug) and I am also experiencing random server crashes with a large query cache.

To check if the cause of my problem is the same which is described here, I'd be very glad if either Bjorn or one of the developers can send/post the provided backtraces.

Thank you!

Patric