MySQL Bugs: #37633: certain queries become very slow

Bug #37633	certain queries become very slow - locking related?
Submitted:	25 Jun 2008 14:58	Modified:	7 Nov 2015 7:37
Reporter:	Don Cohen	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	5.0.45	OS:	Any
Assigned to:	Heikki Tuuri	CPU Architecture:	Any

Description:
This might not be what you view as a bug.  From my point of view,
if the system is working as intended, then I need some way to find
out what the problem is and how to fix it.  If the answer is actually
already somewhere in the doc, then my complaint is that it's too well
hidden.
In this particular case, I have a query (below) that normally takes
a few seconds, and for reasons that I have not yet been able to find,
it suddenly starts taking about a minute.  In the past I've associated
this sort of thing with locking/transactions.  The problem is that I
don't know how to find out what locks or transactions are causing it.
Once the problem occurs it seems not to go away without restarting the
server.  Restarting the server, however, does solve the problem.

Here's a simplified sample showing the problem.

 mysql> select count(*) from sr1 join LEO_RoleRating sr2 where
    sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid
    and mrd=sr2.ratingdate ;
 +----------+
 | count(*) |
 +----------+
 |     2717 |
 +----------+
 1 row in set (31.25 sec)

Now I'll simply copy the table LEO_RoleRating to a new table, sr2:

 mysql> create temporary table sr2  select * from LEO_RoleRating;
 Query OK, 2901 rows affected (0.04 sec)
 Records: 2901  Duplicates: 0  Warnings: 0

and replace the table in the query with its copy:

 mysql> select count(*) from sr1 join sr2 where
    sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid
    and mrd=sr2.ratingdate ;
 +----------+
 | count(*) |
 +----------+
 |     2717 |
 +----------+
 1 row in set (6.00 sec)

Since the copy is unindexed, its performance should be, if anything, worse
than the original.  Just in case you're interested ...

 mysql> explain select count(*) from sr1 join LEO_RoleRating sr2 where
    sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid
    and mrd=sr2.ratingdate ;
 +----+-------------+-------+------+------------------------+-------+---------+\
-----------------------+------+-------------+
 | id | select_type | table | type | possible_keys          | key   | key_len |\
 ref                   | rows | Extra       |
 +----+-------------+-------+------+------------------------+-------+---------+\
-----------------------+------+-------------+
 |  1 | SIMPLE      | sr1   | ALL  | NULL                   | NULL  | NULL    |\
 NULL                  | 2717 |             |
 |  1 | SIMPLE      | sr2   | ref  | _when,_RoleID,_RTypeID | _when | 8       |\
 develop080401.sr1.mrd |    4 | Using where |
 +----+-------------+-------+------+------------------------+-------+---------+\
-----------------------+------+-------------+
 2 rows in set (0.00 sec)

 mysql> explain select count(*) from sr1 join sr2 where
    sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid
    and mrd=sr2.ratingdate ;
 +----+-------------+-------+------+---------------+------+---------+------+---\
---+-------------+
 | id | select_type | table | type | possible_keys | key  | key_len | ref  | ro\
ws | Extra       |
 +----+-------------+-------+------+---------------+------+---------+------+---\
---+-------------+
 |  1 | SIMPLE      | sr1   | ALL  | NULL          | NULL | NULL    | NULL | 27\
17 |             |
 |  1 | SIMPLE      | sr2   | ALL  | NULL          | NULL | NULL    | NULL | 29\
01 | Using where |
 +----+-------------+-------+------+---------------+------+---------+------+---\
---+-------------+
 2 rows in set (0.00 sec)

(BTW, I find the documentation on the meaning of the results above
rather difficult to follow.)

I thought at first that the extra time might be spent all at the beginning,
waiting for some lock, but I find that adding a limit clause to the end of
the slow query results in a proportional time.

I'm using the INNODB storage engine.  I've been trying things like
 show global status
 show engine innodb status
but so far I can't see how I'm supposed to figure out what's going on.

Is there some way to find out what the problem is, and if so, what is it?

How to repeat:
I don't know what specifically causes the transition from fast to slow.
That's what I'm hoping an answer to this question will help me to find out.
I do think it's related to something in my application, since this only started happening a few days ago and now it seems to happen within a few hours of restarting the server.

You have three indexes here: _when,_RoleID,_RTypeID
and index _when is used.

Try without using indexes. Just to look what happens.

MySQL 5.0.45 is quite old. Please try newer version. Our actual version is MySQL 5.0.51b.

You have three indexes here: _when,_RoleID,_RTypeID
  and index _when is used.
  Try without using indexes. Just to look what happens.
This is clearly unrelated to the fact that when I restart the server the
query is fast and then later it becomes much slower.

  MySQL 5.0.45 is quite old. Please try newer version. 
  Our actual version is MySQL 5.0.51b.
The difference between a developer and a real user is that the real user does not want to upgrade all the time.  I've already upgraded to .45 related to another bug.  For now I've worked around the problem.  There are really two questions.
- what can cause this behavior
- how can I find out relevant state of the server
Trying a new version does not help to answer either of those.

We're sorry, but the bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug.

Support on using our products is available both free in our forums at http://forums.mysql.com/ and for a reasonable fee direct from our skilled support engineers at http://www.mysql.com/support/

Thank you for your interest in MySQL.

How do you know this is not a bug if you don't know what causes it?
I think it may very well be a bug.  If not a bug in the sense that the current code is doing the wrong thing, then I would argue that it's a bug in the design if I really cannot find out what the cause is, and if there is a way, then perhaps a bug in the documentation that I can't find out what that way is.
I'm trying to help you determine which sort of bug this is.

"I thought at first that the extra time might be spent all at the beginning,
waiting for some lock"

you use community server, right ? with show profile you can find out where it spends the extra time.

transcript showing sample query, explain query, show profile before & after restarting server

Attachment: mysqlbug (application/octet-stream, text), 9.75 KiB.

> with show profile you can find out where it spends the extra time.
Thanks, that seems like a good start.  I've just added a file showing
my new sample query before and after restarting the server.
It's interesting that the explain query returns something different.
Perhaps what happens as the server runs is related to gathering more
data about queries?  Is there some documentation about how to interpret
the results of explain ?  Or how to control how a query is optimized ?

In the manual. Check the page for EXPLAIN and for SELECT.
When the server starts preferring slow plan try to use hints (check the manual) to force EXPLAIN to show the original, fast, plan. And tell us what you've got (the query with hints, the resulting explain)

> In the manual. Check the page for EXPLAIN and for SELECT.
I've seen this before.  It leaves a lot to be figured out.
One thing I wonder is what should cause the optimizer to change its mind.
I'd have thought that this would be something like updates that change the statistics, but I don't think this explains my situation.

> When the server starts preferring slow plan try to use hints (check the manual)
> to force EXPLAIN to show the original, fast, plan. And tell us what you've got 
> (the query with hints, the resulting explain)
My last file attachment shows the fast and slow plans.
I have now changed the code to ask for the fast plan and that seems to work.
My only worry is that what seems to be working fine one day might stop working the next.  This is why I'd like to understand what causes the plans to change.
Is that question addressed in the manual?

I know that your last file attachment shows the fast and slow plans.
It shows a slow plan before the restart and a fast plan after.

I asked for slow plan before the restart and a fast plan *before* the restart (where you force the optimizer to take the fast plan). In particular I was interested in "rows" column, as optimizer prefers a slow plan it should believe that it would need to access less rows than a fast plan.

> I asked for slow plan before the restart and a fast plan *before* the restart 
Now that I've waited most of a day since the restart this counts as *before*
again, and I did see the same slow plan.  I got the fast plan again as you suggested.  The difference was in the index used in the second row:
mysql> explain select  # sql doesn't know my extreme protocol                  
 count(*) from                                                                   
  (select max(ratingdate) as mrd, LEO_RoleRating.roleid, ratingtypeid            
   from LEO_RoleRating join SelectedRoles on                                     
     LEO_RoleRating.roleid = SelectedRoles.roleid                                
   where status>4       # gearing or better                                      
   and loginname='d'                                                             
   group by roleid, ratingtypeid) sr1                                            
  join -- LEO_RoleRating -- weird problem related to transactions/locking?(*)    
  LEO_RoleRating sr2 use index (_roleid) where                                   
   sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid                   
   and mrd=sr2.ratingdate;

|  1 | PRIMARY     | <derived2>     | ALL  | NULL           | NULL    | NULL    | NULL                               | 2716 |                                              |
|  1 | PRIMARY     | sr2            | ref  | _RoleID        | _RoleID | 5       | sr1.roleid                         |   17 | Using where                                  |
|  2 | DERIVED     | SelectedRoles  | ref  | lname          | lname   | 52      |                                    | 4146 | Using where; Using temporary; Using filesort |
|  2 | DERIVED     | LEO_RoleRating | ref  | _RoleID,Status | _RoleID | 5       | develop080401.SelectedRoles.RoleID |   17 | Using where                                  |

> (where you force the optimizer to take the fast plan). In particular I was 
> interested in "rows" column, as optimizer prefers a slow plan it should 
> believe that it would need to access less rows than a fast plan.
I'm not sure what all that means.  You're trying to see why it has the wrong plan?  The "rows" column is the one with 2716 and 17.  The other two lines are
4146, 17.  
The two plans in the previous file - the slow one gave rows of 2716, 4 (using _when), 4146, 17 and the fast one gave rows of 2716, 1 (using roleid), 4146, 1.
So the question is what caused the estimate for the fast plan to go from
2716 - 1 - 4146 - 1  to 2716 - 17 - 4146 - 17 ?

Okay, I guess it's something inside InnoDB, and an InnoDB developer needs to look at it. Just to have a clearer proof - could you try SHOW INDEX FROM LEO_RoleRating just after restart when you have a fast plan and later when you have a slow plan. Or, better, the other way around - wait for a slow plan, SHOW INDEX, restart, SHOW INDEX. It should show that cardinality values for _RoleID index are off in one of the cases. Also SELECT COUNT(DISTINCT _RoleID) would be nice - it'll show the exact value of the cardinality.

I don't have to wait for the slow version.  Rather I have to restart the server whenever I want the fast version.
Here's the slow state:
mysql>  SHOW 
    -> INDEX FROM LEO_RoleRating
    -> ;
    -> $
ERROR 2006 (HY000): MySQL server has gone away
^GNo connection. Trying to reconnect...
Connection id:    148
Current database: develop080401

+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table          | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| LEO_RoleRating |          0 | PRIMARY  |            1 | ID           | A         |        3229 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | _when    |            1 | RatingDate   | A         |         807 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | _RoleID  |            1 | RoleID       | A         |         189 |     NULL | NULL   | YES  | BTREE      |         |
| LEO_RoleRating |          1 | _RTypeID |            1 | RatingTypeID | A         |           2 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | Status   |            1 | Status       | A         |          11 |     NULL | NULL   |      | BTREE      |         |
+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.01 sec)

Then after restartmysql>  SHOW INDEX FROM LEO_RoleRating;
    -> $
ERROR 2006 (HY000): MySQL server has gone away
^GNo connection. Trying to reconnect...
Connection id:    1
Current database: develop080401

+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table          | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| LEO_RoleRating |          0 | PRIMARY  |            1 | ID           | A         |        3229 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | _when    |            1 | RatingDate   | A         |          10 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | _RoleID  |            1 | RoleID       | A         |        3229 |     NULL | NULL   | YES  | BTREE      |         |
| LEO_RoleRating |          1 | _RTypeID |            1 | RatingTypeID | A         |           2 |     NULL | NULL   |      | BTREE      |         |
| LEO_RoleRating |          1 | Status   |            1 | Status       | A         |           8 |     NULL | NULL   |      | BTREE      |         |
+----------------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.02 sec)

I don't know what the other query is supposed to show.
 SELECT
    -> COUNT(DISTINCT _RoleID) $
ERROR 1054 (42S22): Unknown column '_RoleID' in 'field list'

I mean the query

  SELECT COUNT(DISTINCT _when), COUNT(DISTINCT _RoleID), COUNT(*) FROM LEO_RoleRating;

Sorry, that one:

  SELECT COUNT(DISTINCT RatingDate), COUNT(DISTINCT RoleID), COUNT(*) FROM LEO_RoleRating;

mysql>  SELECT COUNT(DISTINCT RatingDate), COUNT(DISTINCT RoleID), COUNT(*) 
    -> FROM LEO_RoleRating $
+----------------------------+------------------------+----------+
| COUNT(DISTINCT RatingDate) | COUNT(DISTINCT RoleID) | COUNT(*) |
+----------------------------+------------------------+----------+
|                        120 |                   2782 |     2911 |
+----------------------------+------------------------+----------+
1 row in set (0.01 sec)

Don,

do you do lots of inserts/deletes/updates in table LEO_RoleRating?

If the InnoDB purge does not keep up, then the table may grow to occupy a large physical space, which would cause the query to be slow due to disk I/O.

Please print SHOW INNODB STATUS\G a couple of times during the slow query execution. From those outputs we will see what it is doing.

Regards,

Heikki

long query followed by SHOW INNODB STATUS

Attachment: bug-transcript (application/octet-stream, text), 6.17 KiB.

There are essentially no updates going on any time near the query.
I guess I could do several SHOW INNODB STATUS's while the query is running by 
starting another DB connection, but for now I've uploaded another file containing
only one, right after a long query.  I think this query is similar to the example I sent before, but if you like I suppose I could confirm that by doing an explain of it.

Don, we need to understand between the time you restart the server,
when the query is fast, and such a time when the query slows down which
apparently coincides with the change in the execution plan. So what we need is more SHOW INNODB STATUS then you do query executions. For example, you could run SHOW INNODB STATUS every 10 seconds and the sequence of 'explain query - execute query' every minute, if it takes, say, 10 minutes for the query to slow down.
At the very minimum we need what Heikki asked: execute SHOW INNODB STATUS a few times during the slow query, for example start running SHOW INNODB STATUS every one second and then run the slow (15-30 sec) query.
I also have the following question: when you say 'reastart mysqld', how do you shut it down (kill -9 or mysqladmin or something else), and could you show the output of 

SGOW GLOBAL VARIABLES LIKE '%INNO%';

transcript of several "show innodb status" commands while slow query running

Attachment: trans1 (application/octet-stream, text), 35.59 KiB.

Wow, out of the blue after several months - takes some time for me to 
recover the context of this problem. 
BTW it's much less convenient for me to enter this reply into the web 
page than to reply by email. 
 > At the very minimum we need what Heikki asked: execute SHOW INNODB 
 > STATUS a few times during the slow query, for example start running 
 > SHOW INNODB STATUS every one second and then run the slow (15-30 sec) 
 > query. 
That's the file I just uploaded.
Of course, this won't show the transition from fast to slow.
While this transcript was being collected in a shell the db client transcript looked like the following:
mysql>select                                                                    
 count(*) from                                                                  
  (select max(ratingdate) as mrd, LEO_RoleRating.roleid, ratingtypeid           
   from LEO_RoleRating join SelectedRoles on                                    
     LEO_RoleRating.roleid = SelectedRoles.roleid                               
   where status>4                                                               
   and loginname='dan'                                                          
   group by roleid, ratingtypeid) sr1                                           
  join                                                                          
  LEO_RoleRating sr2 where                                                      
   sr1.roleid=sr2.roleid and sr1.ratingtypeid=sr2.ratingtypeid                  
and mrd=sr2.ratingdate;
    ->     ->     ->     ->     ->     ->     ->     ->     ->     ->     -> +-\
---------+
| count(*) |
+----------+
|     2716 |
+----------+
1 row in set (31.87 sec)

====
 > I also have the following question: when you say 'reastart mysqld', how 
 > do you shut it down (kill -9 or mysqladmin or something else), and could 
Off hand I don't remember.  I rarely kill it.  I've occasionally seen bugs that
cause it to die.
I probably do kill in one variation or another.  Why does it make a difference?
 > you show the output of  
 > SGOW GLOBAL VARIABLES LIKE '%INNO%'; 
I guess that last line should be "SHOW". 
Here's the current value (after the server has been up a long time and the query is slow)
^Gmysql> SHOW GLOBAL VARIABLES LIKE '%INNO%';
+---------------------------------+------------------------+
| Variable_name                   | Value                  |
+---------------------------------+------------------------+
| have_innodb                     | YES                    |
| innodb_additional_mem_pool_size | 1048576                |
| innodb_autoextend_increment     | 8                      |
| innodb_buffer_pool_awe_mem_mb   | 0                      |
| innodb_buffer_pool_size         | 8388608                |
| innodb_checksums                | ON                     |
| innodb_commit_concurrency       | 0                      |
| innodb_concurrency_tickets      | 500                    |
| innodb_data_file_path           | ibdata1:10M:autoextend |
| innodb_data_home_dir            | /home/dcohen/mysql/    |
| innodb_doublewrite              | ON                     |
| innodb_fast_shutdown            | 1                      |
| innodb_file_io_threads          | 4                      |
| innodb_file_per_table           | OFF                    |
| innodb_flush_log_at_trx_commit  | 1                      |
| innodb_flush_method             |                        |
| innodb_force_recovery           | 0                      |
| innodb_lock_wait_timeout        | 50                     |
| innodb_locks_unsafe_for_binlog  | OFF                    |
| innodb_log_arch_dir             | /home/dcohen/mysql/    |
| innodb_log_archive              | OFF                    |
| innodb_log_buffer_size          | 1048576                |
| innodb_log_file_size            | 5242880                |
| innodb_log_files_in_group       | 2                      |
| innodb_log_group_home_dir       | /home/dcohen/mysql/    |
| innodb_max_dirty_pages_pct      | 90                     |
| innodb_max_purge_lag            | 0                      |
| innodb_mirrored_log_groups      | 1                      |
| innodb_open_files               | 300                    |
| innodb_rollback_on_timeout      | OFF                    |
| innodb_support_xa               | ON                     |
| innodb_sync_spin_loops          | 20                     |
| innodb_table_locks              | ON                     |
| innodb_thread_concurrency       | 8                      |
| innodb_thread_sleep_delay       | 10000                  |
+---------------------------------+------------------------+
35 rows in set (0.00 sec)

I gather what you really want is for me to restart the server then start doing
 show innodb status
every few seconds
while running the query more or less continuously until it becomes slow.
I guess I can arrange to do that, but it's a sufficiently big deal that before
I do I'd like to let you look at what I've just uploaded and let you verify that this is really what you want.  The transcript is likely to be rather large.
I notice you limit the size of uploads.

If you still have same problem on 5.5.45/5.6.27 or newer, please reopen this.
I believe the various innodb variables regarding index stats would help you.

https://dev.mysql.com/doc/refman/5.6/en/innodb-persistent-stats.html

Note:  after populating a clean table, first run ANALYZE TABLE before expecting indexes to be updated (background thread takes ~10 seconds).