MySQL Bugs: #61677: thread stall on insert

Bug #61677	thread stall on insert
Submitted:	28 Jun 2011 19:14	Modified:	14 Jul 2011 19:51
Reporter:	Greg Hazel	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: General	Severity:	S5 (Performance)
Version:	5.5.12	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
I'm experiencing an issue where sometimes INSERTs do not return for 20-30 seconds, and meanwhile the query is not present in SHOW FULL PROCESSLIST;

During this time the transaction on that thread remains open, and other queries waiting on locks it holds also wait.

I started logging information_schema.processlist and information_schema.innodb_trx once a second while this occurs. Neither show the query running for any of the seconds while the thread is hung. The general query log shows the query exactly when the client sent it, but nothing else on that thread for 25 seconds.

How to repeat:
This does not seem specific to a particular table, although it occurs more frequently with the more heavily used tables. The tables themselves are quite large - around 10GB. There are many of them totally 300GB, and the machine only has a 35GB buffer pool.

I catch these stalled inserts quite frequently - sometimes a few an hour. I would be happy to add or run any debugging to provide more information.

It would be nice to get INNODB STATUS and vmstat/iostat results during this period of 25 seconds with no visible progress.

I added the innodb status for that time period (visible to developers).

What command(s)/time periods would you like me to run for vmstat/iostat info when the problem occurs?

Let's wait with OS level statistics for a while (but something like vmstat 1 10 will be enough to begin with). Please, send your my.cnf file content and the results of:

show create table products\G
show create table notices\G

All INSERTs are waiting for IX lock. I wonder why is that so...

They're all waiting for the IX lock and the hung transaction, C6763433, has it. It says "ACTIVE" and that it's been running for 84 seconds, but it doesn't seem to have been doing much -- the client thinks the query is still running.

Config and schema to follow.

Here's the vmstat 1 10 output during the problem:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  2   9560  39404  45388 3939228    0    0   541   714    1    1 14  1 73 12  0
 1  2   9560  36356  45404 3941936    0    0  1592  6021 3773 3132 16  1 51 32  0
 1  3   9560  34160  45404 3943392    0    0  1716  3646 1739 1418  4  1 70 25  0
 0  2   9560  31924  45404 3945768    0    0  2064   607 1539 1382  3  0 67 30  0
 2  0   9560  33632  45376 3943976    0    0   404   411 1755 1271  9  1 85  5  0
 0  1   9560  43380  45336 3933828    0    0   632    86 2792 1989 11  1 74 15  0
 0  1   9560  43380  45336 3934920    0    0   644    93 2426 1720  7  1 80 13  0
 0  0   9560  42004  45344 3935344    0    0   452    77 2397 1607 10  1 76 13  0
 0  1   9560  42004  45344 3935708    0    0   500    17 1847 1449  3  0 87  9  0
 0  1   9560  40548  45344 3937056    0    0  1044    45 2032 1644  5  0 80 15  0

It appears perhaps SELECTs are stalling as well, they simply don't hold locks which block other queries. I can run "SELECT * FROM tables WHERE id = #{random_id}" and occasionally get a stall.

Here's the query timing information from the perspective of a Rails application:

log output:
  SQL (7.2ms)   SET @@profiling_history_size = 0
  SQL (0.3ms)   SET @@profiling_history_size = 100
  SQL (0.4ms)   SET profiling = 1
  Image Load (3129.5ms)   SELECT * FROM `images` WHERE (`images`.`id` = 12345) LIMIT 1
  SQL (1.0ms)   SET profiling = 0
  SQL (0.4ms)   SHOW PROFILES
  SQL (0.6ms)   SHOW PROFILE BLOCK IO, CONTEXT SWITCHES, CPU, MEMORY, PAGE FAULTS, SWAPS FOR QUERY 56

rails monitored performance:
{:sql_seconds=>3.1402702331543,
 :query_timing=>
  [["SET @@profiling_history_size = 0", 0.00739908218383789],
   ["SET @@profiling_history_size = 100", 0.00040888786315918],
   ["SET profiling = 1", 0.000462055206298828],
   ["SELECT * FROM `images` WHERE (`images`.`id` = 12345)  LIMIT 1",
    3.12966918945312],
   ["SET profiling = 0", 0.00107192993164062],
   ["SHOW PROFILES", 0.000458002090454102],
   ["SHOW PROFILE BLOCK IO, CONTEXT SWITCHES, CPU, MEMORY, PAGE FAULTS, SWAPS FOR QUERY 56",
    0.00080108642578125]],
 :slowest_query=>
  [3.12966918945312,
   "SELECT * FROM `images` WHERE (`images`.`id` = 12345)  LIMIT 1"],
 :total_seconds=>3.269846}

MySQL profiling:
{:sql_seconds=>0.000222,
 :slowest_query=>
  [0.000222, "SELECT * FROM `images` WHERE (`images`.`id` = 12345)  LIMIT 1"],
 :total_seconds=>3.140149}
0.000222
SELECT * FROM `images` WHERE (`images`.`id` = 12345)  LIMIT 1
+--------------------------------+----------+----------+------------+-------------------+---------------------+--------------+---------------+-------------------+-------------------+-------+
| Status                         | Duration | CPU_user | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Page_faults_major | Page_faults_minor | Swaps |
+--------------------------------+----------+----------+------------+-------------------+---------------------+--------------+---------------+-------------------+-------------------+-------+
| starting                       | 0.000013 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting for query cache lock   | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting on query cache mutex   | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| checking query cache for query | 0.000017 |          |            |                   |                     |              |               |                   |                   |       |
| checking permissions           | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| Opening tables                 | 0.000011 |          |            |                   |                     |              |               |                   |                   |       |
| System lock                    | 0.000006 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting for query cache lock   | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting on query cache mutex   | 0.000013 |          |            |                   |                     |              |               |                   |                   |       |
| init                           | 0.000019 |          |            |                   |                     |              |               |                   |                   |       |
| optimizing                     | 0.000005 |          |            |                   |                     |              |               |                   |                   |       |
| statistics                     | 0.000034 |          |            |                   |                     |              |               |                   |                   |       |
| preparing                      | 0.00001  |          |            |                   |                     |              |               |                   |                   |       |
| executing                      | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| Sending data                   | 0.000022 |          |            |                   |                     |              |               |                   |                   |       |
| end                            | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| query end                      | 0.000004 |          |            |                   |                     |              |               |                   |                   |       |
| closing tables                 | 0.000006 |          |            |                   |                     |              |               |                   |                   |       |
| freeing items                  | 0.000006 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting for query cache lock   | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting on query cache mutex   | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| freeing items                  | 0.000019 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting for query cache lock   | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| Waiting on query cache mutex   | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| freeing items                  | 0.000002 |          |            |                   |                     |              |               |                   |                   |       |
| storing result in query cache  | 0.000004 |          |            |                   |                     |              |               |                   |                   |       |
| logging slow query             | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
| cleaning up                    | 0.000003 |          |            |                   |                     |              |               |                   |                   |       |
+--------------------------------+----------+----------+------------+-------------------+---------------------+--------------+---------------+-------------------+-------------------+-------+

You can clearly see here that the amount of time MySQL thinks it's spending running the query and the amount of time the client thinks the query blocks for do not match. Where is all this time being spent?

I downgraded to mysqld 5.5.12 to get the "OS thread id" in TRANSACTIONS output. Using pstack I found that the hung thread is always in:

Thread 32 (Thread 0x58ab8940 (LWP 996)):
#0  0x00002aaaaacd691b in read () from /lib64/libpthread.so.0
#1  0x00000000004f5aeb in my_real_read(st_net*, unsigned long*) ()
#2  0x00000000004f5e33 in my_net_read ()
#3  0x000000000055e925 in do_command(THD*) ()
#4  0x00000000005e9311 in do_handle_one_connection(THD*) ()
#5  0x00000000005e93fa in handle_one_connection ()
#6  0x00002aaaaaccf73d in start_thread () from /lib64/libpthread.so.0
#7  0x00002aaaac1a64bd in clone () from /lib64/libc.so.6

Which seems like it's probably trying to read the next query? But, my client never received the response to the first query. I'm attaching a full pstack from this occurrence, which is particular fascinating because there were *no* other queries running at the time.

mysql hang report (pstack, innodb status, vmstat)

Attachment: mysql_hang_report.txt (text/plain), 55.18 KiB.

Looks similar to bug #29900

Thank you for the feedback.

If my guess is correct and this is similar issue to bug #29900 SELECT should die because of error in middle of execution. Could you please send us error log file and/or attach gdb to running mysqld in time when problem occurs, then run `bt thread apply all`and send us its output?

We (at Webyog) have a user report that may be related. Running our Data Sync tool to a replication master "breaks replication" to slave, user tells. 

The Data Sync tool will execute large BULK INSERT (as large as max_allowed_packet setting allows for up to 16MB).

I could imagine that it is related to this. If the INSERT stalls on slave then it effectively breaks replication I think  (as there is only one replication thread). 

I do not have full details about the environment yet for this case. Only I think that statement based replication is used (as user did not tell otherwise).

I am sorry: gdb command in previous reply should be `bt full thread apply all`

I get: 

(gdb) bt full thread apply all
No symbol "thread" in current context.

Thank you for the feedback.

My bad: thread apply all bt and/or thread apply all bt full

Attached. No running queries according to SHOW FULL PROCESSLIST. The mysql thread id which was hung was 888192. This corresponds to:

---TRANSACTION CF56DFBE, ACTIVE 159 sec, process no 410, OS thread id 1488435520
2 lock struct(s), heap size 376, 1 row lock(s), undo log entries 1
MySQL thread id 888192, query id 42493850 10.10.10.10 myapp

which is thread 0x58b7b940 in the gdb output.

Thank you for the feedback.

Interesting: there is no OS thread id #1488435520 in the backtrace. Will keep trying to guess what is happening.

Sure there is. 1488435520 is the decimal representation of the thread handle value. In the gdb output it uses hex, which is 0x58b7b940. So: "Thread 30 (Thread 0x58b7b940 (LWP 1003)):"

Thank you for the feedback.

Could you please check if "stalling" INSERT really inserted something? Do you use session autocommit=0 or multiple statement transactions in BEGIN ... END/COMMIT format?

The INSERT did actually insert something. I'm using BEGIN/COMMIT.

Thank you for the feedback.

> The INSERT did actually insert something. I'm using BEGIN/COMMIT.

This can be expected behavior: if you run transaction like:

BEGIN
-- some queries
INSERT ...
-- [1] some queries or just sleep
COMMIT

and run SHOW PROCESSLIST at point [1] you will not see INSERT while rows still will be locked. Please check if this is not your case.

You description is accurate, except the client which submitted the INSERT query is still waiting on the response.

Thank you for the feedback.

Could you please describe which client do you use and how exactly it waits on response? Is is MySQL CLI or some application? How do you check response if this is an application?

This is a Rails application where I tried both the mysql2 and mysql gem, which both use libmysqlclient. ActiveRecord logs as soon as the response returns with the amount of time it took.

I mean I need example of code which checks if something is really returned. If you can create minimal test case or at least show how code checks if result of INSERT is returned would be good.

The call is blocking, so it returns when it gets a response from the server. Here is an example:

require 'benchmark'
require 'activerecord'

ActiveRecord::Base.establish_connection(
  :adapter  => "mysql",
  :host     => "the.db.host",
  :database => "myapp_db",
  :username => "user",
  :password => "secret"
)

t = Benchmark.realtime do
  ActiveRecord::Base.connection.execute("INSERT ...")
end
puts "completed in #{t} seconds!"

Thank you for the feedback.

But how do you know if INSERT really inserted some rows if transaction was not committed?

Oh. Well, the INSERT does eventually complete and the row is present - sorry, I thought that's what you meant. I do not know, during the stalled period, what the state of the row is.

Thank you for the feedback.

Still not clear what also happen in transaction with problem INSERT.

Please temporarily turn general query log to on, then catch problem, record exact query which was not returned and send it to us. Or you can also redirect output into table, then get all statements of problem transaction with query:

SELECT * FROM mysql.general_log WHERE thread_id IN (SELECT thread_id FROM mysql.general_log WHERE argument='PROBLEM_QUERY')

The INSERT appears in the general query log as soon as the client submits it. It is not specific to any one INSERT. Anything my app does can suffer from this pause, including SELECTs (but those don't block other queries).

Correct. But I want to know what else appears in general query log from same thread who does INSERT.

It can happen directly after BEGIN, or as the first SELECT on a connection. It is not tied to previous activity of this connection at all.

Thank you for the feedback.

In this case no row should be actually inserted into table if COMMIT was not issued. But you wrote it was inserted.

There can be 3 chances if this is true:

1. You run SELECT immediately after INSERT in same transaction, but this means client is not locked and there is no bug here
2. There is COMMIT in general query log. This means question is why your application does not see it.
3. There is nor COMMIT, neither BEGIN in general query log. This means transaction with INSERT was autocommitted, but still exists in SHOW ENGINE INNODB STATUS. This is definately bug.

To get this fixed we need a way to reproduce problem.

So I want to check which of 3 possible issues you hit.

Please also send us error log file.

I think we had a mis-communication about the INSERTed row's status. I see:

BEGIN
INSERT ...
(long pause)
COMMIT
(row exists)

In any case, I think I've figured out the problem, and it is not strictly a MySQL bug.

tcpdump is indicating the response to the INSERT has been written, but the other side does not ACK it. I believe this is due to occasional packet-loss, as sometimes after a retransmits an ACK is received. These delays are large due to TCP's RTO calculation, and match the sorts of delay I've been seeing in my application. This is EC2's internal network, and does not seem to be specific to any particular machine instance. Quite a catastrophic failure for such a small amount of dropped packets!

Thank you for the feedback.

Closing as "Not a Bug" as network delays is really not MySQL bug.