Bug #79185 | Innodb freeze running REPLACE statements | |
---|---|---|---
Submitted: | 9 Nov 2015 11:09 | Modified: | 15 Jan 2016 18:42
Reporter: | Will Bryant | Email Updates: |
Status: | Closed | Impact on me: |
Category: | MySQL Server: InnoDB storage engine | Severity: | S1 (Critical)
Version: | 5.5.46 | OS: | Ubuntu (14.04)
Assigned to: | Shaohua Wang | CPU Architecture: | Any
Tags: | REPLACE hang | |
[9 Nov 2015 11:09]
Will Bryant
[10 Nov 2015 20:11]
MySQL Verification Team
Please provide your my.cnf. Thanks.
[10 Nov 2015 21:16]
Will Bryant
Our config file
Attachment: my.cnf (application/octet-stream, text), 4.69 KiB.
[10 Nov 2015 21:40]
Will Bryant
More data points: with 2 clients it locked up after a few hours, whereas with 4-8 clients doing INSERT only, instead of REPLACE, it ran for hours with no lock-ups.
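[Editorial note] To make the workload above concrete, here is a rough sketch of a multi-client REPLACE stress driver using the MySQL C API. The table t (id INT PRIMARY KEY, val INT), the connection credentials and the client/iteration counts are made up for illustration; the only point is that several connections hammer REPLACE on the same small key range.

/* Hypothetical stress driver for the kind of workload described above.
 * Build roughly as: gcc replace_stress.c $(mysql_config --cflags --libs) -lpthread */
#include <mysql/mysql.h>   /* libmysqlclient-dev; header path may differ per distro */
#include <pthread.h>
#include <stdio.h>

#define N_CLIENTS 8        /* 2-8 clients, as in the report above */
#define N_ROUNDS  1000000

static void *worker(void *arg)
{
    long id = (long) arg;
    char query[128];
    MYSQL *conn;

    mysql_thread_init();                 /* per-thread client library init */
    conn = mysql_init(NULL);
    if (!mysql_real_connect(conn, "127.0.0.1", "test", "test",
                            "test", 3306, NULL, 0)) {
        fprintf(stderr, "connect: %s\n", mysql_error(conn));
        mysql_thread_end();
        return NULL;
    }
    for (long i = 0; i < N_ROUNDS; i++) {
        /* All clients REPLACE into the same small key range, so rows are
         * constantly being deleted and re-inserted.                       */
        snprintf(query, sizeof(query),
                 "REPLACE INTO t (id, val) VALUES (%ld, %ld)",
                 i % 1000, id);
        if (mysql_query(conn, query)) {
            fprintf(stderr, "client %ld: %s\n", id, mysql_error(conn));
            break;
        }
    }
    mysql_close(conn);
    mysql_thread_end();
    return NULL;
}

int main(void)
{
    pthread_t tid[N_CLIENTS];

    mysql_library_init(0, NULL, NULL);   /* thread-safe client init        */
    for (long i = 0; i < N_CLIENTS; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);
    for (long i = 0; i < N_CLIENTS; i++)
        pthread_join(tid[i], NULL);
    mysql_library_end();
    return 0;
}

Pointed at a scratch database, this approximates the multi-client REPLACE load described above; exact schema and concurrency will need adjusting to match the real setup.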
[18 Nov 2015 18:39]
Pawel Boguslawski
Hello,

We've noticed a similar problem while stress testing the Debian mysql update from 5.5.44-0+deb7u1 to 5.5.46-0+deb7u1 (about 200 simultaneous OTRS user sessions simulated with jmeter, two apache application servers hitting one backend mysql database server over a standard TCP connection). Only one in 10 test iterations caused this problem; we couldn't reproduce it again. No such problem occurred while testing previous mysql-server 5.5.x versions in the past.

Symptoms: the mysqld process is alive, with no errors in syslog or in the mysql error log; cpu & disk are idle; existing db otrs connections (selects, updates, etc.) are all frozen; new db connections freeze on otrs (innodb) database queries too. Some status commands freeze as well (e.g. "show engine innodb status"), others do not (e.g. "show processlist", "show status"). "service mysql stop" does not work - "kill -9" was necessary. The problem occurred in our development environment, so we had time to dump some logs and stats which may be helpful in debugging.

Attached please find logs, additional comments and our mysqld config. Other Debian users have also reported a similar problem after the 5.5.46-0+deb7u1 update: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=804214

Thank you & regards,
Pawel
IB Development Team
https://dev.ib.pl/
[18 Nov 2015 18:41]
Pawel Boguslawski
Logs, dumps, comments and mysqld config.
Attachment: mysql-sever-5.5.46-0+deb7u1_crash.tar.gz (application/gzip, text), 17.58 KiB.
[27 Nov 2015 15:31]
MySQL Verification Team
All reporters, please get the thread stack traces at the next hang, using gdb:

gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)
[1 Dec 2015 8:09]
Anton Stekanov
gdb backtrace (5.5.46-0+deb8u1)
Attachment: backtrace.log (text/x-log), 799.70 KiB.
[1 Dec 2015 8:10]
Anton Stekanov
> Shane Bester
> gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

Same issue on 5.5.46-0+deb8u1. Backtrace attached.
[3 Dec 2015 8:17]
Laurynas Biveinis
Having analysed the stack traces of Percona Server occurrences (a recent release that has merged the fix for bug 76135), and having tested a preliminary fix, we are pretty certain this is a regression from bug 76135: mutex lock word and waiters flag accesses are not ordered properly, as explained by Kristian Nielsen at https://lists.launchpad.net/maria-developers/msg07860.html. Note that the Percona Server 5.5 InnoDB mutex implementation is identical to the MySQL 5.5 one. If we patch the server so that, e.g. on x86_64, IB_STRONG_MEMORY_MODEL takes precedence over HAVE_IB_GCC_ATOMIC_TEST_AND_SET, the issue disappears. Obviously this is not a fix for other platforms.
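[Editorial note] For readers unfamiliar with the pattern Kristian Nielsen describes, here is a minimal C sketch of the ordering hazard. It is not the actual InnoDB source; the variable and function names are hypothetical, and only the two shared fields (lock word and waiters flag) and the GCC builtins mirror the discussion above.

/* Minimal model of the unlock/wakeup race described above - NOT the real
 * InnoDB code, just an illustration of the missing ordering.              */
#include <stdbool.h>

static volatile unsigned char lock_word; /* nonzero = mutex held           */
static volatile unsigned long waiters;   /* nonzero = a thread may sleep   */

/* Waiting thread, just before it blocks on the sync event. */
static bool announce_and_retry(void)
{
    waiters = 1;       /* "wake me up when you unlock"                     */
    /* Acquire-only TAS: at the memory-model level nothing orders the
     * store to waiters before this probe of the lock word.                */
    return !__atomic_test_and_set(&lock_word, __ATOMIC_ACQUIRE);
}

/* Releasing thread. */
static bool release_and_check_waiters(void)
{
    __atomic_clear(&lock_word, __ATOMIC_RELEASE);   /* lock_word = 0       */
    /* A release store does not order the load below: the CPU may read
     * waiters before other threads can see lock_word == 0.  Fully fenced
     * primitives (the IB_STRONG_MEMORY_MODEL path mentioned above) prevent
     * this; a __sync_synchronize() here would have the same effect.       */
    return waiters != 0;                            /* 0 => nobody to wake */
}

If the releasing thread reads waiters == 0 while the waiter still sees the old lock word, the waiter goes to sleep and is never signalled - matching the symptom reported here (idle CPU, all connections frozen).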
[3 Dec 2015 11:24]
Shaohua Wang
Hi Will, would you please provide detailed steps and the related scripts to reproduce this bug, so that we can verify our fix once we have a solution? Thank you in advance!
[3 Dec 2015 11:27]
Shaohua Wang
Will, could you also provide your CPU/OS/compiler versions? And is it possible to reproduce it on an HDD?
[16 Dec 2015 15:17]
Panayiotis Gotsis
The issue affects us as well. I am uploading the relevant information. The snapshot of the data that I will be sending was taken after we tried to kill the sql queries.
[16 Dec 2015 15:18]
Panayiotis Gotsis
my.cnf, show processes and trace for bug
Attachment: 5.5.46-bug.tar.gz (application/gzip, text), 15.23 KiB.
[16 Dec 2015 15:20]
Panayiotis Gotsis
I forgot to mention:

Distributor ID: Debian
Description: Debian GNU/Linux 7.9 (wheezy)
Release: 7.9
Codename: wheezy

Linux <hostname> 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u6 x86_64 GNU/Linux
[25 Dec 2015 16:44]
Brendon Colby
I believe we experienced this issue as well, multiple times, after upgrading to 5.5.46-0+deb7u1. mysqld would completely hang / freeze, as would all connection attempts to the server (our primary). I had to kill -9 the process. No errors were reported anywhere that I could find; all logging simply stopped. The entire server crashed at one point as well and I had to reboot it.

Distributor ID: Debian
Description: Debian GNU/Linux 7.9 (wheezy)
Release: 7.9
Codename: wheezy

Linux <primary DB server> 3.2.0-4-amd64 #1 SMP Debian 3.2.73-2+deb7u1 x86_64 GNU/Linux

I downgraded to 5.5.44-0+deb7u1 and have been running fine for two days.

The full process list showed no active queries, but that was seconds before the hang. I had the general log turned on too, but I see nothing that stands out to me. We ARE running REPLACE INTO queries, but the last query logged at the time of the hang was an INSERT statement. I didn't think to keep a session open and couldn't open one at the time of the hang, so I don't know what, if any, queries were stuck.

I will attach a backtrace, our config, and the last extended status from seconds before the hang.
[25 Dec 2015 16:45]
Brendon Colby
backtrace, my.cnf, extended status at time of hang
Attachment: 5.5.46_bug.tar.gz (application/x-gzip, text), 9.68 KiB.
[31 Dec 2015 7:04]
MySQL Verification Team
also: http://bugs.mysql.com/bug.php?id=79815
[15 Jan 2016 18:42]
Daniel Price
Fixed as of the upcoming 5.5.49, 5.5.30, 5.7.12, 5.8.0 releases, and here's the changelog entry:

Running REPLACE operations on multiple connections resulted in a hang.

Thank you for the bug report.
[15 Jan 2016 19:17]
Daniel Price
Correction to the previous comment. This bug is fixed as of the 5.5.49, 5.6.30, 5.7.12, and 5.8.0 releases.
[1 Feb 2016 10:45]
Moritz Winterberg
Hi, we are hit by this problem every other day on one of our production Jessie machines. Could you provide a patch or any other workaround as long as 5.5.49 is not released? Or, if not, could you give a rough hint at what date the fix will be released? Thank you!
[5 Feb 2016 22:02]
Brendon Colby
Moritz - we downgraded to 5.5.44 as a workaround and have been running solid ever since.
[17 Mar 2016 16:18]
Inaam Rana
Laurynas, thanks for the insight. We were badly hurt by this. We are going to fall back to IB_STRONG_MEMORY_MODEL until 5.6.30 is out. I'll post an update here on whether it solved the problem for us.
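[Editorial note] For anyone considering the same workaround, a rough sketch of the precedence change Laurynas describes earlier in this report is below. Apart from IB_STRONG_MEMORY_MODEL and HAVE_IB_GCC_ATOMIC_TEST_AND_SET, which come from this thread, the macro names are hypothetical stand-ins; this is not the actual InnoDB header.

/* Hypothetical sketch, not the real InnoDB source: let the x86/x86_64
 * define win the #if chain, so the build keeps the old fully fenced
 * primitives instead of the acquire/release pair added for bug 76135.     */
#if defined(IB_STRONG_MEMORY_MODEL)
/* Old x86 path: an xchg-style test-and-set is used both to take the lock
 * and to clear it, and xchg acts as a full memory barrier on x86.         */
# define IB_MUTEX_TEST_AND_SET(lw)     __sync_lock_test_and_set((lw), 1)
# define IB_MUTEX_RESET_LOCK_WORD(lw)  __sync_lock_test_and_set((lw), 0)
#elif defined(HAVE_IB_GCC_ATOMIC_TEST_AND_SET)
/* Bug-76135 path: acquire-only TAS plus release-only clear, the pair whose
 * missing mutual ordering with the waiters flag is discussed above.       */
# define IB_MUTEX_TEST_AND_SET(lw)     __atomic_test_and_set((lw), __ATOMIC_ACQUIRE)
# define IB_MUTEX_RESET_LOCK_WORD(lw)  __atomic_clear((lw), __ATOMIC_RELEASE)
#endif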
[17 Mar 2016 19:39]
Laurynas Biveinis
Inaam, thank you. I'd also highly recommend fixing bug 79477, even if only for 5.8, so that bugs like this are caught sooner instead of resulting in a server that stalls for sub-second intervals and hangs with some probability.
[31 Mar 2016 13:46]
Inaam Rana
Falling back to IB_STRONG_MEMORY_MODEL indeed solved the problem. We were ending up in lock-ups quite consistently on some of our heavily loaded clusters (almost a node every couple of hours). Now we have been running for a few days without hitting this issue.
[31 Mar 2016 13:57]
Christopher Lörken
Any news on when the fixed versions will be released? This is quite a bad regression bug which forced us to downgrade, and it has now been fixed but unreleased for nearly 3 months... It would be nice to know...