MySQL Bugs: #20408: Switching binlog can cause mysql to hang

Bug #20408	Switching binlog can cause mysql to hang
Submitted:	12 Jun 2006 19:48	Modified:	16 Dec 2009 10:00
Reporter:	Andreas Oesterer	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.18, 5.1.37-ndb-7.0.8a	OS:	Linux (Linux 64 bit)
Assigned to:		CPU Architecture:	Any

Description:
We are running mysql-max-5.0.18-linux-x86_64-glibc23 on CentOs.

We started to notice that switching the binlog (once it reaches the 1GB default) can cause the master DB to hang. One can still connect to the DB but no queries will return and the DB cannot be shutdown anymore. Only a "kill -9" works to terminate the DB and restart it.

This seemed to happen more frequenlty when the DB load is high. So we created a nighly job to manually flush the logs during down times using "mysqladmin flush-logs"

However, it still happens every 2-3 weeks that the DB will hang when the flush is executed.

Is this a known issue? Any suggested workarounds?

Thanks, Andreas

How to repeat:
Not easy to repeat. Our binlong grows by about 1GB per day so usually we require one bin long switch. This leads to one crash every 2-3 weeks.

Thank you for a problem report. I know one similar bug report, bug #17733. But it is about crashes, not hangs. Can you find anything similar to your case?

Please, send

getconf GNU_LIBC_VERSION

results from your CentOS.

glibc 2.3.4

Bug 17733 might be related. But there are 2 differences:

1) The server does not "crash" it just "hangs". Of course this does not make any real difference for a production environment. A crash would almost be more desirable because it's easier to detect and could probably be fixed through a restart.

2)  It does not happen on every binlog switch. It only happens one of every 10-20 times. I'm not sure if I could repeat it by calling flush-logs repeatedly. But there is not disgnostic output that I could provide to you.

Follow-up question:

I'm trying to set the max_binlog_size higher to postpone the potential faliure to a time where it can be handled more graceful. However mysql ignores my setting in my.conf:

max_binlog_size=1500M

When I look at the variable after startup, it still reports the default of 1073741824. Is there a 1 GB maximum?

Please, try to repeat with a newer version, 5.0.22, and inform about the results. Wha is the value of expire_logs_days server variable in you case?

As for your question about binlog size, it is clearly documented (http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html):

"- max_binlog_size

If a write to the binary log causes the current log file size to exceed the value of this variable, the server rotates the binary logs (closes the current file and opens the next one). You cannot set this variable to more than 1GB or to less than 4096 bytes. The default value is 1GB.

A transaction is written in one chunk to the binary log, so it is never split between several binary logs. Therefore, if you have big transactions, you might see binary logs larger than max_binlog_size."

expire_logs_days is set to 0

I switched to nighly shutdowns of the DB to avoid the DB hanging on the flush command. The log switch that occurs during a shutdown / startup cycle does not have the same problem. However my issue remains if the log reaches the 1GB limit before nightly bounce. Also, a limit of less than 1GB did not make any difference.

Do you have any slaves connected to this master server (just to be sure)? 

Can you try to use newer version, 5.0.22? Is this problem repeatable with many FLUSH LOGS commands run sequentially?

I was not able to repeat any hangs with simple script like:

a=0; 
while [ $a -le 1000 ]; 
do
   echo $a;
   bin/mysql -u root test -e "flush logs";
   let a=a+1;
done

with 5.0.25.

Yes there are several slaves connected to the master.

DB load seems to increase the likelyhood of the DB hanging on the flush. I'm not too surprised that a repeated flush on an idle or almost idle DB does not cause the issue.

Andreas,
is there any way we can exchange more info on this "bug". 
We are not in the position to shutdown our database at all so would not mind swapping some more info on this.

my email is brian@soonr.com

Bug #21653 was marked as a duplicate of this one.

Please, try to repeat with a newer version, 5.0.24, and inform about the results. I was not able to repeat neither with mysqladmin flusah-logs in shell script proposed before, nor with C API program that created up to 200 concurrent threads, all performing "flush logs" for up to 300 seconds.

How many slaves do you have connected? Please, send my.cnf content from master and typical slave.

we have upgraded to 5.0.24 on Friday and are waiting now to see if we experience same problem.

my.cnf is as below.

I dont think you can recreate this problem by doing a flush-log many times. We eperience this problem when MySQL is under heavy load - rather than when there are many clients doing flush-logs. 

[mysqld]
#1
lower_case_table_names=1

#2
sql-mode="NO_BACKSLASH_ESCAPES"

#3
default-storage-engine=INNODB

#4
default-character-set=UTF8

#5,6, 112
datadir = /var/lib/mysql
innodb_data_home_dir = /var/lib/mysql
innodb_data_file_path = ibdata1:1282M:autoextend
innodb_autoextend_increment = 500M
innodb_buffer_pool_size=1792M
innodb_log_file_size=128M
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_log_arch_dir = /var/lib/mysql/arch
innodb_log_files_in_group = 2

key_buffer_size = 128M
#7
log-bin=mysql01
max-binlog-size=1024M
server-id=1
binlog-do-db = ir_db_dbo

#8
innodb_file_per_table

#9
# see http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html
innodb_flush_log_at_trx_commit = 0

#10  Replication

#10  only ir_db_dbo to be replicated

# wait timeout = 15 minutes
wait_timeout=900
thread_cache_size=40
max_connections=180
max_user_connections=140

mysql01:~ #

Bug #21965 may be a duplicate of this bug.

Do replication clients issue frequent "show master logs" commands? If so, that could a plausible explanation for this bug and offer a possible workaround solution at the same time:

-stop slaves
-flush logs
-start slaves

Following my last comment above and the two previous comments

I dont believe this to be a bug related to 21965.

reasons
1) Our replication slave is at the moment switched off and has been for several months pending another bug.
2) At the risk of tempting fate - our problem now seems to have gone away.

We upgraded to version 5.0.24 on 24th August. We have not experienced the problem since. BUT I dont think this is what has fixed the problem.

During the upgrade I also created an index on a column in one of our tables that was being used as the only where clause on may update statements
e.g. update account set logincount=logincount+1 where username=?

Unfortunately, for some reason the index on username (we have an id as PK) had been dropped (quite some time ago from what I can see). Whilst we saw no real performance degradation we were getting some deadlocks.

As Innodb will lock the whole table because of this clause (as far as I can understand from the InnoDB documnentation - row locking is done on indexes - no index so table lock).

So I think the situation with the bin long hanging was due to some complex "deadlock" between transactions that were queuing up with each other because of contention for the account table lock - this was causing commits to hang - so the bin log could not rotate.

Obviously the above could be just a wild theory of mine. BUT if you have this problem and cannot upgrade to 5.0.24 or the upgrade does not help - then check for missing indexes on tables that you update frequently.

Now of course we should not have upgraded two things at once but this production downtime for the upgrade is one of the few chances I get to upgrade our production database - so now I am left to guessing what has "fixed" the problem.

I say "fixed" as I am sure now I have written about it the problem will occur again :=))

Please, execute SHOW INNODB STATUS before switching the binlog in case of similar hangs will happen again, and send the results. If we have a "deadlock"/locking problem in InnoDB as a reason, this will help to get more information about it.

hi

I did check the deadlock situation when the database had hung. (in the past - we STILL not experienced the problem again - since making the changes)

There was a deadlock - but it was very old 

but one strange this was that the victim was set to ?

So I guess this might contradict my earlier post about the dadlock being the issue !!!

But I wanted to include that change we had made in case others were still experiencing the problem even after upgrading to 5.0.24.

brian

I'm seeing a similar hanging situation, but my binlogs aren't even 10MB, and the dataserver is practically asleep.  However, I can consistently cause the dataserver to hang if I issue a GRANT statement after not issued one for about a week of running.  After the hang, I have to kill -9 to stop the dataserver as well.  If I first try kill -TERM, I see this in the server_errors.log:

061008 21:20:26 [Note] /opt/mysql/bin/mysqld: Normal shutdown

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109266  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109265  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109264  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109263  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109262  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109261  user: 'cgi'

061008 21:20:28 [Warning] /opt/mysql/bin/mysqld: Forcing close of thread 109258  user: 'db_installer'

And then it hangs and doesn't respond to anything but the kill -9.  I'm currently running 5.0.24a-standard-log, but I've seen the same thing with 5.1.7-beta-log and 5.1.11-beta-log.  I'm also using InnoDB tables.  Anyway, maybe this is related to Andreas' problem, but as I said, my binlogs aren't anywhere near reaching their max of 1GB.

Please, try to repeat with a newer version, 5.0.27. In case of simlar hangs, please, send the results of:

getconf GNU_LIBC_VERSION
getconf GNU_LIBPTHREAD_VERSION

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Please try with current version 5.0.45 and if bug stil exists provide information which Valeriy requested.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Bug can be closed.

We have been running for several months on 5.0.37 and the problem did not re-occur.

The bug is not repeatable any more.

We are experiencing this exact problem currently on our production and dev servers. All that I need to reproduce the bug is to run write statements until the log fills up.

We're on RHEL 64-bit.

I was lucky enough to have an active connection during a hang yesterday and will attach the output of show status, show innodb status, show full processlist, the my.cnf and some system information.

my.cnf

Attachment: mysql_my.cnf (application/octet-stream, text), 1.49 KiB.

innodb status

Attachment: mysql_show_innodb_status.txt (text/plain), 11.99 KiB.

show full processlist

Attachment: mysql_show_processes.txt (text/plain), 8.47 KiB.

show status

Attachment: mysql_show_status.txt (text/plain), 16.61 KiB.

some general system info

Attachment: mysql_sys_info.txt (text/plain), 1.05 KiB.

j c,

thank you for the feedback. Which version of MySQL server do you use?

doh! Knew I'd forget something.

Server version: 5.1.37-ndb-7.0.8a-cluster-gpl-log MySQL Cluster Server (GPL)

We'll be upgrading to 7.0.9 very soon.

Just a quick note:
We've upgraded to the lastest GA release and it's still happening. Frequently.

Server version: 5.1.39-ndb-7.0.9-cluster-gpl-log MySQL Cluster Server (GPL)

Thank you for the feedback.

I can not repeat described behavior with generic test. Please send us output of SHOW CREATE TABLE medianet_track and album_credit_update

Please also send CREATE statement for table connectioncheck.

Show create table statements for the tables asked for.

Attachment: mysql_show_create_tables.txt (text/plain), 6.40 KiB.

Yes,

there are private messages.

Thank you for the feedback.

Yes, I can do it almost every weekday (Mon - Fri) from 8.00 till 13.00 Helsinki time. If this time is not convenient for you let us know and we schedule special time.

IDs sent via email.

As yesterday days show this is duplicate of bug #41751

Still experiencing this bug as of 5.1.56-ndb-7.1.18-cluster-gpl-log

The symptoms match http://bugs.mysql.com/bug.php?id=21653 almost exactly, normal startup and operation with expire_log_days=7, our binlog size is set to 1GB. When a log becomes full and a new log is created, the system will hang every time. 

By hang I mean, all processing will stop, but the server will continue to accept connections. 

We have a mixed cluster and innoDB master/slave environment with 8 servers. Our load is quite high which may explain why it's happening to us every time. 

We cannot repeat this on our test systems, only in production. In January we upgraded to the version mentioned above to try and get past this bug which was happening previously. 

Our current workaround is to set expire_log_days = 0, and manually delete old logs and re-generate the binlog.index file. This works well, but I thought it may help someone to know that this doesn't seem to be fixed.

5.1.56 Status

Attachment: 20120322_mysql_issue_status.txt (text/plain), 140.98 KiB.

5.1.56 variables

Attachment: 20120322_mysql_issue_variables.txt (text/plain), 96.81 KiB.