MySQL Bugs: #61562: Replication slows down after network glitch, needs stop/start slave to recover

Bug #61562	Replication slows down after network glitch, needs stop/start slave to recover
Submitted:	19 Jun 2011 8:46	Modified:	4 Jul 2011 8:38
Reporter:	Kees Jan Koster	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.5.8	OS:	FreeBSD
Assigned to:		CPU Architecture:	Any

Description:
I have a simple master/slave setup. Master database is used for every work, slave is used for daily backups (made using mysqldump -x). The machines are in different data centers. The network of the datacenter that the slave is in is not terribly stable. Every three days to a week I see reconnects of the various services on the slave machine, mysql being among them.

Every week (usually in the night from Saturday to Sunday) the slave suddenly starts to lag behind the master. Of note is that Munin query volume monitoring shows that replication does not stop. It merely slows down. Under normal circumstances the slave processes about 12 qps. In the failure condition it processes about 6 qps. After I manually restart the slave it picks up the slack at a rate of about 30 qps, which tells me that it is not the slave machine's hardware that limits the processing rate. This is show in this Munin graph: http://java-monitor.com/misc/mysql_queries-day.png Note how the qps never drops to zero, as it would do when replication failed completely.

I have a cron job that restarts the slave and gives the slave status. Script and output below. my.cnf and mysql error log are below.

*) How can I configure mysql to restart the connection automatically? I tried slave_net_timeout=30, but that does not make a difference.

*) What additional information can I provide to debug this further? 

I am happy to run experimental versions of MySQL, or compile in extra debugging code if that might help.

Kees Jan

== error log ==

110619  2:00:30 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
110619  2:00:30 [Note] Slave I/O thread killed while reading event
110619  2:00:30 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000650', position 348660066
110619  2:00:41 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000650' at position 342235799, relay log './saffron-relay-bin.000818' position: 342040526
110619  2:00:41 [Note] Slave I/O thread: connected to master 'slave@194.109.206.50:3306',replication started in log 'mysql-bin.000650' at position 348660066

== my.cnf ==

[mysqld]
server-id=5
replicate-do-db="vbulletin"
replicate-do-db="lemongrass"
thread_cache_size=64
innodb_file_per_table
innodb_buffer_pool_size=128M
max_heap_table_size=256M
tmp_table_size=256M
table_open_cache=2048
table_definition_cache=2048
key_buffer_size=512M
slow-query-log
long_query_time=1.000
join_buffer_size=256k
query_cache_size=32M
query_cache_limit=512k
query_cache_min_res_unit=1k
relay-log=saffron-relay-bin
expire_logs_days=30
performance_schema
slave_net_timeout=30

How to repeat:
Set up a slave in a bad network environment, then wait.

Munin graph for mysql qps on slave

Attachment: mysql_queries-day.png (image/png, text), 19.27 KiB.

cron job and cron job output

Attachment: mysql-slave-restart.txt (text/plain), 10.28 KiB.

The graph and the cron job are in files, attached to this issue.

Interesting. I updated the slave database to MySQL 5.5.13 and it has been running without these strange slowdowns for an entire week (previously unheard of).

I have reasonably good hopes that this problem has been resolved with the version upgrade. I'll keep you posted.

Thank you for status update. Please, inform about any results of your further testing with 5.5.13.

I am closing this bug report. Upgrading the slave to version 5.5.13 solved the problem for me. Thanks.