MySQL Bugs: #41811: stopping slave on quiesced server causes longterm IO spike.

Bug #41811	stopping slave on quiesced server causes longterm IO spike.
Submitted:	31 Dec 2008 19:29	Modified:	26 Feb 2009 23:19
Reporter:	Laine Campbell	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S5 (Performance)
Version:	5.0.68-enterprise-gpl-log	OS:	Linux (centos 2.6.18-92.1.10.el5 x86_64)
Assigned to:		CPU Architecture:	Any
Tags:	replication performance

Description:
We are in a master/master solution with all reads/writes going to one master (primary) and second server used for backups and failover (secondary). 

On 2 of our 31 master/master pairs, when we issue stop slave, both slave threads stop immediately, but IOwait on the CPU doubles - triples, and continues until "slave start" is issued.  No other threads are connected (except the replication connection from the primary master) or performing work.  The same issue occurs if the DB is brought up with slave skip start.  Once the slave thread is restarted, IO goes back to normal.

Centos 5, running mysql 5.0.68 ent
Box 1: 1 dual core CPU, 8 GB RAM,  

I have reduced global innodb_max_dirty_pages_pct and gotten dirty buffers down to 0 and the io still runs hot.

Example iostat -x during this period:
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     8.50 16.00 101.50   832.00  6644.00    63.63     2.04   17.52   8.44  99.15
dm-0              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00 16.00 110.50   832.00  6868.00    60.87     2.04   16.28   7.84  99.15

Using show innodb status, I see that the log_buffer still seems to be flushing  and sequence # incrementing.  Here are two samples, with ~60 seconds between while slave is stopped.

---
LOG
---
Log sequence number 97 2015658440
Log flushed up to   97 2015646953
Last checkpoint at  97 1926146586
0 pending log writes, 0 pending chkp writes
205885465 log i/o's done, 2.33 log i/o's/second

---
LOG
---
Log sequence number 97 2017042788
Log flushed up to   97 2017017580
Last checkpoint at  97 1927895762
0 pending log writes, 0 pending chkp writes
205885643 log i/o's done, 2.21 log i/o's/second

Pertinent my.cnf parameters (consistent across other boxes on other clusters that do not have these problems)

+---------------------------------+------------------------+
| innodb_additional_mem_pool_size | 33554432               |
| innodb_autoextend_increment     | 8                      |
| innodb_buffer_pool_awe_mem_mb   | 0                      |
| innodb_buffer_pool_size         | 5368709120             |
| innodb_checksums                | ON                     |
| innodb_commit_concurrency       | 0                      |
| innodb_concurrency_tickets      | 500                    |
| innodb_data_file_path           | ibdata1:10M:autoextend |
| innodb_data_home_dir            | /var/lib/mysql         |
| innodb_adaptive_hash_index      | ON                     |
| innodb_doublewrite              | ON                     |
| innodb_fast_shutdown            | 1                      |
| innodb_file_io_threads          | 4                      |
| innodb_file_per_table           | ON                     |
| innodb_flush_log_at_trx_commit  | 2                      |
| innodb_flush_method             | O_DIRECT               |
| innodb_force_recovery           | 0                      |
| innodb_lock_wait_timeout        | 120                    |
| innodb_locks_unsafe_for_binlog  | OFF                    |
| innodb_log_arch_dir             |                        |
| innodb_log_archive              | OFF                    |
| innodb_log_buffer_size          | 8388608                |
| innodb_log_file_size            | 1363148800             |
| innodb_log_files_in_group       | 3                      |
| innodb_log_group_home_dir       | ./                     |
| innodb_max_dirty_pages_pct      | 25                     |
| innodb_max_purge_lag            | 0                      |
| innodb_mirrored_log_groups      | 1                      |
| innodb_open_files               | 300                    |
| innodb_rollback_on_timeout      | OFF                    |
| innodb_support_xa               | ON                     |
| innodb_sync_spin_loops          | 20                     |
| innodb_table_locks              | ON                     |
| innodb_thread_concurrency       | 0                      |
| innodb_thread_sleep_delay       | 10000                  |

So, why does IO go through the roof on these two servers when the slave is stopped and no other connections occur?

Laine

How to repeat:
It only happens on two of our 31 secondary servers, but happens everytime we stop the slave, and continues until slave is started.

Additional comments, our snapshots always fill very quickly on these two servers while taking a snapshot and running backups, indicating that there is io activity even with no dirty pages and no connected threads.

reducing innodb log file size from 1300 to 256 makes no significant change.  IO spikes still go to 50% iowait for the period that the replication threads are down.

this was due to an insert buffer that could not keep up (on a passive slave).  Please close.