Bug #41811 stopping slave on quiesced server causes longterm IO spike.
Submitted: 31 Dec 2008 20:29 Modified: 27 Feb 0:19
Reporter: Laine Campbell
Status: Not a Bug
Category:Server Severity:S5 (Performance)
Version:5.0.68-enterprise-gpl-log OS:Linux (centos 2.6.18-92.1.10.el5 x86_64)
Assigned to: Target Version:
Tags: replication performance

[31 Dec 2008 20:29] Laine Campbell
Description:
We are in a master/master solution with all reads/writes going to one master (primary)
and second server used for backups and failover (secondary). 

On 2 of our 31 master/master pairs, when we issue stop slave, both slave threads stop
immediately, but IOwait on the CPU doubles - triples, and continues until "slave start"
is issued.  No other threads are connected (except the replication connection from the
primary master) or performing work.  The same issue occurs if the DB is brought up with
slave skip start.  Once the slave thread is restarted, IO goes back to normal.

Centos 5, running mysql 5.0.68 ent
Box 1: 1 dual core CPU, 8 GB RAM,  

I have reduced global innodb_max_dirty_pages_pct and gotten dirty buffers down to 0 and
the io still runs hot.

Example iostat -x during this period:
Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await 
svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
sdb               0.00     8.50 16.00 101.50   832.00  6644.00    63.63     2.04   17.52 
 8.44  99.15
dm-0              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
dm-3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
dm-4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00  
0.00   0.00
dm-5              0.00     0.00 16.00 110.50   832.00  6868.00    60.87     2.04   16.28 
 7.84  99.15

Using show innodb status, I see that the log_buffer still seems to be flushing  and
sequence # incrementing.  Here are two samples, with ~60 seconds between while slave is
stopped.

---
LOG
---
Log sequence number 97 2015658440
Log flushed up to   97 2015646953
Last checkpoint at  97 1926146586
0 pending log writes, 0 pending chkp writes
205885465 log i/o's done, 2.33 log i/o's/second

---
LOG
---
Log sequence number 97 2017042788
Log flushed up to   97 2017017580
Last checkpoint at  97 1927895762
0 pending log writes, 0 pending chkp writes
205885643 log i/o's done, 2.21 log i/o's/second

Pertinent my.cnf parameters (consistent across other boxes on other clusters that do not
have these problems)

+---------------------------------+------------------------+
| innodb_additional_mem_pool_size | 33554432               |
| innodb_autoextend_increment     | 8                      |
| innodb_buffer_pool_awe_mem_mb   | 0                      |
| innodb_buffer_pool_size         | 5368709120             |
| innodb_checksums                | ON                     |
| innodb_commit_concurrency       | 0                      |
| innodb_concurrency_tickets      | 500                    |
| innodb_data_file_path           | ibdata1:10M:autoextend |
| innodb_data_home_dir            | /var/lib/mysql         |
| innodb_adaptive_hash_index      | ON                     |
| innodb_doublewrite              | ON                     |
| innodb_fast_shutdown            | 1                      |
| innodb_file_io_threads          | 4                      |
| innodb_file_per_table           | ON                     |
| innodb_flush_log_at_trx_commit  | 2                      |
| innodb_flush_method             | O_DIRECT               |
| innodb_force_recovery           | 0                      |
| innodb_lock_wait_timeout        | 120                    |
| innodb_locks_unsafe_for_binlog  | OFF                    |
| innodb_log_arch_dir             |                        |
| innodb_log_archive              | OFF                    |
| innodb_log_buffer_size          | 8388608                |
| innodb_log_file_size            | 1363148800             |
| innodb_log_files_in_group       | 3                      |
| innodb_log_group_home_dir       | ./                     |
| innodb_max_dirty_pages_pct      | 25                     |
| innodb_max_purge_lag            | 0                      |
| innodb_mirrored_log_groups      | 1                      |
| innodb_open_files               | 300                    |
| innodb_rollback_on_timeout      | OFF                    |
| innodb_support_xa               | ON                     |
| innodb_sync_spin_loops          | 20                     |
| innodb_table_locks              | ON                     |
| innodb_thread_concurrency       | 0                      |
| innodb_thread_sleep_delay       | 10000                  |

So, why does IO go through the roof on these two servers when the slave is stopped and no
other connections occur?

Laine

How to repeat:
It only happens on two of our 31 secondary servers, but happens everytime we stop the
slave, and continues until slave is started.
[31 Dec 2008 20:31] Laine Campbell
Additional comments, our snapshots always fill very quickly on these two servers while
taking a snapshot and running backups, indicating that there is io activity even with no
dirty pages and no connected threads.
[7 Jan 17:48] Laine Campbell
reducing innodb log file size from 1300 to 256 makes no significant change.  IO spikes
still go to 50% iowait for the period that the replication threads are down.
[27 Feb 0:19] Laine Campbell
this was due to an insert buffer that could not keep up (on a passive slave).  Please
close.