Bug #117740 Please consider adding Slow IO Counters.
Submitted: 18 Mar 12:35 Modified: 8 May 14:47
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S4 (Feature request)
Version: 9.0
OS: Any
CPU Architecture: Any
Tags: Contribution

[18 Mar 12:35] Jean-François Gagné
Description:
Hi,

I am opening this feature request for suggesting adding slow IO Counters.  Below, I start by giving context about why I think this is needed, and then I list counters that could fulfill this need.

When running MySQL on "complex" block devices (in the cloud on remote storage, on-prem on SANs, or others), it is not uncommon to get degraded query performance because of convoluted failure modes of the IO subsystem (shared devices overloaded, RAID rebuilding, increased tail latencies, rate limiting, ...).  Pinpointing these degradations to the IO subsystem is usually not straightforward, because these complex block devices do not always report such failures, and even if they do, MySQL Engineers do not always have access to this information because these devices are in the scope of another team (AWS or GCP for EBS or PV, a different storage team on-prem, ...).  The goal of the Slow IO Counters suggested in this feature request is to provide MySQL Engineers with evidence of such IO subsystem failures, and to allow automation to deal with them (failover of a degraded primary to a standby, removing a degraded read replica from a load balancer, ...).

I see the following counters as useful to diagnose and react to misbehaving IO subsystem:

1. Slow InnoDB Sync Reads : for fetching a page into the Buffer Pool (for a SELECT or a DML).  I use the name "Sync" Reads to exclude read-ahead and read-ahead random (I thought of naming this "Direct" instead of "Sync", but I chose Sync because Direct brings confusion with O_DIRECT, even though Sync can be confused with cache flushing / sync-ing).  Read-aheads are excluded because reading a whole extent is not the same as reading a single page, and the definition of "Slow" might be different for each case.

2. Slow Binary Log Writes / Flushes : for writes / flushes to the binary logs.  Maybe we do not need write counters because these are usually cached.  Maybe this should exclude Flushes for binary log rotation, because these are "bigger" than Flushes for transactions (sync_binlog = 1).  Arguably, the definition of slow is complicated here, as transactions are not equal: some are small and some are big, which can impact IO latency; and Group Commit might make this even more complicated because of a single flush covering many transactions.

3. Slow InnoDB Redo Log Writes / Flushes : same as #2 above, but for the Redo Logs.

4. Slow InnoDB Page Cleaning Writes / Flushes : for writes and flushes during Page Cleaning.  This would probably include the doublewrite buffer, but this might need refinement.  This counter might not be absolutely needed because the majority of misbehaving IO subsystems should be caught by #1, #2 and #3, but it might be more reliable than #2 and #3 because the definition of slow is easier here.

5. Slow InnoDB Single Page Flush Writes / Flushes : same as #4 above, but for the Single Page Flush (needing a free page while none are in the free list).

6. Slow InnoDB Read-Aheads : same as #1 above, but for read-aheads.  This counter might not be absolutely needed because the majority of misbehaving IO subsystems should be caught by #1.

7. Maybe more...

Setting this as Category Server instead of InnoDB because some counters (binlog writes / flushes) are not in InnoDB.

I am planning to submit a patch for Slow InnoDB Sync Reads very soon (I just need the bug number to complete things).  I might submit other patches for other counters, but I might also leave this to someone else (I have not yet started work on other counters).

Many thanks for considering this feature request,

Jean-François Gagné

How to repeat:
(N/A because feature request)

Suggested fix:
(See description and the patch I will contribute very soon)
[18 Mar 12:42] MySQL Verification Team
Hello Jean-François,

Thank you for the feature request!

regards,
Umesh
[18 Mar 12:47] MySQL Verification Team
Hello Jean-François,

For now I'm setting category "InnoDB", but I understood this feature request covers other modules as well. Thank you.

regards,
Umesh
[18 Mar 13:29] J-F Aiven Gagné
More about this contribution in https://github.com/jfg956/mysql-server/pull/17

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug117740.patch (application/octet-stream, text), 24.63 KiB.

[18 Mar 13:30] J-F Aiven Gagné
Some notes about my above contribution extracted from https://github.com/jfg956/mysql-server/pull/17

This PR merges on 9.2.0. Adapting it to 8.4 and 8.0 should be little work.

For implementing counters for Slow InnoDB Sync Reads, this PR introduces a new global variable : innodb_buffer_pool_read_sync_slow_io_threshold_usec. The default value is 1 hour, which should not trigger any increase of the slow counters. For monitoring slow InnoDB Sync Reads, this threshold should be set in such a way that the counters do not increase most of the time, and increase significantly when the IO subsystem is misbehaving (I cannot tell you exactly how to set this up because it will depend on your IO subsystem, but a value close to p99 of its IO latency might be good). Note that occasional increase of the counters should not be interpreted as a misbehaving IO subsystem because tail latencies will always happen.

The InnoDB Metrics names of the four counters introduced by this PR, with the matching global status variables in parentheses, are:

- buf_pool_reads_sync_io_count (innodb_buffer_pool_reads_sync_io_count);

- buf_pool_reads_sync_io_wait_usec (innodb_buffer_pool_reads_sync_io_wait_usec);

- buf_pool_reads_sync_io_slow_count (innodb_buffer_pool_reads_sync_io_slow_count);

- buf_pool_reads_sync_io_slow_wait_usec (innodb_buffer_pool_reads_sync_io_slow_wait_usec).

The first two counters increase for all Sync Read IOs, and the last two only when the wait time of an IO is above the threshold.
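As a rough illustration of the accounting described above (a sketch only; the struct and member names are hypothetical, not the patch's actual code), every sync read IO bumps the first pair of counters, and only IOs whose wait exceeds the threshold also bump the slow pair:

```cpp
#include <cstdint>

// Hypothetical sketch of the four-counter accounting: count/wait_usec
// track all sync read IOs, slow_count/slow_wait_usec only those whose
// wait time exceeds the configured threshold.
struct SyncReadIoCounters {
  uint64_t count = 0;
  uint64_t wait_usec = 0;
  uint64_t slow_count = 0;
  uint64_t slow_wait_usec = 0;

  void account(uint64_t io_wait_usec, uint64_t threshold_usec) {
    count += 1;
    wait_usec += io_wait_usec;
    if (io_wait_usec > threshold_usec) {
      slow_count += 1;
      slow_wait_usec += io_wait_usec;
    }
  }
};
```

With the default threshold of 1 hour, the slow branch would essentially never trigger, which matches the "should not trigger any increase" behaviour described above.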

I think that this PR should not bring any significant performance degradation, but I have yet to fully validate this. I might do this in the next weeks / months and add details in the bug and in this PR. Also in the next weeks / months and if time allows, I might submit an improved version of this PR with tests, and with adjustments taking into account the feedback I might have received.
[18 Mar 13:33] Jean-François Gagné
(adding the contribution tag)
[26 Mar 16:35] Mark Callaghan
JFG - I support this change.

1) I also want monitoring for fsync/fdatasync, and even better if binlog/InnoDB usage of that is split into separate counters, because binlog fsync has intermittently high latency with some filesystems of the ext family

2) I prefer response time histograms rather than one counter for high latency responses. Although, histograms introduce other problems:
a) how do you display them
b) how do you avoid too many buckets

For 2a), how to display them, they can be flattened with one counter per bucket. And/or they can be stored in an information_schema table

For 2b) this is less of an issue now that spinning disks are not frequently used, but the concern is that the spread in latencies can be large if the buckets are defined statically (local SSD is fast, cloud SSD is somewhat fast, disk is slow). Perhaps a my.cnf option to adjust the boundaries would help.
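One possible shape for such a histogram (a sketch, not an existing MySQL facility; names are illustrative): bucket boundaries come from a setting such as a my.cnf option, and each bucket is a plain counter, so the whole histogram can be flattened into one status variable per bucket as suggested in 2a).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Latency histogram with configurable bucket boundaries. counts has
// one extra bucket for samples above the largest boundary.
struct LatencyHistogram {
  std::vector<uint64_t> boundaries_usec;  // ascending upper bounds
  std::vector<uint64_t> counts;

  explicit LatencyHistogram(std::vector<uint64_t> bounds)
      : boundaries_usec(std::move(bounds)),
        counts(boundaries_usec.size() + 1, 0) {}

  void record(uint64_t usec) {
    // The first boundary strictly greater than the sample picks the bucket.
    const auto it = std::upper_bound(boundaries_usec.begin(),
                                     boundaries_usec.end(), usec);
    counts[static_cast<size_t>(it - boundaries_usec.begin())] += 1;
  }
};
```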
[5 Apr 3:34] Rick James
Instead of having an instantaneous number (or a histogram), keep an "exponential moving average".  That stores only one number, but it changes over time to indicate that the timings are changing.  The math requires only a subtract, a multiply, and an add.
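For concreteness, the update described above (one stored number, refreshed with a subtract, a multiply, and an add) could look like this sketch:

```cpp
// Exponential moving average: a single stored number that drifts
// toward recent samples. alpha in (0, 1] controls how fast old
// samples are forgotten.
double ema_update(double avg, double sample, double alpha) {
  return avg + alpha * (sample - avg);
}
```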

As a person who is often given a single snapshot of STATUS and VARIABLES, I beg for enough information to make an educated guess of whether a STATUS value is "too high" and then look at things like "Questions/Uptime" (etc) to decide where to look next.

That is, please don't invent a new STATUS without providing a simple cookbook that says (1) what value is too high (or too low) and (2) what to do to investigate further.
[5 Apr 3:38] Rick James
I suspect "buf_pool_reads_sync_io_wait_usec" is the value to do the averaging over.  (See my previous comment.)
[7 May 17:46] Jakub Lopuszanski
Hello JFG,
thank you for very nice description, motivation, code comments and code.

I have a few high-level questions, though.

1. Have you seen https://dev.mysql.com/doc/refman/9.3/en/performance-schema-file-summary-tables.html - I realise there is no differentiation between slow vs fast - all you get is MIN, MAX, AVG, CNT. But I was thinking that perhaps this is the table we should extend with new features like "99th percentile" instead? Or perhaps simply monitoring MAX or AVG alone would already suffice for your needs?

2. Have you seen
https://dev.mysql.com/doc/refman/9.3/en/performance-schema-setup-consumers-table.html and 
https://dev.mysql.com/doc/refman/9.3/en/performance-schema-events-waits-current-table.html ?

AFAIU you can do:

UPDATE performance_schema.setup_consumers SET ENABLED="YES" WHERE NAME="events_waits_current";

And then query the performance_schema.events_waits_current for individual "events" related to IO - AFAIU each single read or write on each file is reported this way:
            THREAD_ID: 15
             EVENT_ID: 67
         END_EVENT_ID: 67
           EVENT_NAME: wait/io/file/innodb/innodb_data_file
               SOURCE: fil0fil.cc:7823
          TIMER_START: 224652271154586
            TIMER_END: 224652310505066
           TIMER_WAIT: 39350480
                SPINS: NULL
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: C:\ade\mysql\2\mysql-bin\data\ibdata1
           INDEX_NAME: NULL
          OBJECT_TYPE: FILE
OBJECT_INSTANCE_BEGIN: 2046180445888
     NESTING_EVENT_ID: NULL
   NESTING_EVENT_TYPE: NULL
            OPERATION: write
      NUMBER_OF_BYTES: 16384
                FLAGS: NULL
This is so fine-grained that you can reconstruct not just the whole distribution, but the actual timeline of what was happening.
I am not sure what is the performance impact of enabling it, though. Have you tried?

3. Why did you have to resort to use `innodb_session_t`? I mean, why not just do the simplest possible thing like:
```
+ const auto before = now();
  count = buf_read_page_low(&err, true, 0, BUF_READ_ANY_PAGE, page_id,
                            page_size, false);
+ const auto after = now();
+ if (count) {
...
```
One reason I could imagine is that you really want to measure just the IO, not the time it takes to acquire a page from BP, or to decompress the raw bytes.
But...if your goal is to just focus on really slow IOs then does this small overhead really matter? 
I ask because I find this usage of the very high-level concept of a user session in the lowest-level function dealing with files a bit "ugly", and wonder if there is a good rationale for this sacrifice.
(An alternative, if we are forced, could be to use `thread_local`)

As for innodb_metrics vs Global Statuses, we agree - we add new stuff to innodb_metrics only.
This means we should not use the special `case MONITOR_OVLD_*` which gets the value from `srv_stats.buf_pool_reads_sync_io_*`. Instead we should just use MONITOR_INC (or better, MONITOR_ATOMIC_INC) etc.

As for OTEL, well, yes, we should probably expose it there, too - I really appreciate you've taken time to do so even though it's not a community feature, thank you. BTW. Would you be interested in OTEL being exposed in Community?

As for duplicated #includes - I once tried to get rid of them in an automated way, removed 2000+ `#include` directives, but the impact on compilation speed was negligible, so compared to the effort of reviewing all that, I ditched the attempt. I just fix them one by one as I spot them - thank you for helping with this.

As for the auto-tuning aspect of this (so that one doesn't have to guess the right threshold value), and giving the user a fuller picture of what is going on (histograms, percentiles, whole distribution)... I don't know really. It looks like it should be relatively easy for anyone really interested in monitoring this problem to let the system run for a while, collect the "usual" average value, set the threshold to some multiple of it, and then set alarms on the counter of slow IOs. Once the alarm hits and they want to really drill down, then I think the proper way to do that is using performance_schema - at least I don't see how we could ever agree which exact way of summarising the situation is "good enough" and efficient enough.
What I like in the current patch is that it is relatively simple and with negligible impact on performance (as compared to, say, trying to implement some hyperloglog/sketches/histograms etc. just for this single usage).

Perhaps instead of adding more and more ad-hoc counters to InnoDB, we should somehow make the performance_schema more usable.
For example, I see we already have SUM,CNT summaries for individual files and operations, which is nice, as you can see how avg value changed over time (if you note down values of SUM and CNT).
If this is insufficient, then what is a minimal extension of this mechanism, which would be helpful?
Would it, for example, help to have SUM_SQUARED, so that you can figure out standard deviation?
Would it help, to be able to reset MIN, and MAX?
[8 May 14:46] Jean-François Gagné
Thanks for reviewing the patch Jakub.

I am taking the opportunity of this comment -- I did not want to overload the bug report with feature discussions -- to mention that Mark's and Rick's comments have been moved to GitHub issues ([1a] and [1b] respectively).

[1a]: https://github.com/jfg956/mysql-server/issues/18

[1b]: https://github.com/jfg956/mysql-server/issues/19

About your questions Jakub...

> 1. Have you seen performance-schema-file-summary-tables ?

Yes, I have.  I tried to implement the feature there, but the P_S interface lacks flexibility and is complex, so I fell back to InnoDB Metrics.  Also, file-summary-tables aggregate all sizes of IOs, which makes it complicated to define "slow read": a 10 ms latency for reading an InnoDB page might be slow, but a 100 ms large read on an InnoDB Data File might be normal.  Such large reads can happen in [2a], and I do not see how to differentiate them from those in [2b].

[2a]: https://github.com/jfg956/mysql-server/blob/mysql-9.0.1/storage/innobase/include/os0file.i...

[2b]: https://github.com/jfg956/mysql-server/blob/mysql-9.0.1/storage/innobase/include/os0file.i...

Also note that P_S accounting is wrong in the [2a] case above: in both read and write instruments, the latency of read and write are accounted "together" (in addition to read, the latency of write is accounted in read -- and vice versa).

> 2. Have you seen ... performance-schema-events-waits-current-table ?

No, I have not, thanks for pointing me there.  But ...

> I am not sure what is the performance impact of enabling it, though. Have you tried?

Generating a full P_S event for each IO looks like it would incur a big CPU and Memory overhead.  Also, the raw data would need to be aggregated to count Slow InnoDB Sync Reads.  It is easier here to know which reads are on a single page (size 16 KB), but still, converting a firehose of events into aggregated counters looks like an approach which is not lean enough to be enabled all the time.  Also, I do not see how to do this aggregation in P_S, and doing it "outbound" of MySQL would incur additional query costs and implementation complexity.  And this is without the "bug" above of writes sometimes being accounted in reads (and as pointed out below, false positives are bad).

> 3. Why did you have to resort to use `innodb_session_t`?

Good question, I should have included more comments on this.  ...

> One reason I could imagine is that you really want to measure just the IO

Exactly !

> does this small overhead really matter? 

I am not sure the overhead is always small.  What comes to mind is a single page flush to acquire a free page, which might incur a significant delay if there is contention on the Free List Mutex.  Accounting such Mutex Delay might generate a Slow IO false positive, and it would confuse DBAs.

> I find this usage of very high level concept of user session in the lowest level function dealing with files a bit "ugly"

I completely agree that this is ugly, but it was the only way I found to carry information between os_aio_func and buf_read_page.

> An alternative, if we are forced, could be to use `thread_local`

It is unclear to me what you mean here.  Is there another "session scoped" object we can use (I only found innodb_session_t)?  From what I understand, innodb_session_t is the right place to put InnoDB data in current_thd, but my knowledge of all this is limited, given my little experience as a MySQL Developer.

> As for innodb_metrics vs Global Statuses, we agree - we add new stuff to innodb_metrics only.

Noted.  What is the best way to move forward here?  Is this change something that can be done by Oracle while integrating my work, or should I update the patch?

> BTW. Would you be interested in OTEL being exposed in Community?

Yes, OTEL in Community would be interesting (the more there is in Community, the better it is IMHO).  But I have not thought about this much.

> As for duplicated #includes [...]

Noted, thanks for the explanation.

> As for the auto-tuning aspect of this [...]

I agree: auto-tuning is complicated to implement in a way that would be one-size-fits-all.  I think the current implementation of the feature already provides a lot of value and is simple and flexible enough to use.  Auto-tuning can be implemented later if needed.  And if I may add a sarcastic comment, this auto-tuning could be implemented at the same time as auto-tuning of innodb_io_capacity: they both have similar challenges, and innodb_io_capacity not currently being auto-tuned shows this is not trivial.

> histograms, percentiles, whole distribution

Same logic as above, this can be future work.

> Perhaps instead of adding more and more ad-hoc counters to InnoDB, we should somehow make the performance_schema more usable.

This is an option, but it is a much bigger project, and the consequence might be to delay the feature availability.  I am not sure it is the best way to provide value to MySQL Users.

> Would it, for example, help to have SUM_SQUARED, so that you can figure out standard deviation?

I am not sure adding the possibility to compute a standard deviation would be enough to detect slow IOs: see my comment above about the definition of slow and the aggregation of small and large IOs in wait/io/file/innodb/innodb_data_file.

> Would it help, to be able to reset MIN, and MAX?

This would be an interesting feature, but I am not sure it solves the problem.  Even after isolating small IOs, IMHO just having MAX is not enough to detect misbehaving IO Subsystems.  In a distributed system, there is always tail latency.  The challenge here is _not_ detecting the existence of tail latency (this is what MAX allows), it is providing visibility on the rate of these tail latencies (which I do not see how a single MAX can achieve).

If I may, I would like to flip the question: instead of looking at how to implement this in P_S, what is wrong with adding InnoDB Metrics ?

The bottom line about P_S is that IMHO, making it suitable for detecting such events (a degraded IO Subsystem) would need giving much more context to P_S, and building a complex aggregators interface.  This basically means building a full Event sub-system in MySQL, and building a low overhead aggregators toolbox.  I do not see how to do this.  I am not against asking P_S people if there is an easy way to _reliably_ detect a degraded IO Subsystem, but as I write above, waiting for a full new version of P_S to do that would delay the feature availability, and I am not sure it is the best way to provide value to MySQL Users (I might be wrong though).
[8 May 16:29] Jakub Lopuszanski
Hello JFG, thanks for quick and deep response.

The pfs_os_file_copy_func(..) is only used in one place: the CLONE PLUGIN.
The os_file_copy_func(..) that it wraps has two implementations, one of them using `sendfile(dest_fd, src_fd, ..)` to shovel data from one file descriptor to another without involvement of the mysqld process, so I don't see a better way of accounting for this than the current one.

AFAIU all other accesses to tablespaces use UNIV_PAGE_SIZE.
Accesses to redo log vary in size. 
I think double-write buffer accesses might also vary.
But performance_schema.events_waits_current has NUMBER_OF_BYTES, so one can use it to drill down.
I don't know what's the impact on performance of enabling the "events_waits_current" consumer.
I don't think it is easy to guess from an armchair - note that, while memory allocation sounds like a big deal in general, we are talking about doing it not more frequently than O(1) per IO so its cost should pale in comparison.

By thread_local, I've meant C++ feature, https://en.cppreference.com/w/cpp/keyword/thread_local.
That's one way to work around situations like this one, where you need to pass some information across a deep call stack, but do not really want to modify the signatures of all the functions along the way.
I could imagine having one such thread_local variable dedicated to "measuring duration of the most recent IO operation done by this thread".
Arguably it's not very ugly, given that the semantics of the variable sound reasonable for the low-level os0file module, and do not couple it tightly with "session" nor "P_S".
Then whoever is interested in learning this value could, as you do, read it after a call which results in IO.
In case they aren't sure if it will result in IO, then as you do, they can reset it to 0 before the call, and compare to 0 afterwards.
(BTW, isn't the `count` returned by `buf_read_page_low` telling you if the IO has happened or not?)
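A minimal sketch of this thread_local pattern (all names illustrative, not from the patch): the low-level IO layer owns a per-thread slot for the duration of the most recent IO, and interested callers reset it before and read it after a call that may do IO.

```cpp
#include <cstdint>

// Thread-local slot owned by the low-level IO module: duration of the
// most recent IO performed by this thread, in microseconds.
// 0 means "no IO happened since the caller reset it".
thread_local uint64_t last_io_duration_usec = 0;

// Hypothetical hook: the IO layer records the duration after each
// completed IO (the real call sites would be in os0file).
void record_io_duration(uint64_t usec) { last_io_duration_usec = usec; }

// Caller pattern: reset before the call that may do IO, read after.
uint64_t observe_io(void (*maybe_do_io)()) {
  last_io_duration_usec = 0;
  maybe_do_io();
  return last_io_duration_usec;  // 0 if no IO was recorded
}
```

This keeps the session concept out of os0file entirely: no function signature along the call stack changes, and only callers that care about the measurement touch the variable.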

> Also, I do not see how to do this aggregation in P_S
I am not sure how complicated a kind of aggregation you had in mind, but if you want something like "count the number of 16KB page reads which took longer than 0.01 seconds since the last time I've checked", then wouldn't something like:
```
SELECT COUNT(*), MAX(TIMER_END) as this_time_I_ve_checked
FROM performance_schema.events_waits_current
WHERE EVENT_NAME="wait/io/file/innodb/innodb_data_file"
  AND TIMER_WAIT > 10000000000
  AND TIMER_START > ${last_time_I_ve_checked}
  AND NUMBER_OF_BYTES = 16384
```
do the trick?

> Is this change something that can be done by Oracle while integrating my work, or should I update the patch ?
The former, but realistically, the more production-ready the contribution, the bigger the chance that someone will find time to integrate it.

> I think the current implementation of the feature already provides a lot of value and is simple and flexible enough to use.
I think we don't want anything more complicated than this patch.
I am thinking on how to simplify it even further.
Why? Because you've mentioned you'd like to add something similar for a bunch of other IO types, so whatever we do here will get multiplied.
Thus, ideally, I'd prefer to have a single solution for all of that in one go - such as extending existing performance_schema in some generic way.
If that's impossible, then at least I'd like to avoid having boilerplate code copy&pasted into all the places where we do IO.

> I am not sure adding the possibility to compute a standard deviation would be enough to detect slow IOs: see my comment above about the definition of slow and the aggregation of small and large IOs in wait/io/file/innodb/innodb_data_file.

OK, but as I've said, this pfs_os_file_copy_func(..) is only used in one place, in the CLONE PLUGIN. Does it change your perspective?
Just to add a bit more details to my proposal.
Suppose that in addition to CNT and SUM, you also have SUM_SQUARED which tracks sum of squares of durations.
Then your monitoring could note down these three values periodically, and by subtracting the 3 values seen this time from those seen last time, figure out what was the CNT, SUM and SUM_SQUARED within a given time period.
From this you can compute standard deviation as:
SQRT(SUM_SQUARED/CNT - (SUM/CNT)^2)
If the number of slow IO operations increased, that should impact SUM_SQUARED a lot, and also increase stddev, right?
(And if we posit that the number of such slow IOs is so small as to not be visible in SUM_SQUARED/stddev, yet somehow is still important for your business, then another option would be to track SUM_CUBES to obtain higher moments of the distribution.)
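Numerically, the interval computation sketched above (delta the three snapshot values, then derive the standard deviation from the raw moments) is, as a sketch:

```cpp
#include <cmath>

// Standard deviation over an interval from raw moment counters:
// cnt, sum and sum_squared are deltas between two snapshots of
// CNT, SUM and SUM_SQUARED. Variance = E[X^2] - (E[X])^2.
double stddev_from_moments(double cnt, double sum, double sum_squared) {
  const double mean = sum / cnt;
  return std::sqrt(sum_squared / cnt - mean * mean);
}
```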

> [resetting MIN and MAX] would be an interesting feature, but I am not sure it solves the problem.  Even after isolating small IOs and IMHO, just having MAX is not enough to detect miss-behaving IO Subsystems.  In a distributed system, there is always tail latency.  The challenge here is _not_ detecting the existence of tail latency (this is what MAX allows), it is providing visibility on the rate of these tail latency (which I do not see how a single MAX can achieve).

Thanks for sharing this perspective.

> If I may, I would like to flip the question: instead of looking at how to implement this in P_S, what is wrong with adding InnoDB Metrics ?
I am afraid it is a road to having a lot of ad-hoc code (you've listed "6 or more" use-cases), which is not only a maintenance problem for the identified 6+ places, but also something we will have to remember whenever adding yet another place.
I'd prefer something more generic.

To give one example of "something more generic": What if we add CNT_ABOVE_THRESHOLD column to https://dev.mysql.com/doc/refman/9.3/en/performance-schema-file-summary-tables.html where the threshold is 
a) controlled globally by a single sys-var expressed in seconds
b) controlled globally by a single sys-var expressed in "multiples of AVG"
c) controlled per EVENT_NAME via some helper table (I don't know what's the performance impact of having to consult/cache/maintain it)