Bug #117740 | Please consider adding Slow IO Counters. | |
---|---|---|---
Submitted: | 18 Mar 12:35 | Modified: | 18 Mar 13:33
Reporter: | Jean-François Gagné | Email Updates: |
Status: | Verified | Impact on me: |
Category: | MySQL Server: InnoDB storage engine | Severity: | S4 (Feature request)
Version: | 9.0 | OS: | Any
Assigned to: | | CPU Architecture: | Any
Tags: | Contribution | |
[18 Mar 12:35]
Jean-François Gagné
[18 Mar 12:42]
MySQL Verification Team
Hello Jean-François, Thank you for the feature request! regards, Umesh
[18 Mar 12:47]
MySQL Verification Team
Hello Jean-François, For now I'm setting the category to "InnoDB", but I understand this feature request covers other modules as well. Thank you. regards, Umesh
[18 Mar 13:29]
J-F Aiven Gagné
More about this contribution in https://github.com/jfg956/mysql-server/pull/17 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: bug117740.patch (application/octet-stream, text), 24.63 KiB.
[18 Mar 13:30]
J-F Aiven Gagné
Some notes about my above contribution, extracted from https://github.com/jfg956/mysql-server/pull/17.

This PR merges on 9.2.0; adapting it to 8.4 and 8.0 should be little work.

To implement counters for Slow InnoDB Sync Reads, this PR introduces a new global variable: innodb_buffer_pool_read_sync_slow_io_threshold_usec. The default value is 1 hour, which should not trigger any increase of the slow counters. For monitoring slow InnoDB Sync Reads, this threshold should be set in such a way that the counters do not increase most of the time, and increase significantly when the IO subsystem is misbehaving (I cannot tell you exactly how to set this because it will depend on your IO subsystem, but a value close to the p99 of its IO latency might be good). Note that an occasional increase of the counters should not be interpreted as a misbehaving IO subsystem, because tail latencies will always happen.

The InnoDB Metric names of the four counters introduced by this PR, with the matching global status in parentheses, are:
- buf_pool_reads_sync_io_count (innodb_buffer_pool_reads_sync_io_count);
- buf_pool_reads_sync_io_wait_usec (innodb_buffer_pool_reads_sync_io_wait_usec);
- buf_pool_reads_sync_io_slow_count (innodb_buffer_pool_reads_sync_io_slow_count);
- buf_pool_reads_sync_io_slow_wait_usec (innodb_buffer_pool_reads_sync_io_slow_wait_usec).

The first two counters increase for all Sync Read IOs, and the last two only when the wait time of an IO is above the threshold.

I think this PR should not bring any significant performance degradation, but I have yet to fully validate this. I might do so in the next weeks / months and add details in the bug and in this PR. Also in the next weeks / months, if time allows, I might submit an improved version of this PR with tests, and with adjustments taking into account any feedback I have received.
[18 Mar 13:33]
Jean-François Gagné
(adding the contribution tag)
[26 Mar 16:35]
Mark Callaghan
JFG - I support this change.

1) I also want monitoring for fsync/fdatasync, and even better if binlog/InnoDB usage of that is split into separate counters, because binlog fsync has intermittently high latency with some of the ext family.

2) I prefer response time histograms rather than one counter for high latency responses. Although, histograms introduce other problems:
a) how do you display them?
b) how do you avoid too many buckets?

For 2a), how to display them: they can be flattened with one counter per bucket, and/or they can be stored in an information_schema table.

For 2b), this is less of an issue now that spinning disks are not frequently used, but the concern is that the spread in latencies can be large if the buckets are defined statically (local SSD is fast, cloud SSD is somewhat fast, disk is slow). Perhaps a my.cnf option to adjust the boundaries would help.