Bug #117740 Please consider adding Slow IO Counters.
Submitted: 18 Mar 12:35 Modified: 8 May 14:47
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S4 (Feature request)
Version: 9.0
OS: Any
CPU Architecture: Any
Tags: Contribution

[18 Mar 12:35] Jean-François Gagné
Description:
Hi,

I am opening this feature request for suggesting adding slow IO Counters.  Below, I start by giving context about why I think this is needed, and then I list counters that could fulfill this need.

When running MySQL on "complex" block devices (in the cloud on remote storage, on-prem on SANs, or others), it is not uncommon to get degraded query performance because of convoluted failure modes of the IO subsystem (shared devices overloaded, RAID rebuilding, increased tail latencies, rate limiting, ...).  Pinpointing these degradations to the IO subsystem is usually not straightforward, because these complex block devices do not always report such failures, and even if they do, MySQL Engineers do not always have access to this information because these devices are in the scope of another team (AWS or GCP for EBS or PV, a different storage team on-prem, ...).  The goal of the Slow IO Counters suggested in this feature request is to provide MySQL Engineers with evidence of such IO subsystem failures, and to allow automation to deal with them (failover of a degraded primary to a standby, removing a degraded read replica from a load balancer, ...).

I see the following counters as useful to diagnose and react to misbehaving IO subsystem:

1. Slow InnoDB Sync Reads : for fetching a page into the Buffer Pool (for a SELECT or a DML).  I use the name "Sync" Reads to exclude read-ahead and read-ahead random (I thought of naming this "Direct" instead of "Sync", but I chose Sync because Direct brings confusion with O_DIRECT, even though Sync can be confused with cache flushing / sync-ing).  Read-aheads are excluded because reading a whole extent is not the same as reading a single page, and the definition of "Slow" might be different for each case.

2. Slow Binary Log Writes / Flushes : for writes / flushes to the binary logs.  Maybe we do not need write counters because these are usually cached.  Maybe this should exclude Flushes for binary log rotation, because these are "bigger" than Flushes for transactions (sync_binlog = 1).  Arguably, the definition of slow is complicated here, as transactions are not equal: some are small and some are big, which can impact IO latency; and Group Commit might make this even more complicated because of a single flush covering many transactions.

3. Slow InnoDB Redo Log Writes / Flushes : same as #2 above, but for the Redo Logs.

4. Slow InnoDB Page Cleaning Writes / Flushes : for writes and flushes during Page Cleaning.  This would probably include the doublewrite buffer, but this might need refinement.  This counter might not be absolutely needed because the majority of misbehaving IO subsystems should be caught by #1, #2 and #3, but it might be more reliable than #2 and #3 because the definition of slow is easier here.

5. Slow InnoDB Single Page Flush Writes / Flushes : same as #4 above, but for the Single Page Flush (needing a free page while none are in the free list).

6. Slow InnoDB Read-Aheads : same as #1 above, but for read-aheads.  This counter might not be absolutely needed because the majority of misbehaving IO subsystems should be caught by #1.

7. Maybe more...

Setting this as Category Server instead of InnoDB because some counters (binlog writes / flushes) are not in InnoDB.

I am planning to submit a patch for Slow InnoDB Sync Reads very soon (I just need the bug number to complete things).  I might submit other patches for other counters, but I might also leave this to someone else (I have not yet started work on other counters).

Many thanks for considering this feature request,

Jean-François Gagné

How to repeat:
(N/A because feature request)

Suggested fix:
(See description and the patch I will contribute very soon)
[18 Mar 12:42] MySQL Verification Team
Hello Jean-François,

Thank you for the feature request!

regards,
Umesh
[18 Mar 12:47] MySQL Verification Team
Hello Jean-François,

For now I'm setting category "InnoDB", but I understood this feature request covers other modules as well. Thank you.

regards,
Umesh
[18 Mar 13:29] J-F Aiven Gagné
More about this contribution in https://github.com/jfg956/mysql-server/pull/17

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug117740.patch (application/octet-stream, text), 24.63 KiB.

[18 Mar 13:30] J-F Aiven Gagné
Some notes about my above contribution extracted from https://github.com/jfg956/mysql-server/pull/17

This PR merges on 9.2.0. Adapting it to 8.4 and 8.0 should be little work.

For implementing counters for Slow InnoDB Sync Reads, this PR introduces a new global variable : innodb_buffer_pool_read_sync_slow_io_threshold_usec. The default value is 1 hour, which should not trigger any increase of the slow counters. For monitoring slow InnoDB Sync Reads, this threshold should be set in such a way that the counters do not increase most of the time, and increase significantly when the IO subsystem is misbehaving (I cannot tell you exactly how to set this up because it will depend on your IO subsystem, but a value close to p99 of its IO latency might be good). Note that occasional increase of the counters should not be interpreted as a misbehaving IO subsystem because tail latencies will always happen.

The InnoDB Metrics names of the four counters introduced by this PR, with the matching global status variables in parentheses, are:

- buf_pool_reads_sync_io_count (innodb_buffer_pool_reads_sync_io_count);

- buf_pool_reads_sync_io_wait_usec (innodb_buffer_pool_reads_sync_io_wait_usec);

- buf_pool_reads_sync_io_slow_count (innodb_buffer_pool_reads_sync_io_slow_count);

- buf_pool_reads_sync_io_slow_wait_usec (innodb_buffer_pool_reads_sync_io_slow_wait_usec).

The first two counters increase for all Sync Read IOs, and the last two only when the wait time of an IO is above the threshold.
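As a rough illustration of the accounting described above (a sketch only; the struct and member names are hypothetical, not the patch's actual code), every sync read IO bumps the first pair of counters, and only IOs whose wait exceeds the threshold also bump the slow pair:

```cpp
#include <cstdint>

// Hypothetical sketch of the four-counter accounting: count/wait_usec
// track all sync read IOs, slow_count/slow_wait_usec only those whose
// wait time exceeds the configured threshold.
struct SyncReadIoCounters {
  uint64_t count = 0;
  uint64_t wait_usec = 0;
  uint64_t slow_count = 0;
  uint64_t slow_wait_usec = 0;

  void account(uint64_t io_wait_usec, uint64_t threshold_usec) {
    count += 1;
    wait_usec += io_wait_usec;
    if (io_wait_usec > threshold_usec) {
      slow_count += 1;
      slow_wait_usec += io_wait_usec;
    }
  }
};
```

With the default threshold of 1 hour, the slow branch would essentially never trigger, which matches the "should not trigger any increase" behaviour described above.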

I think that this PR should not bring any significant performance degradation, but I have yet to fully validate this. I might do this in the next weeks / months and add details in the bug and in this PR. Also in the next weeks / months and if time allows, I might submit an improved version of this PR with tests, and with adjustments taking into account the feedback I might have received.
[18 Mar 13:33] Jean-François Gagné
(adding the contribution tag)
[26 Mar 16:35] Mark Callaghan
JFG - I support this change.

1) I also want monitoring for fsync/fdatasync, and even better if binlog/InnoDB usage of that is split into separate counters, because binlog fsync has intermittently high latency with some filesystems of the ext family

2) I prefer response time histograms rather than one counter for high latency responses. Although, histograms introduce other problems:
a) how do you display them
b) how do you avoid too many buckets

For 2a), how to display them, they can be flattened with one counter per bucket. And/or they can be stored in an information_schema table

For 2b) this is less of an issue now that spinning disks are not frequently used, but the concern is that the spread in latencies can be large if the buckets are defined statically (local SSD is fast, cloud SSD is somewhat fast, disk is slow). Perhaps a my.cnf option to adjust the boundaries would help.
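One possible shape for such a histogram (a sketch, not an existing MySQL facility; names are illustrative): bucket boundaries come from a setting such as a my.cnf option, and each bucket is a plain counter, so the whole histogram can be flattened into one status variable per bucket as suggested in 2a).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Latency histogram with configurable bucket boundaries. counts has
// one extra bucket for samples above the largest boundary.
struct LatencyHistogram {
  std::vector<uint64_t> boundaries_usec;  // ascending upper bounds
  std::vector<uint64_t> counts;

  explicit LatencyHistogram(std::vector<uint64_t> bounds)
      : boundaries_usec(std::move(bounds)),
        counts(boundaries_usec.size() + 1, 0) {}

  void record(uint64_t usec) {
    // The first boundary strictly greater than the sample picks the bucket.
    const auto it = std::upper_bound(boundaries_usec.begin(),
                                     boundaries_usec.end(), usec);
    counts[static_cast<size_t>(it - boundaries_usec.begin())] += 1;
  }
};
```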
[5 Apr 3:34] Rick James
Instead of having an instantaneous number (or a histogram), keep an "exponential moving average".  That stores only one number, but it changes over time to indicate that the timings are changing.  The math requires only a subtract, a multiply, and an add.
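For concreteness, the update described above (one stored number, refreshed with a subtract, a multiply, and an add) could look like this sketch:

```cpp
// Exponential moving average: a single stored number that drifts
// toward recent samples. alpha in (0, 1] controls how fast old
// samples are forgotten.
double ema_update(double avg, double sample, double alpha) {
  return avg + alpha * (sample - avg);
}
```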

As a person who is often given a single snapshot of STATUS and VARIABLES, I beg for enough information to make an educated guess of whether a STATUS value is "too high" and then look at things like "Questions/Uptime" (etc) to decide where to look next.

That is, please don't invent a new STATUS without providing a simple cookbook that says (1) what value is too high (or too low) and (2) what to do to investigate further.
[5 Apr 3:38] Rick James
I suspect "buf_pool_reads_sync_io_wait_usec" is the value to do the averaging over.  (See my previous comment.)
[7 May 17:46] Jakub Lopuszanski
Hello JFG,
thank you for very nice description, motivation, code comments and code.

I have a few high-level questions, though.

1. Have you seen https://dev.mysql.com/doc/refman/9.3/en/performance-schema-file-summary-tables.html - I realise there is no differentiation between slow vs fast - all you get is MIN, MAX, AVG, CNT. But I was thinking that perhaps this is the table we should extend with new features like "99th percentile" instead? Or perhaps simply monitoring MAX or AVG alone would already suffice for your needs?

2. Have you seen
https://dev.mysql.com/doc/refman/9.3/en/performance-schema-setup-consumers-table.html and 
https://dev.mysql.com/doc/refman/9.3/en/performance-schema-events-waits-current-table.html ?

AFAIU you can do:

UPDATE performance_schema.setup_consumers SET ENABLED="YES" WHERE NAME="events_waits_current";

And then query the performance_schema.events_waits_current for individual "events" related to IO - AFAIU each single read or write on each file is reported this way:
            THREAD_ID: 15
             EVENT_ID: 67
         END_EVENT_ID: 67
           EVENT_NAME: wait/io/file/innodb/innodb_data_file
               SOURCE: fil0fil.cc:7823
          TIMER_START: 224652271154586
            TIMER_END: 224652310505066
           TIMER_WAIT: 39350480
                SPINS: NULL
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: C:\ade\mysql\2\mysql-bin\data\ibdata1
           INDEX_NAME: NULL
          OBJECT_TYPE: FILE
OBJECT_INSTANCE_BEGIN: 2046180445888
     NESTING_EVENT_ID: NULL
   NESTING_EVENT_TYPE: NULL
            OPERATION: write
      NUMBER_OF_BYTES: 16384
                FLAGS: NULL
This is so fine-grained that you can reconstruct not just the whole distribution, but the actual timeline of what was happening.
I am not sure what is the performance impact of enabling it, though. Have you tried?

3. Why did you have to resort to use `innodb_session_t`? I mean, why not just do the simplest possible thing like:
```
+ const auto before = now();
  count = buf_read_page_low(&err, true, 0, BUF_READ_ANY_PAGE, page_id,
                            page_size, false);
+ const auto after = now();
+ if (count) {
...
```
One reason I could imagine is that you really want to measure just the IO, not the time it takes to acquire a page from BP, or to decompress the raw bytes.
But...if your goal is to just focus on really slow IOs then does this small overhead really matter? 
I ask because I find this usage of the very high-level concept of a user session in the lowest-level function dealing with files a bit "ugly", and wonder if there is a good rationale for this sacrifice.
(An alternative, if we are forced, could be to use `thread_local`)

As for innodb_metrics vs Global Statuses, we agree - we add new stuff to innodb_metrics only.
This means we should not use the special `case MONITOR_OVLD_*` which gets the value from `srv_stats.buf_pool_reads_sync_io_*`. Instead we should just use MONITOR_INC (or better, MONITOR_ATOMIC_INC) etc.

As for OTEL, well, yes, we should probably expose it there, too - I really appreciate you've taken time to do so even though it's not a community feature, thank you. BTW. Would you be interested in OTEL being exposed in Community?

As for duplicated #includes - I once tried to get rid of them in an automated way, removed 2000+ `#include` directives, but the impact on compilation speed was negligible, so compared to the effort of reviewing all that, I ditched the attempt. I just fix them one by one as I spot them - thank you for helping with this.

As for the auto-tuning aspect of this (so that one doesn't have to guess the right threshold value), and giving the user a fuller picture of what is going on (histograms, percentiles, whole distribution)... I don't know really. It looks like it should be relatively easy for anyone really interested in monitoring this problem to let the system run for a while, collect the "usual" average value, set the threshold to some multiple of it, and then set alarms on the counter of slow IOs. Once the alarm hits and they want to really drill down, then I think the proper way to do that is using performance_schema - at least I don't see how we could ever agree which exact way of summarising the situation is "good enough" and efficient enough.
What I like in the current patch is that it is relatively simple and with negligible impact on performance (as compared to, say, trying to implement some hyperloglog/sketches/histograms etc. just for this single usage).

Perhaps instead of adding more and more ad-hoc counters to InnoDB, we should somehow make the performance_schema more usable.
For example, I see we already have SUM,CNT summaries for individual files and operations, which is nice, as you can see how avg value changed over time (if you note down values of SUM and CNT).
If this is insufficient, then what is a minimal extension of this mechanism, which would be helpful?
Would it, for example, help to have SUM_SQUARED, so that you can figure out standard deviation?
Would it help, to be able to reset MIN, and MAX?
[8 May 14:46] Jean-François Gagné
Thanks for reviewing the patch Jakub.

I am taking the opportunity of this comment -- I did not want to overload the bug report with feature discussions -- to mention that Mark's and Rick's comments have been moved to GitHub issues ([1a] and [1b] respectively).

[1a]: https://github.com/jfg956/mysql-server/issues/18

[1b]: https://github.com/jfg956/mysql-server/issues/19

About your questions Jakub...

> 1. Have you seen performance-schema-file-summary-tables ?

Yes, I have.  I tried to implement the feature there, but the P_S interface lacks flexibility and is complex, so I fell back to InnoDB Metrics.  Also, file-summary-tables aggregate all sizes of IOs, which makes it complicated to define "slow read": a 10 ms latency for reading an InnoDB page might be slow, but a 100 ms large read on an InnoDB Data File might be normal.  Such large reads can happen in [2a], and I do not see how to differentiate them from those in [2b].

[2a]: https://github.com/jfg956/mysql-server/blob/mysql-9.0.1/storage/innobase/include/os0file.i...

[2b]: https://github.com/jfg956/mysql-server/blob/mysql-9.0.1/storage/innobase/include/os0file.i...

Also note that P_S accounting is wrong in the [2a] case above: in both read and write instruments, the latency of read and write are accounted "together" (in addition to read, the latency of write is accounted in read -- and vice versa).

> 2. Have you seen ... performance-schema-events-waits-current-table ?

No, I have not, thanks for pointing me there.  But ...

> I am not sure what is the performance impact of enabling it, though. Have you tried?

Generating a full P_S event for each IO looks like it would incur a big CPU and Memory overhead.  Also, the raw data would need to be aggregated to count Slow InnoDB Sync Reads.  It is easier here to know which reads are on a single page (size 16 KB), but still, converting a firehose of events into aggregated counters looks like an approach which is not lean enough to be enabled all the time.  Also, I do not see how to do this aggregation in P_S, and doing it "outbound" of MySQL would incur additional query costs and implementation complexity.  And this is without the "bug" above of writes sometimes being accounted in reads (and as pointed out below, false positives are bad).

> 3. Why did you have to resort to use `innodb_session_t`?

Good question, I should have included more comments on this.  ...

> One reason I could imagine is that you really want to measure just the IO

Exactly !

> does this small overhead really matter? 

I am not sure the overhead is always small.  What comes to mind is a single page flush to acquire a free page, which might incur a significant delay if there is contention on the Free List Mutex.  Accounting such Mutex Delay might generate a Slow IO false positive, and it would confuse DBAs.

> I find this usage of very high level concept of user session in the lowest level function dealing with files a bit "ugly"

I completely agree that this is ugly, but it was the only way I found to carry information between os_aio_func and buf_read_page.

> An alternative, if we are forced, could be to use `thread_local`

It is unclear to me what you mean here.  Is there another "session scoped" object we can use (I only found innodb_session_t)?  From what I understand, innodb_session_t is the right place to put InnoDB data in current_thd, but my knowledge of all this is limited, given my little experience as a MySQL Developer.

> As for innodb_metrics vs Global Statuses, we agree - we add new stuff to innodb_metrics only.

Noted.  What is the best way to move forward here?  Is this change something that can be done by Oracle while integrating my work, or should I update the patch?

> BTW. Would you be interested in OTEL being exposed in Community?

Yes, OTEL in Community would be interesting (the more there is in Community, the better it is IMHO).  But I have not thought about this much.

> As for duplicated #includes [...]

Noted, thanks for the explanation.

> As for the auto-tuning aspect of this [...]

I agree: auto-tuning is complicated to implement in a way that would be one-size-fits-all.  I think the current implementation of the feature already provides a lot of value and is simple and flexible enough to use.  Auto-tuning can be implemented later if needed.  And if I may add a sarcastic comment, this auto-tuning could be implemented at the same time as auto-tuning of innodb_io_capacity: they both have similar challenges, and innodb_io_capacity not currently being auto-tuned shows this is not trivial.

> histograms, percentiles, whole distribution

Same logic as above, this can be future work.

> Perhaps instead of adding more and more ad-hoc counters to InnoDB, we should somehow make the performance_schema more usable.

This is an option, but it is a much bigger project, and the consequence might be to delay the feature availability.  I am not sure it is the best way to provide value to MySQL Users.

> Would it, for example, help to have SUM_SQUARED, so that you can figure out standard deviation?

I am not sure adding the possibility to compute a standard deviation would be enough to detect slow IOs: see my comment above about the definition of slow and the aggregation of small and large IOs in wait/io/file/innodb/innodb_data_file.

> Would it help, to be able to reset MIN, and MAX?

This would be an interesting feature, but I am not sure it solves the problem.  Even after isolating small IOs, IMHO just having MAX is not enough to detect misbehaving IO Subsystems.  In a distributed system, there is always tail latency.  The challenge here is _not_ detecting the existence of tail latency (this is what MAX allows), it is providing visibility on the rate of these tail latencies (which I do not see how a single MAX can achieve).

If I may, I would like to flip the question: instead of looking at how to implement this in P_S, what is wrong with adding InnoDB Metrics ?

The bottom line about P_S is that IMHO, making it suitable for detecting such events (a degraded IO Subsystem) would need giving much more context to P_S, and building a complex aggregators interface.  This basically means building a full Event sub-system in MySQL, and building a low overhead aggregators toolbox.  I do not see how to do this.  I am not against asking P_S people if there is an easy way to _reliably_ detect a degraded IO Subsystem, but as I write above, waiting for a full new version of P_S to do that would delay the feature availability, and I am not sure it is the best way to provide value to MySQL Users (I might be wrong though).
[8 May 16:29] Jakub Lopuszanski
Hello JFG, thanks for quick and deep response.

The pfs_os_file_copy_func(..) is only used in one place: the CLONE PLUGIN.
The os_file_copy_func(..) that it wraps has two implementations, one of them using `sendfile(dest_fd, src_fd, ..)` to shovel data from one file descriptor to another without involvement of the mysqld process, so I don't see a better way of accounting for this than the current one.

AFAIU all other accesses to tablespaces use UNIV_PAGE_SIZE.
Accesses to redo log vary in size. 
I think double-write buffer accesses might also vary.
But performance_schema.events_waits_current has NUMBER_OF_BYTES, so one can use it to drill down.
I don't know what's the impact on performance of enabling the "events_waits_current" consumer.
I don't think it is easy to guess from an armchair - note that, while memory allocation sounds like a big deal in general, we are talking about doing it not more frequently than O(1) per IO so its cost should pale in comparison.

By thread_local, I've meant C++ feature, https://en.cppreference.com/w/cpp/keyword/thread_local.
That's one way to work around situations like this one, where you need to pass some information across a deep call stack, but do not really want to modify the signatures of all the functions along the way.
I could imagine having one such thread_local variable dedicated to "measuring duration of the most recent IO operation done by this thread".
Arguably it's not very ugly, given that the semantics of the variable sound reasonable for the low-level os0file module, and do not couple it tightly with "session" nor "P_S".
Then whoever is interested in learning this value could, as you do, read it after a call which results in IO.
In case they aren't sure if it will result in IO, then as you do, they can reset it to 0 before the call, and compare to 0 afterwards.
(BTW, isn't the `count` returned by `buf_read_page_low` telling you if the IO has happened or not?)
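A minimal sketch of this thread_local pattern (all names illustrative, not from the patch): the low-level IO layer owns a per-thread slot for the duration of the most recent IO, and interested callers reset it before and read it after a call that may do IO.

```cpp
#include <cstdint>

// Thread-local slot owned by the low-level IO module: duration of the
// most recent IO performed by this thread, in microseconds.
// 0 means "no IO happened since the caller reset it".
thread_local uint64_t last_io_duration_usec = 0;

// Hypothetical hook: the IO layer records the duration after each
// completed IO (the real call sites would be in os0file).
void record_io_duration(uint64_t usec) { last_io_duration_usec = usec; }

// Caller pattern: reset before the call that may do IO, read after.
uint64_t observe_io(void (*maybe_do_io)()) {
  last_io_duration_usec = 0;
  maybe_do_io();
  return last_io_duration_usec;  // 0 if no IO was recorded
}
```

This keeps the session concept out of os0file entirely: no function signature along the call stack changes, and only callers that care about the measurement touch the variable.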

> Also, I do not see how to do this aggregation in P_S
I am not sure how complicated a kind of aggregation you had in mind, but if you want something like "count the number of 16KB page reads which took longer than 0.01 seconds since the last time I've checked", then wouldn't something like:
```
SELECT COUNT(*), MAX(TIMER_END) as this_time_I_ve_checked
FROM performance_schema.events_waits_current
WHERE EVENT_NAME="wait/io/file/innodb/innodb_data_file"
  AND TIMER_WAIT > 10000000000
  AND TIMER_START > ${last_time_I_ve_checked}
  AND NUMBER_OF_BYTES = 16384
```
do the trick?

> Is this change something that can be done by Oracle while integrating my work, or should I update the patch ?
The former, but realistically, the more production-ready the contribution, the bigger the chance that someone will find time to integrate it.

> I think the current implementation of the feature already provides a lot of value and is simple and flexible enough to use.
I think we don't want anything more complicated than this patch.
I am thinking on how to simplify it even further.
Why? Because you've mentioned you'd like to add something similar for a bunch of other IO types, so whatever we do here will get multiplied.
Thus, ideally, I'd prefer to have a single solution for all of that in one go - such as extending existing performance_schema in some generic way.
If that's impossible, then at least I'd like to avoid having boilerplate code copy&pasted into all the places where we do IO.

> I am not sure adding the possibility to compute a standard deviation would be enough to detect slow IOs: see my comment above about the definition of slow and the aggregation of small and large IOs in wait/io/file/innodb/innodb_data_file.

OK, but as I've said, this pfs_os_file_copy_func(..) is only used in one place, in the CLONE PLUGIN. Does it change your perspective?
Just to add a bit more details to my proposal.
Suppose that in addition to CNT and SUM, you also have SUM_SQUARED which tracks sum of squares of durations.
Then your monitoring could note down these three values periodically, and by subtracting the 3 values seen this time from those seen last time, figure out what was the CNT, SUM and SUM_SQUARED within a given time period.
From this you can compute standard deviation as:
SQRT(SUM_SQUARED/CNT - (SUM/CNT)^2)
If the number of slow IO operations increased, that should impact SUM_SQUARED a lot, and also increase stddev, right?
(And if we posit that the number of such slow IOs is so small as to not be visible in SUM_SQUARED/stddev, yet somehow is still important for your business, then another option would be to track SUM_CUBES to obtain higher moments of the distribution.)
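Numerically, the interval computation sketched above (delta the three snapshot values, then derive the standard deviation from the raw moments) is, as a sketch:

```cpp
#include <cmath>

// Standard deviation over an interval from raw moment counters:
// cnt, sum and sum_squared are deltas between two snapshots of
// CNT, SUM and SUM_SQUARED. Variance = E[X^2] - (E[X])^2.
double stddev_from_moments(double cnt, double sum, double sum_squared) {
  const double mean = sum / cnt;
  return std::sqrt(sum_squared / cnt - mean * mean);
}
```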

> [resetting MIN and MAX] would be an interesting feature, but I am not sure it solves the problem.  Even after isolating small IOs and IMHO, just having MAX is not enough to detect miss-behaving IO Subsystems.  In a distributed system, there is always tail latency.  The challenge here is _not_ detecting the existence of tail latency (this is what MAX allows), it is providing visibility on the rate of these tail latency (which I do not see how a single MAX can achieve).

Thanks for sharing this perspective.

> If I may, I would like to flip the question: instead of looking at how to implement this in P_S, what is wrong with adding InnoDB Metrics ?
I am afraid it is a road to having a lot of ad-hoc code (you've listed "6 or more" use-cases), which is not only a maintenance problem for the identified 6+ places, but also something we will have to remember whenever adding yet another place.
I'd prefer something more generic.

To give one example of "something more generic": What if we add CNT_ABOVE_THRESHOLD column to https://dev.mysql.com/doc/refman/9.3/en/performance-schema-file-summary-tables.html where the threshold is 
a) controlled globally by a single sys-var expressed in seconds
b) controlled globally by a single sys-var expressed in "multiples of AVG"
c) controlled per EVENT_NAME via some helper table (I don't know what's the performance impact of having to consult/cache/maintain it)