Bug #112252 InnoDB page compression impacts checkpointing
Submitted: 4 Sep 2023 16:20 Modified: 30 Oct 2023 14:13
Reporter: Przemyslaw Malkowski Email Updates:
Status: Analyzing Impact on me: None
Category: MySQL Server: InnoDB storage engine Severity: S3 (Non-critical)
Version: 8.0.33, 8.0.34 OS: Any
Assigned to: CPU Architecture: Any
Tags: flushing, page compression

[4 Sep 2023 16:20] Przemyslaw Malkowski
Description:
With page compression enabled for a table, flushing becomes noticeably less effective during a very write-intensive (append/insert) workload, compared to the same table without compression.

The tested workload causes quick checkpoint age growth, up to the maximum age level. For uncompressed tables, checkpointing is handled mostly by adaptive flushing and reaches sync flushing only occasionally, while the same workload and settings with the table using COMPRESSION='LZ4' trigger long periods of sync flushing, leaving the instance stalled for competing clients.

Both tests show similar InnoDB rows-written levels, similar redo log throughput, etc. The most visible difference is the higher checkpoint age and more sync flushing for the compressed table.

How to repeat:
All settings default, except:

innodb_buffer_pool_size = 4G
innodb_log_file_size=1G
innodb_monitor_enable = all
innodb_io_capacity = 1000
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=2
sync_binlog=0

I tested on a box with an 11th-gen Core i7 and an NVMe drive (capable of far more than 1k IOPS).

Table definition:

CREATE TABLE test_1 (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `rId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `cId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `aId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `newV` longtext,
  `oldV` longtext,
  `uId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `acId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `IncrId` bigint NOT NULL,
  `dTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `IDX1` (`IncrId`),
  KEY `IDX2` (`aId`,`acId`),
  KEY `IDX3` (`aId`,`rId`),
  KEY `IDX4` (`aId`,`cId`,`rId`)
) ENGINE=InnoDB AUTO_INCREMENT=6000001 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION='none';

To test random inserts, I used the https://github.com/Percona-Lab/mysql_random_data_load tool, running four parallel clients as below:

$ mysql_random_data_load --bulk-size=50 --max-threads=2 -u msandbox -p msandbox --host="127.0.0.1" -P 8040 test test_1 1500000 &
$ mysql_random_data_load --bulk-size=50 --max-threads=2 -u msandbox -p msandbox --host="127.0.0.1" -P 8040 test test_1 1500000 &
$ mysql_random_data_load --bulk-size=50 --max-threads=2 -u msandbox -p msandbox --host="127.0.0.1" -P 8040 test test_1 1500000 &
$ mysql_random_data_load --bulk-size=50 --max-threads=2 -u msandbox -p msandbox --host="127.0.0.1" -P 8040 test test_1 1500000 &

Which resulted in:
4m6s [====================================================================] 100%
INFO[2023-09-04T17:41:59+02:00] 1500000 rows inserted                        

At the end of the test, the table has exactly 6M rows:
mysql > select count(*) from test_1;
+----------+
| count(*) |
+----------+
|  6000000 |
+----------+
1 row in set (0.18 sec)

Before the second test, wait until the checkpoint age goes down to 0, then run:

truncate table test_1;
alter table test_1 compression="LZ4";

Suggested fix:
Allow checkpointing to be handled by adaptive flushing where possible, as with an uncompressed table.
[4 Sep 2023 16:21] Przemyslaw Malkowski
Example graphs showing the different flushing behavior with compression vs. no compression

Attachment: FLushing_compression.png (image/png, text), 77.30 KiB.

[5 Sep 2023 9:02] Jakub Lopuszanski
Hi! Thanks for the interesting info.

> similar redo log throughput

Hm, how was that established? Could you show results of
SELECT COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME IN ('log_lsn_current', 'log_lsn_last_checkpoint');
before and after each test?

The reason I ask is that I'd like to know whether the amount of redo log produced for compressed pages is similar to the amount for uncompressed pages.

> The most visible difference is higher checkpoint age and sync flushing for the compressed table.

Looking at the lower chart in FLushing_compression.png it looks to me that the red line (described as Sync Flushing) is actually higher for the later run (described as compression=none).
What measurement do you mean by "higher sync flushing for compressed table"?
[6 Sep 2023 14:22] Przemyslaw Malkowski
Hi Jakub!
Thank you for looking into it.

>> similar redo log throughput

>Hm, how was that established? Could you show results of
>SELECT COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME IN ('log_lsn_current', 'log_lsn_last_checkpoint');
>before and after each test?

>The reason I ask is that I'd like to know whether the amount of redo log produced for compressed pages is similar to the amount for uncompressed pages.

OK, you are right; the redo log throughput was a bit higher with compression enabled. I'm attaching an example graph, but I'll try to gather exact metrics on another test.

>> The most visible difference is higher checkpoint age and sync flushing for the compressed table.

>Looking at the lower chart in FLushing_compression.png it looks to me that the red line (described as Sync Flushing) is actually higher for the later run (described as compression=none).
>What measurement do you mean by "higher sync flushing for compressed table"?

You need to read the graphs by areas, not lines. The red area represents sync flushing, and the green area adaptive flushing. Attaching the graphs again with only a single metric enabled each, for a clean view.
[6 Sep 2023 14:22] Przemyslaw Malkowski
adaptive only

Attachment: FLushing_adaptive.png (image/png, text), 33.32 KiB.

[6 Sep 2023 14:22] Przemyslaw Malkowski
sync only

Attachment: FLushing_sync.png (image/png, text), 30.55 KiB.

[7 Sep 2023 10:32] Jakub Lopuszanski
OK, sorry for misinterpreting the chart; it now makes sense and looks more serious.

A couple of ideas for prodding this a bit more:
1. Have you tried COMPRESSION="zlib" instead of "lz4"? It could have a different speed/size profile and hence hint at whether the problem lies more with one or the other.
2. Page Compression works by compressing the image of the page, then writing the compressed image at the expected offset (say page_no * 16KB), and finally "punching a hole" for the unused part of it (say 16KB minus the compressed size). Each step (compression, write, punching a hole) could potentially be the reason for the slowdown. The performance of punching holes might depend on the file system being used. What fs are you using? Can you try a different one?
3. What compression ratio are you getting? It should be visible with `ls` and `du`, see https://unix.stackexchange.com/a/436318/88901 (a server-side alternative is sketched right after this list).
4. There are some "curious interactions" between compression and encryption. I just want to confirm you are not using ENCRYPTION="Y".
5. The write path requires writing the page twice due to double-write buffering, and the code tries hard to avoid compressing the same page twice by reusing the compressed image. But perhaps there's something wrong in this area. Could you check how COMPRESSION impacts your numbers depending on the https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_doublewrite setting?
6. It seems optimal to match the number of page cleaner threads https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_page_cleaners with the number of buffer pool instances https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_buffer_pool_i... . Puzzlingly, one has a default value of 4 while the other has 8. Consider checking the impact of making them both 4 or both 8.
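Regarding point 3, a server-side alternative to `ls`/`du` (a minimal sketch, assuming the tablespace name test/test_1 from this report): INFORMATION_SCHEMA.INNODB_TABLESPACES reports both the apparent file size and the bytes actually allocated on disk, so the punched holes show up in the difference:
```
-- FILE_SIZE is the apparent size (as shown by `ls`), ALLOCATED_SIZE the bytes
-- actually allocated on disk (as shown by `du`); their quotient approximates
-- the effective page-compression ratio.
SELECT NAME, FILE_SIZE, ALLOCATED_SIZE,
       FILE_SIZE / ALLOCATED_SIZE AS ratio
  FROM INFORMATION_SCHEMA.INNODB_TABLESPACES
 WHERE NAME = 'test/test_1';
```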
[7 Sep 2023 16:08] Jakub Lopuszanski
OK, I have a theory based on a producer-consumer imbalance between the redo log writer and the page cleaners, caused by compressed pages using up less IO bandwidth while still technically counting as 1 IO operation each. This frees up bandwidth for the redo log writer, which means mini-transactions can now rush ahead much faster than the page cleaners can advance the checkpoint, because the page cleaners are tied to innodb_io_capacity IO operations per second (well, as we see on the chart, they actually exceed it due to various other mechanisms, but they are still affected by the config) while the redo log writer isn't (AFAIR). Under this theory, to let the page cleaners benefit from shorter IO operations by firing more of them, we should raise the innodb_io_capacity limit.
Hence the question:

7. What happens if you change both innodb_io_capacity and innodb_io_capacity_max to 10k instead of 1k?
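(Both variables are dynamic, so this can be tried at runtime; a minimal sketch, raising the max first so that innodb_io_capacity never exceeds innodb_io_capacity_max:)
```
SET GLOBAL innodb_io_capacity_max = 10000;
SET GLOBAL innodb_io_capacity = 10000;
```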
[7 Sep 2023 16:14] Jakub Lopuszanski
Przemysław, earlier you mentioned:

> OK, you are right; the redo log throughput was a bit higher with compression enabled. I'm attaching an example graph, but I'll try to gather exact metrics on another test.

but I don't see any graph of redo log throughput. It would be nice to see such a graph to test the theory behind my 7th question.
[8 Sep 2023 13:56] Przemyslaw Malkowski
Hi Jakub!

Thanks for diving deep into this!

I repeated my test using mysqlslap this time, as the random-data tool has severe performance limits of its own.

1. When I checked with ZLIB, the situation was even worse. 

2. I am testing with EXT4, but I currently don't have a test environment where I could test something else.

3. Compression ratio for LZ4:
$ du -h msb_8_0_33/data/test/test_1.ibd
3.8G	msb_8_0_33/data/test/test_1.ibd

$ ls -lh msb_8_0_33/data/test/test_1.ibd 
-rw-r----- 1 przemek przemek 7.2G Sep  8 15:26 msb_8_0_33/data/test/test_1.ibd

4. No encryption used
5. I don't think disabling double-write is a good idea for production. Nonetheless, with innodb_doublewrite=OFF the test runs slightly faster, but the amount of sync flushing is very similar.

6. Using 8 innodb page cleaners (and 8 BP instances) does not seem to bring any improvement at all.

7. I don't have a fast enough disk for 10k, but changing both to 5k only allowed the test to run faster; a very similar sync flushing pattern still happens.
More IO capacity definitely makes adaptive flushing more aggressive at levels where the checkpoint age is far from the maximum. But it does not seem to have much impact on the flushing rate at the top of the checkpoint age spike.

I will attach a more complete dump of the InnoDB details dashboard from the last three tests, as well as corresponding details in the txt file. I hope this helps!
[8 Sep 2023 13:58] Przemyslaw Malkowski
Graphs from 3 tests, LZ4, ZLIB and none

Attachment: PMM11.pdf (application/pdf, text), 584.71 KiB.

[8 Sep 2023 13:59] Przemyslaw Malkowski
Test details

Attachment: tests_mysqlslap_1.txt (text/plain), 7.95 KiB.

[8 Sep 2023 14:00] Przemyslaw Malkowski
insert script

Attachment: insert1.sql (application/sql, text), 88.97 KiB.

[11 Sep 2023 18:29] Jakub Lopuszanski
Hello Przemysław,
thanks for the new portion of interesting data.

I was a bit puzzled about the apparent contradiction between

> 6. Using 8 innodb page cleaners (and 8 BP instances) does not seem to bring any improvement at all.

vs

> 7. [...] changing both to 5k only allowed the test to run faster [...]

until I realised I had completely forgotten the (obvious in retrospect) fact that the io capacity limit is shared by all page cleaner threads, so simply spawning more of them wouldn't let one write much faster.

Thus, I think it might be interesting to see what happens if you combine 6&7, i.e. run with more page cleaners WITH larger io capacity (and io capacity max).

As for 
> I don't have a fast enough disk for 10k
I don't think it matters what disk you have. This is just some number used by some heuristics. Setting it to a value larger than supported by your drive, AFAICT, will not cause any problems in this scenario, but it will achieve the goal of virtually eliminating any artificial limits on using it to the maximum, which I wanted to see happening.

Also, it looks a bit strange that PMM11.pdf shows "InnoDB IO Capacity Max" configured to "2000", yet the third run achieved 2.5K ops/s in adaptive flushing. While I am aware of bugs in the flush heuristics (see Bug#34543464 "adaptive flushing works completely different than documented"), I am not sure what causes this particular behaviour.

W.r.t. tests_mysqlslap_1.txt, it looks like in the 2nd and 3rd runs the LSN range you look at covers the execution of ALTER and TRUNCATE. Honestly, I am not sure how much redo log they produce (in particular, note the truncate took over 1 second, so it plausibly could have done some non-trivial work). To be sure, it is better to select log_lsn_current and log_lsn_last_checkpoint right before mysqlslap.
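For example, the same INNODB_METRICS query as before, with NAME added so the before/after values are unambiguous:
```
-- Run once right before mysqlslap and once right after checkpoint age returns to 0;
-- redo produced by the run = log_lsn_current(after) - log_lsn_current(before).
SELECT NAME, COUNT
  FROM INFORMATION_SCHEMA.INNODB_METRICS
 WHERE NAME IN ('log_lsn_current', 'log_lsn_last_checkpoint');
```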

Also, one interpretation of what we see in the charts is that adaptive flushing ramps up too slowly and to too low a level in the compression case.
Thus it might be interesting to see what happens if you:
8. set innodb_flushing_avg_loops to 1 (so that the freshest estimates are used by heuristics each time)
9. set innodb_adaptive_flushing_lwm to 0 (so that it starts from 0%)
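(Both are dynamic, so for a quick experiment something like the following should do; the comments note the defaults:)
```
SET GLOBAL innodb_flushing_avg_loops = 1;    -- default 30
SET GLOBAL innodb_adaptive_flushing_lwm = 0; -- default 10 (percent of redo log capacity)
```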
[14 Sep 2023 11:53] Przemyslaw Malkowski
Hello Jakub,

OK, I tested again with:
| innodb_io_capacity     | 10000 |
| innodb_io_capacity_max | 10000 |
| innodb_buffer_pool_instances   | 8     |
| innodb_page_cleaners | 8     |

I don't see any significant change when testing uncompressed vs. LZ4 - still, the checkpoint age grows faster, and sync flushing kicks in with compression.

Sorry about the confusion w.r.t. the log LSN. Now I made sure to read it right before the mysqlslap run and right after (as soon as checkpoint age is back to 0).
The results are:
- no compression - 9.407 GB
- LZ4 - 9.446 GB

After setting innodb_flushing_avg_loops=1 & innodb_adaptive_flushing_lwm=0, there is only a slight difference: slightly sharper adaptive flushing and a bit shorter sync flushing for the compressed table. But again, checkpoint age goes far higher than in the same test with an uncompressed table.

Having said that, in all my tests, regardless of additional tuning, checkpoint age climbs faster to levels that impact the server's throughput stability when page compression is used.
Is that what you observe in your tests as well?
[14 Sep 2023 11:56] Przemyslaw Malkowski
Tests with page_cleaners=BP instances, and io_capacity at 10k; first run is no compression, second run LZ4 compression

Attachment: FLushing_none_vs_compression_high_iocapacity.png (image/png, text), 209.36 KiB.

[21 Sep 2023 9:26] MySQL Verification Team
Hi Mr. Malkowski,

Your bug report is very interesting.

Can you provide us with a test case with a lower number of rows, with configuration changes that would lead to the same behaviour?

Since we use only our own tools, we would appreciate it if you uploaded a full dump of the table, possibly with a smaller number of rows.

We are very interested in analysing and fixing this, but we do need your further feedback.
[25 Sep 2023 14:02] Przemyslaw Malkowski
Hello,

I think there is a little confusion here. I provided a simplified test case later in the comments. You don't need any 3rd-party tools - only mysqlslap. Also, the initial table is empty (no dump needed). I used enough loops via the --number-of-queries option to see the behavior better, but if for whatever reason you can't run the test that long, you should still be able to get similar behavior with less data inserted.

The test case is very simple - just create the table and populate it via mysqlslap using the example SQL file I attached, which is only 89 kB. Check out the attachments tests_mysqlslap_1.txt and insert1.sql.

Again, I used these custom settings and all others default:

innodb_buffer_pool_size = 4G
innodb_log_file_size=1G
innodb_monitor_enable = all
innodb_io_capacity = 1000
innodb_page_cleaners = 8
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=2
skip-log-bin

Test:

mysql > show create table test_1\G
*************************** 1. row ***************************
       Table: test_1
Create Table: CREATE TABLE `test_1` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `rId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `cId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `aId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `newV` longtext,
  `oldV` longtext,
  `uId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `acId` char(17) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `IncrId` bigint NOT NULL,
  `dTime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `IDX1` (`IncrId`),
  KEY `IDX2` (`aId`,`acId`),
  KEY `IDX3` (`aId`,`rId`),
  KEY `IDX4` (`aId`,`cId`,`rId`)
) ENGINE=InnoDB AUTO_INCREMENT=1428876 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION='LZ4'
1 row in set (0.02 sec)

mysql > select count(*) from test_1;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

$ mysqlslap --password=msandbox --host=127.0.0.1 --user=msandbox --port=8040 --delimiter=";" --number-of-queries=500000 --create-schema=test --concurrency=16 --query=insert1.sql
mysqlslap: [Warning] Using a password on the command line interface can be insecure.
Benchmark
	Average number of seconds to run all queries: 210.651 seconds
	Minimum number of seconds to run all queries: 210.651 seconds
	Maximum number of seconds to run all queries: 210.651 seconds
	Number of clients running queries: 16
	Average number of queries per client: 31250

mysql > select count(*) from test_1;
+----------+
| count(*) |
+----------+
| 12500000 |
+----------+
1 row in set (11.01 sec)

The second test with the table changed to uncompressed via:

alter table test_1 compression="none";
truncate table test_1;

Is the test case doable now?
[25 Sep 2023 14:10] Przemyslaw Malkowski
Small correction to the previous reply. With innodb_io_capacity = 10000, the difference is also visible.
[29 Sep 2023 14:35] MySQL Verification Team
Hi Mr. Malkowski,

This is just to inform you that we are now loading your data.

When we finish that, we shall have to install a special tool that will provide us with info on what is going on...
[4 Oct 2023 10:25] MySQL Verification Team
Hi,

Just to inform you that we have run your tests against our diagnostic tools and are now analysing the results...
[9 Oct 2023 10:20] Marcin Babij
Hi Przemysław,
Another theory is that you don't have enough CPU to handle all the compression requests at the required volume. LZ4 seems faster than ZLIB, so this would explain the "When I checked with ZLIB, the situation was even worse." that you saw, as well as the slower rate of flushing overall. I can't see any mention of CPU usage in this discussion, nor can I find any such graph or data in the files you've attached. We may have assumed that it is a powerful server with spare CPU capacity. Maybe that is not the case?
Could you please share CPU usage/load graphs?
[9 Oct 2023 10:21] Marcin Babij
Also, in any case, if you could share flame graphs generated with https://github.com/brendangregg/FlameGraph from `perf` profiling, that would be great!
[9 Oct 2023 10:23] MySQL Verification Team
Hi Mr. Malkowski,

We are waiting on your feedback.
[18 Oct 2023 12:07] Przemyslaw Malkowski
checkpoint graphs

Attachment: chkp_flushing.png (image/png, text), 58.01 KiB.

[18 Oct 2023 12:07] Przemyslaw Malkowski
CPU utilization during the same test

Attachment: CPU.png (image/png, text), 38.78 KiB.

[18 Oct 2023 12:09] Przemyslaw Malkowski
Hi Marcin, MySQL Verification Team,

When I run a stress test to push the checkpoint age as high as possible, it is expected that with the number of write threads equal to the number of CPU cores, overall CPU utilization will be high.
Having said that, CPU utilization is high in both the compressed and uncompressed tests, but the checkpoint age is definitely higher for the page compression test.
I am sending new graphs to illustrate this better, plus the flame graphs you requested.

The difference in checkpoint age is also present when I use fewer write threads to insert data (i.e. half of the total CPU capacity is idle), but it is less likely to reach sync flushing this way.
[18 Oct 2023 12:13] MySQL Verification Team
Hi Mr. Malkowski,

Thank you for your feedback.

Did we understand you correctly that you have 16 cores on that machine?

Also, we are doing some quite thorough testing of our own, which we shall share with you later, when we are finished.
[18 Oct 2023 12:43] Przemyslaw Malkowski
Yes, my test machine has 16 CPU cores.
[18 Oct 2023 17:32] Jakub Lopuszanski
Hi Przemysław! 
Thank you very much for flame graphs!
Inspecting them carefully, I've noticed that the ib_pg_flush-1 ... ib_pg_flush-8 threads spend much more time in `__sched_yield`.
This suggests that they have trouble dequeuing a double-write buffer segment from the pool of free segments:
```
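  // Spin, yielding the CPU on each iteration, until a free double-write
  // segment can be dequeued from the pool.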
  while (!segments->dequeue(batch_segment)) {
    std::this_thread::yield();
  }
```
The reason it might be more difficult to find an available segment when compression is enabled is that each segment stays in use for a longer period of time, due to two factors:
1. LZ4_compress_fast is called while the segment is already in use
2. After a page is written to a tablespace file, the io completer thread has to "punch a hole" in it before it informs the double-write buffer system that another page from the batch has completed. Only when all writes are acknowledged can the segment be returned to the pool. So these additional syscalls to punch holes defer the moment the segment can be reused.

(For the record: I suspect that 1 matters less than 2 here because, at a hand-waving level, 1 is done in the context of a page cleaner thread, so each page cleaner can "extend the use" of just one segment this way and hurts only itself. OTOH, 2 is done in the "background" with no page cleaner involved, and the longer it takes, the longer page cleaners need to wait for a free segment.)

This theory predicts that it would help to increase the number of segments in the pool relative to the number of page cleaners.
(It also retrodicts why increasing the number of page cleaners from the default 4 to 8 didn't improve things much: the computed number of segments changed from 5 to 9 because of that change, and 5/4 is actually a better ratio than 9/8.)

Unfortunately, there is no explicit sysvar one could use to fine-tune the number of segments directly in a reasonable and ergonomic way.

But, carefully analysing the math expressions used to compute the number of segments, I've figured out how to exploit "rounding" in the division operation to raise the number of segments from 9 to 16.

Basically, assuming that you are using 
```
--innodb_buffer_pool_instances=8 
--innodb_buffer_pool_size=4G 
```
(so that the number of BP instances is indeed 8) all you have to do is to additionally set:
```
--innodb_doublewrite_files=16
```
This way the logic in `dberr_t dblwr::open()` should end up establishing:
```
s_files.size() == 16
segments_per_file == 2
```

Of these 16 files, half will be used for "LRU flushing" and the other half for "flush_list flushing".
So, you should have 16 segments available to 8 page cleaners, which is hopefully a substantial improvement over the 9 you had before.

1. Could you please rerun the experiment with --innodb_doublewrite_files=16?
2. Ideally, please collect flame graphs, so we can see whether the change has the predicted impact

If this turns out to help, then it will mean we have a bug in how the automatic configuration of double-write buffers picks parameters.
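(For reference, the effective settings can be read back at runtime with the query below; on disk, the double-write files themselves appear in the data directory named #ib_<page size>_<number>.dblwr.)
```
SELECT @@innodb_doublewrite, @@innodb_doublewrite_files,
       @@innodb_doublewrite_pages, @@innodb_page_cleaners;
```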
[23 Oct 2023 13:58] Przemyslaw Malkowski
new examples with innodb_doublewrite_files tuning

Attachment: 2023-10-23_15-25.png (image/png, text), 88.90 KiB.

[23 Oct 2023 14:01] Przemyslaw Malkowski
Hi Jakub!

All right, I tested with innodb_doublewrite_files=2 (the default) vs innodb_doublewrite_files=16, but the difference is barely visible.
As a comparison, I ran the 3rd test with innodb_doublewrite_files=16 and no compression. See the graphs for how checkpoint age was handled in those three cases.
Attached are also flame graphs. My settings were:
mysql > select @@version,@@innodb_buffer_pool_size,@@innodb_doublewrite_files,@@innodb_buffer_pool_instances,@@innodb_io_capacity\G
*************************** 1. row ***************************
                     @@version: 8.0.34
     @@innodb_buffer_pool_size: 4294967296
    @@innodb_doublewrite_files: 16
@@innodb_buffer_pool_instances: 8
          @@innodb_io_capacity: 8000
[23 Oct 2023 14:04] Przemyslaw Malkowski
innodb_doublewrite_files: 2, LZ4

Attachment: 8_0_34_lz4_dblwr-2.svg (image/svg+xml, text), 388.79 KiB.

[23 Oct 2023 14:06] Przemyslaw Malkowski
innodb_doublewrite_files: 16, LZ4

Attachment: 8_0_34_lz4_dblwr-16.svg (image/svg+xml, text), 365.56 KiB.

[23 Oct 2023 14:07] Przemyslaw Malkowski
innodb_doublewrite_files: 16, no compression

Attachment: 8_0_34_dblwr-16-nocompr.svg (image/svg+xml, text), 339.21 KiB.

[24 Oct 2023 15:56] Jakub Lopuszanski
Ok, so the "good news" is that the change to 16 files indeed made the __sched_yield go away.

The "bad news" are two:
- it didn't influence checkpoint age
- after reading buf0dblwr.cc a bit more, I am no longer sure how 8 page cleaner threads could have used up all 9 segments, because AFAIU the logic, flush_to_disk() calls wait_for_pending_batch() to wait for the previous batch to finish, so one page cleaner should never use more than one segment.

I am also aware that the lock-free queue has a design flaw in that it can "spuriously" report it is empty even though it contains some elements, in the case where a thread that was inserting an element got scheduled out before it finalised the insert - but that "bug" shouldn't really be affected by changing 9 to 16, so I am really puzzled about what's going on here.

The only explanation I have right now is that there is a small race: `Double_write::write_complete` calls `batch_segment->completed()`, which calls `m_dblwr->batch_completed()`, which does `m_batch_running.store(false, std::memory_order_release)`, for which `wait_for_pending_batch()` awaits. Then `flush_to_disk()` calls `write_pages()`, which calls `write_dblwr_pages()`, which attempts `segments->dequeue()` - but this may be too early, because even though `Double_write::write_complete` has already called `batch_segment->completed()`, it hasn't yet returned this batch_segment back to the queue! It still has to call `fil_flush_file_spaces()` before doing so, because only when the data is safely persisted to the destination tablespaces is it safe to reuse (overwrite) the same segment of the double-write buffer. The puzzling part here (perhaps a bug?) is why we call `batch_segment->completed()` before `fil_flush_file_spaces()` at all - this seems premature. And it has been like that since the initial commit f6903aac8f1c.

Meanwhile, I was consulting this issue with the team, and Dimitri suggested that the double-write buffer might indeed be the bottleneck, not just because of the number of files, but because of the too-small number of pages in each segment.
Could you please check what happens if you set innodb_doublewrite_pages=256?
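(Note: unlike innodb_io_capacity, innodb_doublewrite_pages is not a dynamic sysvar, so it has to be set at startup, analogously to the earlier suggestion:)
```
--innodb_doublewrite_pages=256
```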

However I don't have big expectations given that you've already reported:
> 5. I don't think disabling double-write is a good idea for production. Nonetheless, with innodb_doublewrite=OFF the test runs slightly faster, but the amount of sync flushing is very similar.

Also, Dimitri was able to reproduce the issue that we also see in your flamegraphs, namely that ut_delay() in Buf_fetch<Buf_fetch_normal>::single_page is consuming a lot of time.
But it is not clear to me how this impacts the story given that all three flamegraphs show large CPU waste here:
49.74% no compression
49.44% lz4 16 files
48.35% lz4 2 files
[30 Oct 2023 14:13] Przemyslaw Malkowski
Hello Jakub,

I can confirm that using innodb_doublewrite_pages=256 vs the default (4) indeed made the difference!
With that change, the test runs almost as fast as without compression, and the checkpoint age is much more stable; I can no longer reproduce the peak close to the maximum. I used the following settings:
mysql > select @@version,@@innodb_buffer_pool_size,@@innodb_doublewrite_files,@@innodb_buffer_pool_instances,@@innodb_io_capacity,@@innodb_doublewrite_pages\G
*************************** 1. row ***************************
                     @@version: 8.0.34
     @@innodb_buffer_pool_size: 4294967296
    @@innodb_doublewrite_files: 16
@@innodb_buffer_pool_instances: 8
          @@innodb_io_capacity: 8000
    @@innodb_doublewrite_pages: 256
1 row in set (0.00 sec)

So, I think we can call it a workaround! Thank you for all the efforts; really helpful!