MySQL Bugs: #112717: Undo log truncation does not finalize when super_read

Bug #112717	Undo log truncation does not finalize when super_read_only=1
Submitted:	13 Oct 2023 15:14	Modified:	25 Oct 2023 8:10
Reporter:	Przemyslaw Malkowski	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	8.0.34	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	super_read_only, undo space

Description:
When a replica runs with super_read_only=1, and hits a condition to trigger implicit truncation of undo logs (innodb_undo_log_truncate=ON), the process does not finish and leaves the instance in a weird state.

For example, the only message printed in error log:
2023-10-13T14:44:32.944934Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_001
2023-10-13T14:46:28.981665Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_002

And truncation logs are kept on disk:

slave2 [localhost:22437] {msandbox} ((none)) > \! ls -lh rsandbox_8_0_34/node2/data/undo*
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_001
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_002
-rw-r----- 1 przemek przemek 16K Oct 13 16:44 rsandbox_8_0_34/node2/data/undo_1_trunc.log
-rw-r----- 1 przemek przemek 16K Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_2_trunc.log

How to repeat:
Setup simple async replication and configure the replica with:
super_read_only = 1
innodb_max_undo_log_size = 20M

Prepare example tables with data on the source and then start consistent read transaction on the replica, to make the undo space to grow.
Run write workload on the same tables the transaction use.
Eventually, once the undo spaces increase beyond the max size, commit the transaction, so that after a while purging thread will start truncating undo spaces.
I used sysbench with oltp_update_index.lua on four tables, and did this on replica:
slave2 [localhost:22437] {msandbox} ((none)) > select @@version,@@super_read_only;
+-----------+-------------------+
| @@version | @@super_read_only |
+-----------+-------------------+
| 8.0.34    |                 1 |
+-----------+-------------------+
1 row in set (0.00 sec)

begin;
select * from sbtest1 limit 1;
select * from sbtest2 limit 1;
select * from sbtest3 limit 1;
select * from sbtest4 limit 1;

Initially, the undo spaces were:
slave2 [localhost:22437] {msandbox} (db1) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME            | STATE  | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active |  16777216 |       16777216 |
| innodb_undo_002 | active |  16777216 |       16777216 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)

After a while of sysbench run, they grew to:

slave2 [localhost:22437] {msandbox} (db1) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME            | STATE  | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active |  33554432 |       33554432 |
| innodb_undo_002 | active |  50331648 |       50331648 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)

Then, I committed the transaction. 
To speed up truncation, you can do:
set global innodb_purge_rseg_truncate_frequency=1;

After a moment, the "Truncating UNDO tablespace innodb_undo_001" messages appear in error log, and the state changed to:

slave2 [localhost:22437] {msandbox} ((none)) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME            | STATE  | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active |         0 |              0 |
| innodb_undo_002 | active |         0 |              0 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)

slave2 [localhost:22437] {msandbox} ((none)) > select name,count,status from information_schema.innodb_metrics where name like "%undo%";
+-----------------------------------+--------+---------+
| name                              | count  | status  |
+-----------------------------------+--------+---------+
| buffer_page_read_undo_log         |   6223 | enabled |
| buffer_page_written_undo_log      |   8514 | enabled |
| trx_undo_slots_used               |    379 | enabled |
| trx_undo_slots_cached             |    -69 | enabled |
| purge_undo_log_pages              |  88369 | enabled |
| undo_truncate_count               |      2 | enabled |
| undo_truncate_start_logging_count |      2 | enabled |
| undo_truncate_done_logging_count  |      0 | enabled |
| undo_truncate_usec                | 414642 | enabled |
+-----------------------------------+--------+---------+
9 rows in set (0.00 sec)

The truncate logs are there the whole time:
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_001
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_002
-rw-r----- 1 przemek przemek 16K Oct 13 16:44 rsandbox_8_0_34/node2/data/undo_1_trunc.log
-rw-r----- 1 przemek przemek 16K Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_2_trunc.log

And no message about finishing the truncation process!

slave2 [localhost:22437] {msandbox} ((none)) > show variables like '%undo%';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| innodb_max_undo_log_size | 20971520 |
| innodb_undo_directory    | ./       |
| innodb_undo_log_encrypt  | OFF      |
| innodb_undo_log_truncate | ON       |
| innodb_undo_tablespaces  | 2        |
+--------------------------+----------+
5 rows in set (0.00 sec)

Upon replica restart, I can see:

2023-10-13T15:09:53.623125Z 1 [Note] [MY-012902] [InnoDB] Undo tablespace number 1 was being truncated when mysqld quit.
2023-10-13T15:09:53.623236Z 1 [Note] [MY-013252] [InnoDB] Using undo tablespace './undo_001'.
2023-10-13T15:09:53.623278Z 1 [Note] [MY-012902] [InnoDB] Undo tablespace number 2 was being truncated when mysqld quit.
2023-10-13T15:09:53.624579Z 1 [Note] [MY-013252] [InnoDB] Using undo tablespace './undo_002'.
2023-10-13T15:09:53.624817Z 1 [Note] [MY-013040] [InnoDB] Will create 2 new undo tablespaces.
2023-10-13T15:09:53.624858Z 1 [Note] [MY-012915] [InnoDB] Created 2 undo tablespaces.
...
2023-10-13T15:09:53.862703Z 1 [Note] [MY-012904] [InnoDB] Reconstructing undo tablespace number 1.
2023-10-13T15:09:53.865381Z 1 [Note] [MY-012896] [InnoDB] Creating UNDO Tablespace ./undo_001
2023-10-13T15:09:53.865401Z 1 [Note] [MY-012897] [InnoDB] Setting file ./undo_001 size to 16 MB
2023-10-13T15:09:53.865409Z 1 [Note] [MY-012898] [InnoDB] Physically writing the file full
2023-10-13T15:09:53.897512Z 1 [Note] [MY-013018] [InnoDB] Created 128 and tracked 128 new rollback segment(s) in undo tablespace number 1. 128 are now active.
2023-10-13T15:09:54.144964Z 1 [Note] [MY-012904] [InnoDB] Reconstructing undo tablespace number 2.
2023-10-13T15:09:54.147574Z 1 [Note] [MY-012896] [InnoDB] Creating UNDO Tablespace ./undo_002
2023-10-13T15:09:54.147604Z 1 [Note] [MY-012897] [InnoDB] Setting file ./undo_002 size to 16 MB
2023-10-13T15:09:54.147612Z 1 [Note] [MY-012898] [InnoDB] Physically writing the file full
2023-10-13T15:09:54.175132Z 1 [Note] [MY-013018] [InnoDB] Created 128 and tracked 128 new rollback segment(s) in undo tablespace number 2. 128 are now active.

This issue can trigger clone plugin failure later, as reported in https://bugs.mysql.com/bug.php?id=112714

Suggested fix:
The super_read_only setting should not prevent the replica from doing implicit undo log truncations.

Hi Mr. Malkowski,

Thank you for your bug report.

However, it seems that this is a report about problems with replication.

Are we correct ???

In that case, please change the category of this report.

Hi MySQL Verification Team,

No, it is not about the replication problem exactly. Replication is working just fine; no errors there. Even when "Truncating UNDO tablespace" starts and gets stuck, applying replicated events keeps working.

I was reproducing this on a replica since this is the only way to have the InnoDB engine do writes when super_read_only is enabled.
The problem is that truncating undo space gets stuck.
Again, this is how it works with super_read_only=0:

2023-10-13T17:43:41.744680Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_001
2023-10-13T17:43:41.921587Z 0 [Note] [MY-013000] [InnoDB] Completed truncate of undo tablespace innodb_undo_001.

+-----------------+--------+-----------+----------------+
| NAME            | STATE  | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active |  33554432 |       33554432 |
| innodb_undo_002 | active |  50331648 |       50335744 |
+-----------------+--------+-----------+----------------+

And no undo_1_trunc.log is left on the disk after this.

If I keep testing but with super_read_only=1, the next truncating looks like this:
2023-10-13T17:45:37.829686Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_002
2023-10-13T17:47:31.495367Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_001

+-----------------+--------+-----------+----------------+
| NAME            | STATE  | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active |         0 |              0 |
| innodb_undo_002 | active |         0 |              0 |
+-----------------+--------+-----------+----------------+

slave2 [localhost:22437] {msandbox} (db1) > \! ls -lh rsandbox_8_0_34/node2/data/undo*
-rw-r----- 1 przemek przemek 16M Oct 13 19:47 rsandbox_8_0_34/node2/data/undo_001
-rw-r----- 1 przemek przemek 16M Oct 13 19:48 rsandbox_8_0_34/node2/data/undo_002
-rw-r----- 1 przemek przemek 16K Oct 13 19:47 rsandbox_8_0_34/node2/data/undo_1_trunc.log
-rw-r----- 1 przemek przemek 16K Oct 13 19:45 rsandbox_8_0_34/node2/data/undo_2_trunc.log

And it stays stuck like that until I restart the service.

Hi Mr. Malkowski,

We are changing the category to replication, since if what you observe can be exhibited only on replica, it has to be reproduced with a team that is using replication for bug verification. This does not change the fact that the problem is in InnoDB storage engine.

Thank you for your patience.

Hi,

Thank you for the report