Description:
When a replica runs with super_read_only=1, and hits a condition to trigger implicit truncation of undo logs (innodb_undo_log_truncate=ON), the process does not finish and leaves the instance in a weird state.
For example, the only message printed in error log:
2023-10-13T14:44:32.944934Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_001
2023-10-13T14:46:28.981665Z 0 [Note] [MY-012994] [InnoDB] Truncating UNDO tablespace innodb_undo_002
And truncation logs are kept on disk:
slave2 [localhost:22437] {msandbox} ((none)) > \! ls -lh rsandbox_8_0_34/node2/data/undo*
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_001
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_002
-rw-r----- 1 przemek przemek 16K Oct 13 16:44 rsandbox_8_0_34/node2/data/undo_1_trunc.log
-rw-r----- 1 przemek przemek 16K Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_2_trunc.log
How to repeat:
Setup simple async replication and configure the replica with:
super_read_only = 1
innodb_max_undo_log_size = 20M
Prepare example tables with data on the source and then start consistent read transaction on the replica, to make the undo space to grow.
Run write workload on the same tables the transaction use.
Eventually, once the undo spaces increase beyond the max size, commit the transaction, so that after a while purging thread will start truncating undo spaces.
I used sysbench with oltp_update_index.lua on four tables, and did this on replica:
slave2 [localhost:22437] {msandbox} ((none)) > select @@version,@@super_read_only;
+-----------+-------------------+
| @@version | @@super_read_only |
+-----------+-------------------+
| 8.0.34 | 1 |
+-----------+-------------------+
1 row in set (0.00 sec)
begin;
select * from sbtest1 limit 1;
select * from sbtest2 limit 1;
select * from sbtest3 limit 1;
select * from sbtest4 limit 1;
Initially, the undo spaces were:
slave2 [localhost:22437] {msandbox} (db1) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME | STATE | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active | 16777216 | 16777216 |
| innodb_undo_002 | active | 16777216 | 16777216 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)
After a while of sysbench run, they grew to:
slave2 [localhost:22437] {msandbox} (db1) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME | STATE | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active | 33554432 | 33554432 |
| innodb_undo_002 | active | 50331648 | 50331648 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)
Then, I committed the transaction.
To speed up truncation, you can do:
set global innodb_purge_rseg_truncate_frequency=1;
After a moment, the "Truncating UNDO tablespace innodb_undo_001" messages appear in error log, and the state changed to:
slave2 [localhost:22437] {msandbox} ((none)) > SELECT NAME, STATE, FILE_SIZE, ALLOCATED_SIZE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE NAME LIKE '%undo%';
+-----------------+--------+-----------+----------------+
| NAME | STATE | FILE_SIZE | ALLOCATED_SIZE |
+-----------------+--------+-----------+----------------+
| innodb_undo_001 | active | 0 | 0 |
| innodb_undo_002 | active | 0 | 0 |
+-----------------+--------+-----------+----------------+
2 rows in set (0.00 sec)
slave2 [localhost:22437] {msandbox} ((none)) > select name,count,status from information_schema.innodb_metrics where name like "%undo%";
+-----------------------------------+--------+---------+
| name | count | status |
+-----------------------------------+--------+---------+
| buffer_page_read_undo_log | 6223 | enabled |
| buffer_page_written_undo_log | 8514 | enabled |
| trx_undo_slots_used | 379 | enabled |
| trx_undo_slots_cached | -69 | enabled |
| purge_undo_log_pages | 88369 | enabled |
| undo_truncate_count | 2 | enabled |
| undo_truncate_start_logging_count | 2 | enabled |
| undo_truncate_done_logging_count | 0 | enabled |
| undo_truncate_usec | 414642 | enabled |
+-----------------------------------+--------+---------+
9 rows in set (0.00 sec)
The truncate logs are there the whole time:
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_001
-rw-r----- 1 przemek przemek 16M Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_002
-rw-r----- 1 przemek przemek 16K Oct 13 16:44 rsandbox_8_0_34/node2/data/undo_1_trunc.log
-rw-r----- 1 przemek przemek 16K Oct 13 16:46 rsandbox_8_0_34/node2/data/undo_2_trunc.log
And no message about finishing the truncation process!
slave2 [localhost:22437] {msandbox} ((none)) > show variables like '%undo%';
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| innodb_max_undo_log_size | 20971520 |
| innodb_undo_directory | ./ |
| innodb_undo_log_encrypt | OFF |
| innodb_undo_log_truncate | ON |
| innodb_undo_tablespaces | 2 |
+--------------------------+----------+
5 rows in set (0.00 sec)
Upon replica restart, I can see:
2023-10-13T15:09:53.623125Z 1 [Note] [MY-012902] [InnoDB] Undo tablespace number 1 was being truncated when mysqld quit.
2023-10-13T15:09:53.623236Z 1 [Note] [MY-013252] [InnoDB] Using undo tablespace './undo_001'.
2023-10-13T15:09:53.623278Z 1 [Note] [MY-012902] [InnoDB] Undo tablespace number 2 was being truncated when mysqld quit.
2023-10-13T15:09:53.624579Z 1 [Note] [MY-013252] [InnoDB] Using undo tablespace './undo_002'.
2023-10-13T15:09:53.624817Z 1 [Note] [MY-013040] [InnoDB] Will create 2 new undo tablespaces.
2023-10-13T15:09:53.624858Z 1 [Note] [MY-012915] [InnoDB] Created 2 undo tablespaces.
...
2023-10-13T15:09:53.862703Z 1 [Note] [MY-012904] [InnoDB] Reconstructing undo tablespace number 1.
2023-10-13T15:09:53.865381Z 1 [Note] [MY-012896] [InnoDB] Creating UNDO Tablespace ./undo_001
2023-10-13T15:09:53.865401Z 1 [Note] [MY-012897] [InnoDB] Setting file ./undo_001 size to 16 MB
2023-10-13T15:09:53.865409Z 1 [Note] [MY-012898] [InnoDB] Physically writing the file full
2023-10-13T15:09:53.897512Z 1 [Note] [MY-013018] [InnoDB] Created 128 and tracked 128 new rollback segment(s) in undo tablespace number 1. 128 are now active.
2023-10-13T15:09:54.144964Z 1 [Note] [MY-012904] [InnoDB] Reconstructing undo tablespace number 2.
2023-10-13T15:09:54.147574Z 1 [Note] [MY-012896] [InnoDB] Creating UNDO Tablespace ./undo_002
2023-10-13T15:09:54.147604Z 1 [Note] [MY-012897] [InnoDB] Setting file ./undo_002 size to 16 MB
2023-10-13T15:09:54.147612Z 1 [Note] [MY-012898] [InnoDB] Physically writing the file full
2023-10-13T15:09:54.175132Z 1 [Note] [MY-013018] [InnoDB] Created 128 and tracked 128 new rollback segment(s) in undo tablespace number 2. 128 are now active.
This issue can trigger clone plugin failure later, as reported in https://bugs.mysql.com/bug.php?id=112714
Suggested fix:
The super_read_only setting should not prevent the replica from doing implicit undo log truncations.