MySQL Bugs: #110663: Pagetracking return null data if checkpoint LSN is less than page tracking start

Bug #110663	Pagetracking return null data if checkpoint LSN is less than page tracking start
Submitted:	12 Apr 2023 3:50	Modified:	17 Apr 2023 10:59
Reporter:	Rahul Malik	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0.32	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution, mysqlbackup

Description:
Background
When we start page tracking,  it is started from the current system LSN A. This means all the pages that are flushed after that are tracked. 
After some time If I am interested in knowing the count of modified pages at the time I started page tracking till the current moment(current checkpoint LSN B) .  MySQL provides an interface
Select mysqlbackup_page_track_get_changed_page_count( A,B); 
Where A is the LSN when the page tracking was started.  B is checkpoint LSN. If we provide B=0, B would take the current checkpoint LSN. 
Problem
MySQL has put a restriction that checkpoint LSN B > A.  For a system with a huge redo current system LSN and checkpoint LSN could be apart. 
A could be greater than B even after some time. 
If our interest in knowing all modified pages to the moment.  Why do we need this restriction? 
I tried removing the restriction. I can get changed pages successfully.  I checked the clone which also need pagetracking data and there is no restriction. 
Please see if there is a reason for this restriction or if we can remove that. 

How to repeat:
start page-tracking on the server where checkpoint LSN(B) is way behind the current system LSN. Assume it is page tracking is started at LSN A. so LSN A >> B

Afer some time if checkpoint LSN is progress to C. where A > C. 
we can not query page-tracking data unless A<= C. 

which is an artificial limit.

Suggested fix:
Remove this limitation with the attached patch.

Even if the pagetracking was started before the full backup once can not take an incremental backup until system checkpoint LSN is more than pagetracking start LSN. 

This would allow un necessary failure of incremental backups using pagetracking.

patch to remove restriction

Attachment: remove_condition_page_tracking.patch (application/octet-stream, text), 1.40 KiB.

Hello Rahul,

Thank you for the report and contribution.
Please ensure to re-send the patch via "Contribution" tab. Otherwise we would not be able to accept it.

regards,
Umesh

fixed (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: remove_condition_page_tracking.patch (application/octet-stream, text), 1.40 KiB.

The way I understand https://dev.mysql.com/blog-archive/innodb-clone-and-page-tracking/ is that Page Tracking observes the modifications to the page only when the page is eventually written to the disc, which happens later (in wall-clock time).
The "moment of modification" is defined by LSN.
When we ask "which pages were modified between A and B" the A and B are LSNs, and modifications are mini-transactions, which occur on the LSN axis.
The "moment of write to disc" is something which can happen a bit later in wall-clock time, but the implementation uses that (wall-clock) moment to learn about what has happened at some LSN.
In other words: the page can be modified at LSN X, and current "system LSN" is much larger then X, but the Page Tracker might still not know about that modification at all, because the page was not yet written back to disc.
The change is already made to the in-memory buffer of that page in the Buffer Pool, and was already redo logged at LSN=X, but the Page Tracker doesn't know it yet.
Eventually, a page cleaner will pick this page from a flush list (or LRU, or some other thread will do that), and write the page to disc.
At this moment (in wall-clock axis) the Page Tracker mechanism will learn, that the page was modified at X.
Or, more accurately, it will learn that the page's newest_modification_lsn (which is larger than X by our assumptions) is larger than the "start LSN" (the LSN at which tracking has started), so there was at least one new change made to this page since the tracking started (but the Page Tracker doesn't really learn the exact value of X).
(It also learns that previous newest_modification_lsn stored in that page in tablespace was smaller than "tracking LSN", so this is the first time Page Tracker is handling this page so it should report it (and will not have to report it for subsequent modifications).)
So, the page tracker can correctly answer the question "was this page modified in the window (start LSN, end LSN)", but only after all pages modified by mini-transactions from this range were already written back to disc.
If you ask this question too early, the answer must be "I don't know".
And how can we know if all changes up to "end LSN" were already written to disc?
This is how checkpoint LSN is defined: if checkpoint LSN >= end LSN, then we know all such changes were reflected back to tablespaces.

What am I missing?

My understanding is that once page tracking is started any page flushed after that is definitely tracked. 

From the blog post. 
Page being flushed thereon, whose page LSN is less than the tracking LSN, is tracked. The idea is that we do not want pages already tracked to be tracked again, and page LSN >= tracking LSN would mean that the page was already flushed and tracked after tracking was enabled since page LSN refers to the latest LSN on disk.

But do you mean the server has an additional check it only tracks pages whose LSN is more page-tracking start LSN? This looks opposite to the statement in the blog post which says it adds pages id to the list only if modified_LSN is less than page-tracking start LSN. Can you please explain? 

Let me give one example.  
Started the page tracking on the server. ( atm checkpoint LSN 80, page-tracking-start LSN 100)
Took the full backup. 
Started incremental backup( atm checkpoint LSN 90 )

Now my understanding is that all modified pages after the page-tracking start LSN(100) and up to checkpoint LSN 90 should be tracked and I can ask the server to provide me modified pages up to checkpoint LSN 90. 

Here server checkpoint LSN >= end LSN in this case the end LSN is 90. so we are asking only up to pages that are flushed. 
But the server has imposed a condition that end_LSN should be more than the page-tracking start LSN. 
But page-tracking start LSN is the system LSN. 

I understand if we have a condition to only track a page if it LSN > page-tracking start LSN we have to wait for until the checkpoint is more than the page-tracking start LSN. 
But my understanding the server starts tracking all the pages that are flushed after we start page tracking. 

Can you please confirm? 

Also, the clone works fine and does not require checkpoint LSN to cross page-tracking start LSN. This limitation is only for myqlbackup component.

> But do you mean the server has an additional check it only tracks pages whose LSN is more page-tracking start LSN? 

First of all, let me clarify that there are two different meanings of "page LSN" 
(1) the LSN of most recent change made to it in Buffer Pool (something which you can check in buf_page_t::newest_modification field in memory object), or 
(2) the LSN of the most recent change already saved to disc earlier (something you can check in the on-disc page's header at offset FIL_PAGE_LSN).

In my sentence
> Or, more accurately, it will learn that the page's newest_modification_lsn (which is larger than X by our assumptions) is larger than the "start LSN" (the LSN at which tracking has started), so there was at least one new change made to this page since the tracking started (but the Page Tracker doesn't really learn the exact value of X).
I've definitely meant the first thing.
I've now checked in our source code, and it looks like we are missing this optimization.
So, apparently I was wrong about this, sorry.

> This looks opposite to the statement in the blog post which says it adds pages id to the list only if modified_LSN is less than page-tracking start LSN. Can you please explain? 

In this sentence and in the blog post they use the second meaning (2), and I do not contradict this sentence. I actually also said the same thing in my sentence:
> (It also learns that previous newest_modification_lsn stored in that page in tablespace was smaller than "tracking LSN", so this is the first time Page Tracker is handling this page so it should report it (and will not have to report it for subsequent modifications).)
where I've definitely used this second meaning.

So there's no contradiction between the two conditions I've shared earlier, and one of them agrees with blogpost, it's just that the other one of them is apparently not used in our code, sorry about that.

Here's how I understand the code. When we write a page to disc, we will rewrite it on disc, and the header of the page on disc will change the LSN value at the FIL_PAGE_LSN offset from the old value to a new value.
I *thought* that the page tracker will check that the old value was smaller than the 'tracking LSN', and that the new value is larger than the 'tracking LSN', i.e. the sheep has jumped over the fence into the area we care about.
What *actually happens* is just the first part: the tracker will check that the old value was smaller than the 'tracking LSN'. So the sheep might have jumped forward, but we don't check how far, so it might have landed before the fence, yet we still count it for a reason which I've only understood today, thanks to you, and I'll share later.

Take a look at this fragment of the code in 
```
bool buf_flush_page(buf_pool_t *buf_pool, buf_page_t *bpage,
                    buf_flush_t flush_type, bool sync) {
...

    if (!fsp_is_system_temporary(bpage->id.space()) &&
        buf_pool->track_page_lsn != LSN_MAX) {
      auto frame = bpage->zip.data;

      if (frame == nullptr) {
        frame = ((buf_block_t *)bpage)->frame;
      }
      lsn_t frame_lsn = mach_read_from_8(frame + FIL_PAGE_LSN);

      arch_page_sys->track_page(bpage, buf_pool->track_page_lsn, frame_lsn,
                                false);
    }
...
    buf_flush_write_block_low(bpage, flush_type, sync);
```
The `buf_flush_write_block_low` will modify the FIL_PAGE_LSN field, by calling:
```
      buf_flush_init_for_writing(
          reinterpret_cast<const buf_block_t *>(bpage),
          reinterpret_cast<const buf_block_t *>(bpage)->frame,
          bpage->zip.data ? &bpage->zip : nullptr, bpage->get_newest_lsn(),
          fsp_is_checksum_disabled(bpage->id.space()),
          false /* do not skip lsn check */);
```
which does:
```
void buf_flush_init_for_writing(const buf_block_t *block, byte *page,
                                void *page_zip_, lsn_t newest_lsn,
                                bool skip_checksum, bool skip_lsn_check) {
...
  mach_write_to_8(page + FIL_PAGE_LSN, newest_lsn);
```
So, the page tracker will first take a look at the old FIL_PAGE_LSN value, compare it to `buf_pool->track_page_lsn` and make a decisions:
if(old.page.FIL_PAGE_LSN <  buf_pool->track_page_lsn){
  store its page id into a data structure for current group tracking group
} else {
  no need to do anything, as this page is already 'tracked' in the current group
}

It was only after I recalled the concept of 'groups' when it dawned on me why we don't use the optimization I've mentioned.
We permit multiple calls to "start tracking" to create several points on the lsn timeline - various clients might be interested about different "fences".
For space efficiency, we group all of them and use a single append-only storage of page ids for all of them, and just mark the starting point for each of them, so they share the common suffix of this list of ids.
Whenever a new 'start' request comes, we add a new marker, and bump the *global* buf_pool->track_page_lsn.
That means we only remember the "right-most fence" in buf_pool->track_page_lsn.
What we really promised to our clients is to track sheeps jumping inside any of their fences. But we decided for some reason to remember only the right-most fence's position.
Therefore, if a sheep made a jump which starts before that right-most fence, we can't rule out the possibility that it has jumped inside one of the earlier fences to the left, as we don't remember their positions.
Therefore, we have to track this page/sheep, to err on the safe side.
I think, this could be optimized by remembering what is the left-most fence position in our group, and not track jumps which fall to the left of it, but we don't - perhaps I am missing some correctness issue with this, or it was just deemed overcomplicated?

I don't understand this phrase:
"after the page-tracking start LSN(100) and up to checkpoint LSN 90 "

Do you mean an empty range 100<= x <= 90 ?

Perhaps the confusion lays somewhere in how you understand the concept of "a modified page"?
Perhaps you think we are tracking "IO writes to tablespace files over wall-clock time"?
That's not what we are doing. We are tracking "modifications performed (mostly by mini-transactions) to the pages over LSN axis".
This is about the "conceptual modifications" (mini-transactions) of "conceptual pages" (so we don't care if they are on disc or Buffer Pool).
We are talking about things like modifications of individual bytes of a block done through mini-transaction.
Such an even takes a place "at LSN=X" if the mtr which did it, has mtr.commit_lsn = X.
So, when you ask us, "what pages have been modified in range [LSN_A, LSN_B]" we will basically give you the same answer, which you would get if you:
a) scanned the redo log from LSN_A to LSN_B and noted down all page_ids in it
b) captured all changes done to Buffer Pool pages by minitransactions which have committed in range LSN_A,LSN_B
modulo several discrepancies, like:
- we will probably err on the safe side, by providing you a supper set of "a)", because page tracking might have to round LSN_A and LSN_B to some points on the timeline for which it knows the answer, and write from LRU can cause a page to be written to disc because of changes which were above stop_lsn/LSN_B/checkpoint_lsn
- some pages get modified bypassing redo log, or bypassing mini-transactions infrastructure

This is provided Page Tracker will give you any answer at all.
Because it might answer "I don't know yet, some changes from range [LSN_A,LSN_B] have not yet reached the IO layer, and thus I am not aware of all the pages modified in this range yet, please wait, until all of them get processed by page cleaners".
And once all changes with LSN smaller than LSN_B done to Buffer Pool pages by mini-transactions get written back to disc, the checkpoint_lsn will be bumped to at least LSN_B, and then the Page Tracker will know that it can now answer your question, and then it will answer it same way a) or b) would (modulo the caveats I've listed).

Looking at your example again, I think that perhaps what you really wanted to know is the set of pages which were modified in the LSN range [80,90]?
The page tracker can not answer that, because it has no idea about some of the modifications below 100, and will never learn about it no matter how long you will wait.
This is simply because at the wall-clock moment when you asked it to start, some pages modified at LSN=87 could have already been written back to disc, for example due to LRU pressure, and they will never be written to disc again, so the Page Tracker will have no opportunity to learn about them.

Thank you for the clarification.

The part which is causing confusion is what we get when we start the page-tracking. 

when we start page tracking it is started from the current LSN not from a checkpoint LSN.
So in the example, the current LSN was 100 and the checkpoint LSN was 80. but no interest in checkpoint LSN 80. 

My understanding of the reason behind returning the current system LSN is to ensure we can track all the pages from the wall clock. Because the current system LSN would be the maximum LSN at that time. so every page would be tracked which gets flushed. 

that's why the pagetracking interface would return 100 and not 80. let's say the wall clock time 00. 

With the current logic, we are sure every page would be tracked from wall clock time 00. every page LSN would be less than the current LSN 100 as it the maximum LSN.

at Wall clock time 01, The checkpoint LSN is 90, 
Now I am interested in getting modified pages from wall clock 00-01. so I have to pass page-tracking-start-lsn 100 and current checkpoint LSN 90 to the interface. 

select get_count(page_tracking_start_LSN, current_checkpoint_LSN); 
select get_count(100,90). 
so 90 should be less than the server checkpoint LSN.

If you expect a functionality like "tell me, what pages were written to disc between these two wall-clock times" then we don't have such functionality.

It looks like you try to "emulate" it by converting your imaginary "count_pages_written_between(time_then, time_now)" into 
"mysqlbackup_page_track_get_changed_page_count( system_lsn_at(time_then), checkpoint_lsn_at(time_now))" and expect it to work, but this makes no sense.
In particular it leads you to calling our API with an empty range which starts at 100 and ends at 90.
The API we have, and offer, is: "tell me (an upper bound on) how many pages were modified in the range between these two LSNs".
This naturally requires the first argument to be smaller than the later to make any sense.

You can see how it is implemented in:

```
int Arch_Page_Sys::get_num_pages(lsn_t &start_id, lsn_t &stop_id,
                                 uint64_t *num_pages) {
```
In particular it calls:
int Arch_Page_Sys::fetch_group_within_lsn_range(lsn_t &start_id, lsn_t &stop_id,
                                                Arch_Group **group) {
...
  if (start_id >= stop_id || start_id == LSN_MAX || stop_id == LSN_MAX) {
    return (ER_PAGE_TRACKING_RANGE_NOT_TRACKED);
  }
```
otherwise, it finds a smallest group which contains the range you've asked about.
Then it finds the closest start (reset) point which is to the left of start of the range you requested. If there is none, it gives up.
Then it finds the closest stop point to the right of it. I think "stop point" here means any checkpoint:
```
  /** Vector of stop points corresponding to a file.
  Stop point refers to the stop lsn (checkpoint lsn) until which the pages are
  guaranteed to be tracked in a file. Each block in a file maintains this
  information.
  Protected by Arch_Page_Sys::m_oper_mutex.
  @note used only by the page archiver */
  std::vector<lsn_t> m_stop_points;
```
Then, AFAICT it simply counts how many records are there, between those two points, not even trying to exclude duplicates.

I do not expect to get_pages from time_then and time_now. maybe It was a fail attempt to explain the problem :)

Please correct me where I get wrong here.  

my understanding/expectations from 
mysqlbackup_page_track_get_changed_page_count(page_tracking_start_LSN, current_checkpoint_LSN) is to return pages when I started the page tracking to checkpoint LSN I have provided. 

With every checkpoint server guarantee to track modified pages from the page_tracking start LSN to that checkpoint LSN by the logic we have discussed above. 

SO at any time server can provide me modified pages from the page_tracking_start_LSN to end_LSN. END_LSN should not be more than the checkpoint LSN. As pages get into tracking system only during the checkpoints. 

It takes two different type of argument one is the system LSN and the other is checkpoint LSN. that's why it is not natural to expect checkpoint_LSN to be more than system LSN when we started the page_tracking because a system with redo log. checkpoint LSN and system LSN could be way apart. 

I tried removing the condition with the attached patch and As long as I give end_LSN below the server current checkpoint LSN.  the server gives me modified pages.

as long as the below condition is satisfied server will get a group to work on

```
  if (start_id < (*group)->get_begin_lsn() ||         public: lsn_t get_end_lsn() const
            (!(*group)->is_active() && stop_id > (*group)->get_end_lsn()) ||
            ((*group)->is_active() && stop_id > latest_stop_lsn)) {
```

> my understanding/expectations from 
> mysqlbackup_page_track_get_changed_page_count(page_tracking_start_LSN, current_checkpoint_LSN) is to return pages when I started the page tracking to checkpoint LSN I have provided. 

What do you mean by "pages [from] when I started the page tracking to checkpoint LSN I have provided"?

Can you please define this concept in coherent way?

Sorry, I am able to explain the problem.

But is what backup/clone expects from page tracking.

1) Ask the server to start page-tracking. In our example, the server returns 100 as page-tracking LSN. so we call it page-tracking start LSN. At that moment checkpoint LSN was 80. But that is not our interest.  we would use this page-tracking start LSN to get the list of the modified page later. 

 server now guarantees that  any page flushed after will be tracked( 100 is the maximum LSN at that moment, any page will have LSN < 100, so all the pages flushed followed now will be tracked)

Backup/clone copies ibd without copying the redo logs. 

After copying data files. Backup/clone asks the server to provide the list of modified pages from the moment page tracking was started to the current checkpoint LSN. As they would copy the redo logs after the checkpoint LSN and can rely on them for modified pages.
 
The input to get_changes_pages would be page-tracking start LSN and an LSN less than the current checkpoint. 
get_pages(100,90) Should work. I do not understand why the second argument should be more than first. one is system LSN and the other checkpoint LSN. and comparison of them doesn't make any sense. 

All we need to ensure is the second argument should not be more than the checkpoint LSN of the system. 

Now clone doesn't need to wait for checkpoint LSN to be more than page tracking LSN. But the server has put this restriction on mysqlbackup component.