MySQL Bugs: #44140: Insert buffer operation may destroy the page during its recovery process

Bug #44140	Insert buffer operation may destroy the page during its recovery process
Submitted:	8 Apr 2009 1:30	Modified:	12 Apr 2014 18:50
Reporter:	Yasufumi Kinoshita	Email Updates:
Status:	Won't fix	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.0 5.1	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	Contribution

Description:
Sometimes (it is really rare), We meet the error message during crash recovery of InnoDB.

090405  4:29:47  InnoDB: ERROR: We were only able to scan the log up to 1319 813701532
InnoDB: but a database page a had an lsn 1319 814039859. It is possible that the
InnoDB: database is now corrupt!

I accept the message as,,
"There have been some logged modification to certain page before the last recv_recover_page() for the page"

And I have caught the modification.
#0  0x00000000004a5b60 in log_close ()
#1  0x00000000004d93b9 in mtr_commit ()
#2  0x000000000042e572 in ibuf_delete_rec ()
#3  0x000000000042f314 in ibuf_merge_or_delete_for_page ()
#4  0x000000000049724e in buf_page_io_complete ()
#5  0x00000000004cbdd7 in fil_aio_wait ()
#6  0x000000000040c678 in io_handler_thread ()

I suspect ibuf_merge_or_delete_for_page() from buf_page_io_complete() (triggered by recv_read_in_area()) and recv_recover_page() from recv_apply_hashed_log_recs() can conflict.

How to repeat:
Sorry, it may be rare case.
I don't have 100% procedure...

Suggested fix:
recv_recover_page() of log0recv.c:

==================================
        if ((recv_addr == NULL)
            || (recv_addr->state == RECV_BEING_PROCESSED)
+           || (recv_addr->state == RECV_BEING_READ && !just_read_in)
            || (recv_addr->state == RECV_PROCESSED)) {

                mutex_exit(&(recv_sys->mutex));

                return;
        }

==================================

RECV_BEING_READ only by io_thread.

It may have another reason. I will analyze more.

Yasufumi, thank you for the bug report. While you are looking further to it,
I'm assigning it to Heikki to review your initial findings.

Yasufumi, how did the server crash? Just abort or SIGKILL, or was it something
hard like power failure? Which versions of MySQL and InnoDB are you using
exactly?

Michael,
The server might not crash at the time by this reason.
But sometimes (rare), I met assertion failure of detecting page corruption and page dump.
(Though at some of our trial, the datafile might be already corrupted by half recovery before the recovery ...)
Anyway, at least, this patch seems to stabilize the recovery.
Is it my misunderstanding?

Yasufumi, I understand that you get this assert during crash recovery,
therefore there must have been a mysqld crash previously. What were the reasons
for the original crash? Or are you doing recovery as part of xtrabackup restore
by any chance? As far as I understand the assert, one way for it to happen
is to simply copy InnoDB files and logs while InnoDB is running.
Regarding:
>I suspect ibuf_merge_or_delete_for_page() from buf_page_io_complete() (triggered by
recv_read_in_area()) and recv_recover_page() from recv_apply_hashed_log_recs() can
conflict.

I don't think they can because buf_page_io_complete() would call recv_recover_page() before calling ibuf_merge_or_delete_for_page(),
then recv_apply_hashed_log_recs() will not call recv_recover_page() again since
executions of recv_recover_page() are well synchronized.

Your proposed change seems a noop on the surface since at present a page is either recovered by buf_page_io_complete() or by recv_apply_hashed_log_recs(), whichever stumbles on the page first. With you patch recv_apply_hashed_log_recs() abandons newly read pages in favor of buf_page_io_complete().
This may have unclear performance consequences (more parallel recovery?) but I don't see how it changes the functionality.

Could you also show the stack backtrace of mysqld at assertion time?

Michael,

As you say, I uses as part of xtrabackup. We are testing it many time. I think it is also very effective debugging of InnoDB recovery process.

I have not reproduced the phenomena intentionally yet...

But I have not understood which mutex/lock synchronize recv_recover_page() yet...
(e.g. block->lock doesn't:
both of recv_recover_page() and ibuf_merge_or_delete_for_page() use rw_lock_x_lock_move_ownership(&(block->lock)).
)

How do you think about?

>I think it is also
very effective debugging of InnoDB recovery process.

it is and we do appreciate your efforts and sharing your findings with us.
However in this case it doesn't seem to be a bug in innodb recovery at least
as code analysis shows, also for lack of reproducing it through a normal
crash, not xtrabackup. Therefore I regretfully have to declare this 'Not a bug'.
In fact the easy way to get the assert in question would be to use OS copy on InnoDB files and logs while InnoDB is running. 
Regarding your other question, my understanding is that recv_recover_page is synchronized by a global mutex that protects recv_addr->state for any page.
There's another relevant code path, you can find it by grepping, a function that ends in 'backup' and mentions 'no latching'. But I understand that this function isn't used in normal InnoDB operation.

Michael,

I'd like to reply once more.

I think RDBMS must aim perfect, not be optimistic.

I think there are certain possibility to hit the bug even at normal using,
though it is rare case.

recv_addr->state does only synchronize but doesn't ordering.
RECV_BEING_READ state is ignored in current version.

The problem can occur from recv_apply_hashed_log_recs(TRUE).
At least, it is called from fsp_header_get_free_limit() at innobase_start_or_create_for_mysql().
(I have met such stack trace before.)

recv_read_in_area() (from recv_apply_hashed_log_recs()) does read ahead the pages in recv_addr in the same 1MB area.

If
the recv_recover_page() from buf_page_io_complete()
and
the recv_recover_page() for the same page from recv_apply_hashed_log_recs()
called at the same time and the later overtakes the former,

what does exclude the execution of
ibuf_merge_or_delete_for_page() from buf_page_io_complete() (next of recv_recover_page())
and
the recv_recover_page() for the same page from recv_apply_hashed_log_recs()
?

Or do you guarantee that there should be no recv_addr at fsp_header_get_free_limit()?

I have been disappointed...

Thank you.

Yasufumi,

>I think RDBMS must aim perfect, not be optimistic.

I agree. But perfection includes supported modes of operation,
it doesn't extend to crash recovery being expected to work on something
which isn't result of a crash. 

>I think there are certain possibility to hit the bug even at normal using,
though it is rare case.

when this happens or when you have a clear explanation as to what is wrong
with the code to such a degree that it could lead to a crash, 
please reopen the bug. 

>what does exclude the execution of
ibuf_merge_or_delete_for_page() from buf_page_io_complete() (next of recv_recover_page())
and
the recv_recover_page() for the same page from recv_apply_hashed_log_recs()
?

first recv_recover_page() will set recv_addr->state = RECV_BEING_PROCESSED,
then to RECV_PROCESSED, under recv_sys->mutex.

second recv_sys->mutex(), also under recv_sys->mutex, checks recv_addr->state
and if it finds it to be either of the above, it exits immediately.

The ibuf processing is done when recv_addr->state == RECV_PROCESSED so recv_recover_page wouldn't try to recover the page.

regarding
>RECV_BEING_READ state is ignored in current version.

it doesn't seem to be true

 if (recv_addr->state == RECV_NOT_PROCESSED) {
                                recv_addr->state = RECV_BEING_READ;

                                page_nos[n] = page_no;

                                n++;
                        }

it aims to prevent the same page being included multiple times to the read array.

All of the above is so simple that either you're looking at the wrong place at the source of the perceived problem, or there's no problem, or there's something trivial I'm missing which is entirely possible since this is the first time I look at this code closely. I don't think bugdb is a proper venue to discuss the code in such detail, there are forums for that, email etc.

>I have been disappointed...

I'm sorry you feel that way.

Michael,

Why don't you understand? You didn't answer to my questions at all.

There are two ways recv_recover_page() to be called.
There are possibility to be called at the same time from different thread.

(1) by read IO thread
recv_recover_page()
        buf_page_io_complete()
                fil_aio_wait()
                        io_handler_thread()

In buf_page_io_complete(), ibuf_merge_or_delete_for_page() is also called
after calling recv_recover_page().

ibuf_merge_or_delete_for_page() "must" be after recv_recover_page(),
because it destroys the page.

The IO requests are caused by
buf_read_recv_pages()
        recv_read_in_area()
                recv_apply_hashed_log_recs()

recv_read_in_area() does read ahead, requests pages in the same 1MB.

(2) by main thread
recv_recover_page()
        recv_apply_hashed_log_recs()

Because recv_read_in_area() does read ahead,
and block->lock doesn't work during the two threads,
The following situation can be occur.

-----

1. (read IO for the page is completed and registered to buf_page_hash)
        recv_addr->state == RECV_BEING_READ at this phase

2. buf_page_peek() and recv_recover_page() from "main thread"
        recv_addr->state = RECV_BEING_PROCESSED;
        and start processing the page.

3. recv_recover_page() from IO thread
        do nothing because recv_addr->state == RECV_BEING_PROCESSED.

4. ibuf_merge_or_delete_for_page() from IO thread
        if during recv_apply_hashed_log_recs(allow_ibuf:TRUE),
        recv_no_ibuf_operations == FALSE, and it is called.
        If the page of ibuf exists in the buf_pool already,
        the record is processed at once.

Now, "2. recv_recover_page()" and "4. ibuf_merge_or_delete_for_page()"
seems not to be synchronized.

I think "2. recv_recover_page()" is problem.
So I have suggested the fix not to processed at
recv_recover_page() from main thread when recv_addr->state == RECV_BEING_READ.

#####################################################
How do you explain its safety?
Why do you conclude that it doesn't occur at mysqld?
#####################################################

Thank you.

Michael,

Why don't you understand? You didn't answer to my questions at all.

There are two ways recv_recover_page() to be called.
There are possibility to be called at the same time from different thread.

(1) by read IO thread
recv_recover_page()
        buf_page_io_complete()
                fil_aio_wait()
                        io_handler_thread()

In buf_page_io_complete(), ibuf_merge_or_delete_for_page() is also called
after calling recv_recover_page().

ibuf_merge_or_delete_for_page() "must" be after recv_recover_page(),
because it destroys the page.

The IO requests are caused by
buf_read_recv_pages()
        recv_read_in_area()
                recv_apply_hashed_log_recs()

recv_read_in_area() does read ahead, requests pages in the same 1MB.

(2) by main thread
recv_recover_page()
        recv_apply_hashed_log_recs()

Because recv_read_in_area() does read ahead,
and block->lock doesn't work during the two threads,
The following situation can be occur.

-----

1. (read IO for the page is completed and registered to buf_page_hash)
        recv_addr->state == RECV_BEING_READ at this phase

2. buf_page_peek() and recv_recover_page() from "main thread"
        recv_addr->state = RECV_BEING_PROCESSED;
        and start processing the page.

3. recv_recover_page() from IO thread
        do nothing because recv_addr->state == RECV_BEING_PROCESSED.

4. ibuf_merge_or_delete_for_page() from IO thread
        if during recv_apply_hashed_log_recs(allow_ibuf:TRUE),
        recv_no_ibuf_operations == FALSE, and it is called.
        If the page of ibuf exists in the buf_pool already,
        the record is processed at once.

Now, "2. recv_recover_page()" and "4. ibuf_merge_or_delete_for_page()"
seems not to be synchronized.

I think "2. recv_recover_page()" is problem.
So I have suggested the fix not to processed at
recv_recover_page() from main thread when recv_addr->state == RECV_BEING_READ.

#####################################################
How do you explain its safety?
Why do you conclude that it doesn't occur at mysqld?
#####################################################

Thank you.

I'd like to obtain exact opinion rather than quick.
I can wait long for your response.

Thank you.

Yasufumi, I was answering your questions, as a matter of fact, and
I hope I helped you to clearly describe the problem in the code.
I did mention a possibility I could be wrong.
Yes it would seem there's a problem but obviously I have to talk to the experts
as to how to proceed. Nonetheless it's a possible bug resulting from code 
review. As for the crash itself, we can't support problems resulting 
from 3rd party product usage such as xtrabackup. And it would be nice if xtrabackup
had been mentioned at bug filing time, not after my considerable questioning.

To summarize the code issue:

5.1 "buf/buf0buf.c" buf_page_io_complete()

the code below is problematic

                if (recv_recovery_is_on()) {
                        recv_recover_page(FALSE, TRUE, block->frame,
                                          block->space, block->offset);
                }

                if (!recv_no_ibuf_operations) {
                        ibuf_merge_or_delete_for_page(
                                block->frame, block->space, block->offset,
                                TRUE);
                }

recv_recover_page() may abandon recovery because redo apply is recovering the page and as such its recv_addr->state == RECV_BEING_PROCESSED.
then ibuf_merge_or_delete_for_page() jumps at the same page
which is well BEING_PROCESSED by recovery so we may have a problem.

Yasufumi's solution of prohibiting redo apply interfere with buf_page_io_complete by virtue of RECV_BEING_READ looks valid but of course
I may be wrong here :) in any case there are experts for this.

Assigning to Inaam. I'm still not sure why this is not protected
at block level but I believe Yasufumi has an opinion on that.
Yasufumi, could you post or upload the stack backtrace from all threads
at failure time, the result of 'thr a a bt' in gdb?

forgot to set the assignment. Also setting status to Feedback because we need that stack backtrace.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Hello Yasufumi - 

In order for us to continue the process of reviewing your contribution to MySQL - We need you to review and sign the Sun|MySQL contributor agreement (the "SCA")

The process is explained here: 
http://forge.mysql.com/wiki/Sun_Contributor_Agreement

Getting a signed/approved SCA on file will help us facilitate your contribution-- this one, and others in the future.

Thank you ! 

Liz Drachnik  - Program Manager - MySQL

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Is request to sing "Sun|MySQL contributor agreement" 
still actual ?