Bug #64340 race on no_flush event during buf_flush_free_margin
Submitted: 15 Feb 2012 14:07
Modified: 14 Jun 2013 15:35
Reporter: Mark Callaghan
Status: Verified
Category: MySQL Server: InnoDB Plugin storage engine
Severity: S5 (Performance)
Version: 5.1.52, 5.1.61
OS: Any
Assigned to:
CPU Architecture: Any
Tags: innodb, performance

[15 Feb 2012 14:07] Mark Callaghan
Description:
This isn't a bug and it isn't horrible for performance, but it is unexpected to me. A thread can call buf_flush_wait_batch_end in buf_flush_free_margin while another thread is already doing that work in buf_flush_batch. buf_flush_wait_batch_end does os_event_wait on no_flush[BUF_FLUSH_LRU], but the thread doing the work in buf_flush_batch doesn't reset that event until it reaches buf_flush_page, and by then it has locked and unlocked the buffer pool mutex several times after setting init_flush[BUF_FLUSH_LRU] to TRUE.

        if (buf_pool->n_flush[flush_type] == 0) {

                os_event_reset(buf_pool->no_flush[flush_type]);
        }

The result of this should be that many threads can call buf_flush_wait_batch_end only to return immediately from os_event_wait, because the event has not been reset yet.
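
The window described above can be illustrated with a small single-threaded model. This is only a sketch under assumptions, not the InnoDB code: every `sketch_*` name is hypothetical, the event is reduced to a boolean, and it assumes a manual-reset event on which a waiter blocks only while the event is not set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a manual-reset event: a waiter blocks only
   while the event is NOT set, and passes straight through while it is set. */
typedef struct {
    bool is_set;
} sketch_event_t;

/* Hypothetical, much-reduced model of the per-flush-type buffer pool state. */
typedef struct {
    bool           init_flush; /* a batch has been started */
    unsigned       n_flush;    /* pages with a flush write pending */
    sketch_event_t no_flush;   /* set means "no flush in progress" */
} sketch_pool_t;

static void sketch_pool_init(sketch_pool_t *pool)
{
    pool->init_flush = false;
    pool->n_flush = 0;
    pool->no_flush.is_set = true;
}

/* Start of a batch: the flag is raised, but the event is left set. */
static void sketch_batch_begin(sketch_pool_t *pool)
{
    assert(!pool->init_flush);
    pool->init_flush = true;
}

/* First page of the batch: only here is the event reset. */
static void sketch_flush_page(sketch_pool_t *pool)
{
    if (pool->n_flush == 0) {
        pool->no_flush.is_set = false;
    }
    pool->n_flush++;
}

/* What a caller checks before deciding to wait. */
static bool sketch_flush_in_progress(const sketch_pool_t *pool)
{
    return pool->init_flush || pool->n_flush > 0;
}

/* True if a waiter on no_flush would actually block right now. */
static bool sketch_wait_would_block(const sketch_pool_t *pool)
{
    return !pool->no_flush.is_set;
}
```

Between `sketch_batch_begin` and the first `sketch_flush_page`, `sketch_flush_in_progress` is true while `sketch_wait_would_block` is false. In this model, that is the interval in which a waiter falls straight through the wait.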

I don't understand the performance impact of this, but it doesn't seem right. I also don't get why callers have to check both init_flush[BUF_FLUSH_LRU] and n_flush[BUF_FLUSH_LRU] to determine that a flush is in progress. If that is the case then this code is too complex. It should be sufficient to check init_flush.

How to repeat:
read the code

Suggested fix:
Don't make waiters check both init_flush and n_flush; checking init_flush should be sufficient.
Reset the event before buf_flush_page, at the point where init_flush is set.
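
One way to read the suggested fix is sketched here, again as a simplified single-threaded model rather than a patch. The `fix_*` names are hypothetical, the event is reduced to a boolean, and pending page writes (n_flush) are deliberately left out, so a real batch end would need more care than this shows. The assumption is that the flag and the event are changed in the same step, which in real code would mean under the same mutex hold.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced state: one flag plus the event's set/reset state. */
typedef struct {
    bool init_flush;   /* a batch has been started */
    bool no_flush_set; /* manual-reset event: set means "no flush running" */
} fix_pool_t;

static void fix_pool_init(fix_pool_t *pool)
{
    pool->init_flush = false;
    pool->no_flush_set = true;
}

/* Begin a batch: raise the flag and reset the event together, so there is
   no interval in which the flag is TRUE while the event is still set. */
static void fix_batch_begin(fix_pool_t *pool)
{
    assert(!pool->init_flush);
    pool->init_flush = true;
    pool->no_flush_set = false;
}

/* End a batch: clear the flag and set the event together. */
static void fix_batch_end(fix_pool_t *pool)
{
    assert(pool->init_flush);
    pool->init_flush = false;
    pool->no_flush_set = true;
}

/* With the two updates paired, a waiter blocks exactly when init_flush is
   TRUE, so checking init_flush alone is enough in this model. */
static bool fix_wait_would_block(const fix_pool_t *pool)
{
    return !pool->no_flush_set;
}
```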
[18 Mar 2012 18:33] Valeriy Kravchuk
Thank you for the problem report.
[13 Jun 2013 17:48] Inaam Rana
Mark,

As you know, buf_flush_free_margin() no longer exists in 5.6. I am trying to understand your reasoning around the batch end wait. Are you saying that threads calling buf_flush_wait_batch_end can, in some cases, return before the batch has ended? If so, can you explain a bit further how this can happen?
[14 Jun 2013 15:35] Mark Callaghan
I think this should be closed. The code has changed significantly.