Bug #59163 Single user session performance in InnoDB
Submitted: 24 Dec 2010 17:20 Modified: 5 Jan 2012 0:37
Reporter: Dimitri KRAVTCHUK
Status: Closed
Category: MySQL Server: InnoDB storage engine    Severity: S2 (Serious)
Version: 5.5.8    OS: Any
Assigned to: Inaam Rana    CPU Architecture: Any
Tags: innodb, os_thread_sleep, single session, srv_activity_count

[24 Dec 2010 17:20] Dimitri KRAVTCHUK
Description:
On an OLTP workload with a low number of user sessions (1-2), InnoDB performs far too many write operations. More detailed tracing of the Master thread code shows that the internal activity counter (srv_activity_count) is not incremented while only a single user session is running, which makes the Master thread think the server is idle, so it moves to the "background loop" and starts flushing all available dirty pages. This flushing creates 10-20x more write activity than needed, which is abnormal. With 4 or more active user sessions the activity counter is incremented correctly, and we then observe 10x lower I/O activity with a 2-4x higher performance level.

How to repeat:
Reproducible with the dbSTRESS test, but I think it can easily be repeated with sysbench too - just run a single-session Read+Write test and observe your I/O activity (iostat) and InnoDB page writes/sec. Then compare it with 4 user sessions.

Suggested fix:
My "temporary fix" is to check not only "srv_activity_count" changes to see if there was some activity, but also the current LSN - if it was changed there was an activity for sure.. But I think someone well knowing InnoDB code may find a real fix :-)
[4 Jan 2011 4:41] Mark Callaghan
commit and rollback increment srv_activity_count
every 32nd row insert/update/delete also increments it
[5 Jan 2011 16:47] Inaam Rana
Dimitri,

I have tried sysbench with a single session and did not see the master thread moving to the 'background loop'. I do see a relatively large batch flushed every ten seconds. The vmstat output looks like:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1    140 7247556 1430388 22024476    0    0  4176   100 1566 2500  1  0 88 12  0
 0  1    140 7243712 1430388 22024476    0    0  3888    76 1520 2300  0  0 87 12  0
 0  1    140 7240116 1430388 22024476    0    0  3472   104 1487 2219  0  0 87 12  0
 0  1    140 7236024 1430388 22024476    0    0  3920 34744 3940 3752  2  0 87 11  0
 0  1    140 7233296 1430388 22024476    0    0  2656    56 1357 1601  0  0 88 12  0
 0  1    140 7229452 1430388 22024480    0    0  3840   104 1523 2456  1  0 88 12  0
 0  1    140 7225732 1430388 22024480    0    0  3712    80 1500 2363  1  0 88 12  0
 0  1    140 7222136 1430388 22024480    0    0  3552    80 1482 2363  0  0 87 12  0
 0  1    140 7218540 1430388 22024480    0    0  3600    80 1485 2435  1  0 88 12  0
 0  1    140 7214324 1430388 22024480    0    0  4144   100 1561 2891  1  0 87 12  0
 0  1    140 7210108 1430388 22024480    0    0  4256   100 1578 2650  1  0 87 12  0
 0  1    140 7206140 1430388 22024480    0    0  3936    92 1536 2655  1  0 88 12  0
 0  1    140 7201800 1430388 22024480    0    0  4240    92 1574 2829  0  0 87 12  0
 0  2    140 7196592 1430388 22024480    0    0  5152 31404 3831 4205  2  0 86 11  0
 0  1    140 7193740 1430388 22024480    0    0  2848  6520 1829 2276  1  0 88 12  0
 0  1    140 7190764 1430388 22024480    0    0  2944    88 1405 1993  0  0 87 12  0

The large batch that we do above every ten seconds is encoded in the 10_second loop. We don't do a similar large batch when running with 4 or more sessions because of the following condition:

        if (n_pend_ios < SRV_PEND_IO_THRESHOLD
            && (n_ios - n_ios_very_old < SRV_PAST_IO_ACTIVITY)) {

What this means is that we do a batch iff the currently pending IOs are less than 3% of innodb_io_capacity and the number of IOs performed in the last 10 seconds is less than 200% of innodb_io_capacity. This condition is likely to be met when we are running a single session (there are only so many reads you can do with a single thread), and therefore a full 100% innodb_io_capacity flush batch is triggered.
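For illustration, here is a standalone sketch of that decision (not the actual InnoDB source; the function and variable names are made up, only the 3% and 200% thresholds follow the condition quoted above):

#include <stdbool.h>

/* Illustrative only: decide whether the 10_second loop should run its full
   flush batch, given the i/o counters described above. */
static bool
should_run_ten_second_flush(
        unsigned long   io_capacity,    /* innodb_io_capacity */
        unsigned long   n_pend_ios,     /* i/o requests pending right now */
        unsigned long   n_ios,          /* total i/o operations so far */
        unsigned long   n_ios_very_old) /* the same counter ~10 seconds ago */
{
        unsigned long   pend_threshold = io_capacity * 3 / 100;   /* 3% */
        unsigned long   past_threshold = io_capacity * 2;         /* 200% */

        return(n_pend_ios < pend_threshold
               && n_ios - n_ios_very_old < past_threshold);
}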

But again, the above batch is triggered only once per ten seconds. If you are sure that you see continuous, aggressive flushing activity coming from the background_loop then please give me some more pointers to reproduce, or tracing data.
[5 Jan 2011 16:50] Inaam Rana
You can figure out where your master thread is spending time by the following output from 'show innodb status':

-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 182 1_second, 182 sleeps, 18 10_second, 3 background, 3 flush
srv_master_thread log flush and writes: 190

If the master thread is going to the background and/or flush loop then the above counters should get incremented while the server is active.
[6 Jan 2011 13:39] Dimitri KRAVTCHUK
Inaam,

just replayed my test again and here is the output with counters:

mysql> pager grep -i master
...

mysql> show engine innodb status \G
srv_master_thread loops: 278764 1_second, 278764 sleeps, 1221 10_second, 266559 background, 266219 flush
srv_master_thread log flush and writes: 44
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 280022 1_second, 280022 sleeps, 1256 10_second, 267467 background, 267127 flush
srv_master_thread log flush and writes: 45
1 row in set (0.01 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 283122 1_second, 283121 sleeps, 1350 10_second, 269626 background, 269284 flush
srv_master_thread log flush and writes: 51
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 284678 1_second, 284678 sleeps, 1394 10_second, 270743 background, 270397 flush
srv_master_thread log flush and writes: 56
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 287202 1_second, 287202 sleeps, 1462 10_second, 272587 background, 272233 flush
srv_master_thread log flush and writes: 59
1 row in set (0.00 sec)
...

As you can see, the Master thread is going to the background loop much more often than to the 10sec loop, while there is a constant dbSTRESS Read+Write workload with one user session.. Also, the IO capacity is set to 2000, but I see 10,000 writes/sec constantly during this issue..

Seems to me I have to prepare a tarball with a dbSTRESS test kit to play with.

Rgds,
-Dimitri

[6 Jan 2011 14:32] Inaam Rana
Dimitri,

Can you try setting the following to 1 and see if that fixes the problem:

#define INNOBASE_WAKE_INTERVAL  32

I think your 'temporary fix' is actually a good fix. srv_activity_count looks like a hack to tell the master thread that there is some work to do. It makes sense to decouple the activity count from master thread wakeup calls. I think a dirty read of log_sys::lsn is enough to tell us whether or not there has been any interesting activity in the server. The only thing I have to make sure of is that there is no read-only operation that relies on srv_activity_count in some way.
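To make the idea concrete, a minimal standalone sketch of such a check (illustrative only, not the actual patch; in InnoDB the current value would come from a dirty read of log_sys::lsn in the master thread):

#include <stdint.h>
#include <stdbool.h>

/* LSN value the master thread saw on its previous iteration. */
static uint64_t last_observed_lsn = 0;

/* Returns true if the LSN moved since the last check.  No mutex is taken:
   we don't need the exact value, only whether it changed. */
static bool
server_was_active(uint64_t current_lsn)
{
        bool    active = (current_lsn != last_observed_lsn);

        last_observed_lsn = current_lsn;
        return(active);
}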
[6 Jan 2011 17:07] James Day
Inaam, would that cause the server to be considered idle when most of the work it's doing is reads with only occasional writes? If yes, then a check of buffer pool pages read from disk might be useful also.
[6 Jan 2011 17:10] Dimitri KRAVTCHUK
Inaam,

recompiled now with 

#define INNOBASE_WAKE_INTERVAL 1

and it did not help..

So far, on the Read-Only single session:

mysql> show engine innodb status \G
srv_master_thread loops: 77407 1_second, 77407 sleeps, 268 10_second, 74731 background, 74660 flush
srv_master_thread log flush and writes: 7
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 79188 1_second, 79188 sleeps, 278 10_second, 76412 background, 76337 flush
srv_master_thread log flush and writes: 7
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 80810 1_second, 80810 sleeps, 287 10_second, 77944 background, 77869 flush
srv_master_thread log flush and writes: 7
1 row in set (0.00 sec)

Then on the Read+Write single session:

mysql> show engine innodb status \G
srv_master_thread loops: 292334 1_second, 292334 sleeps, 1215 10_second, 280188 background, 279869 flush
srv_master_thread log flush and writes: 39
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 294179 1_second, 294178 sleeps, 1276 10_second, 281421 background, 281099 flush
srv_master_thread log flush and writes: 47
1 row in set (0.00 sec)

mysql> show engine innodb status \G
srv_master_thread loops: 296496 1_second, 296496 sleeps, 1338 10_second, 283120 background, 282790 flush
srv_master_thread log flush and writes: 52
1 row in set (0.00 sec)

So, in both cases the Master thread is going to the background loop because the activity count variable remains unchanged..

Now regarding a final fix:

- checking LSN changes worked for me because in my case transactions were very short and committed fast.. - I'm not sure the same will also work for longer transactions.. - so it's probably better to check a DML counter here

- same for read-only: a read or SELECT counter should be checked - also, don't forget that reads may involve heavy use of temp tables, so background flushing may impact reads too..

- then for background flushing: it's writing without any limits! - while that's ok during a server shutdown, I don't think it's ok during normal operation; it should respect the IO capacity here (and in the read-only case probably still flush, but at 20-30% of the IO capacity)..

- kernel_mutex: I don't think use of kernel_mutex is justified here just to set or compare the activity count variable.. - we don't need the exact value, just to verify whether it has changed from the previously saved state.. - so why should we use a mutex here??

Rgds,
-Dimitri
[6 Jan 2011 17:27] Inaam Rana
1) log_sys::lsn gets incremented on each mtr_commit(), not trx_commit(). Even if we have long transactions the mtr remains a small unit. I think checking the LSN should work fine.

2) Yes, it makes sense to check read activity (as suggested by James as well). My only concern is that this can potentially have a significant impact on how the master thread works. I am not sure we should do this in a GA release. Do we have many clients affected by this in a serious manner? If not, I'll defer that part to 5.6. What do you suggest?

3) This is already addressed in 5.6 where page_cleaner thread flushes within bounds of innodb_io_capacity during background work and it does flush without sleeping during shutdown.

4) Agreed. I have removed the mutex calls in the LSN-based patch.
[6 Jan 2011 18:18] Mark Callaghan
Can this be a problem on replication slaves where at most one session changes non-temp innodb tables?
[6 Jan 2011 18:52] James Day
Mark, this was discovered as a result of me tuning for a 5.1/plugin case where slaves used only for failover were having trouble keeping up. No reads at all and no local writes, just the single replication thread changing things. So yes.

I just couldn't get write activity down as low as I wanted it and that caused Dimitri to wonder about my odd-seeming request to be able to set innodb_io_capacity below 100. When he investigated he found this interesting situation.

Inaam, replication throughput that this affects is a substantial bottleneck in our scale-out architecture so I'd much prefer to see it done early in 5.5's life. Replication lag is a common problem that this should help with. While the case that led to this discovery was write only the more normal case is replication only writes with a lot of reads in scale-out.

Testing first so the common tests don't show trouble would be good, of course. Maybe the results will be so clear that there's no need for a setting.

What worries you about checking reads that wasn't there already? Are you worried that there may be no background work done when a server is always busy with reads but would have had the work done before?

It would be helpful if there's a setting to control the read part, defaulting to the new way, so we can check any customer where we think there might be a problem. Maybe have it set the maximum number of times to skip based only on the read activity measure, defaulting to 25: decrement it each time read activity causes the background work to be skipped, reset it each time write activity blocks the work, and do the work anyway if there's no write activity and the counter is 0. If it looks like a problem we could set it to 0 and prove it, or adjust the setting to eliminate the problem. 25 because that's quite a long wait that would greatly reduce background activity, but short enough that some work will still be done - say catching up slowly after a burst of writes - and any regular cycle of trouble should show up on server monitoring graphs.
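A rough standalone sketch of that skip counter, just to pin the idea down (the setting and its default of 25 are the proposal above; everything else here, names included, is illustrative):

#include <stdbool.h>

static unsigned int     max_read_skips  = 25;   /* proposed setting, default 25 */
static unsigned int     read_skips_left = 25;

/* Decide whether the idle-time background work may be skipped this round. */
static bool
may_skip_background_work(bool write_activity, bool read_activity)
{
        if (write_activity) {
                /* Write activity already blocks the work: reset the budget. */
                read_skips_left = max_read_skips;
                return(true);
        }

        if (read_activity && read_skips_left > 0) {
                /* Skip because of reads, but only a bounded number of times. */
                read_skips_left--;
                return(true);
        }

        /* No write activity and the budget is spent: do the work anyway. */
        return(false);
}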
[6 Jan 2011 19:41] Inaam Rana
Dimitri,

Something is wrong here:

mysql> show engine innodb status \G
srv_master_thread loops: 292334 1_second, 292334 sleeps, 1215 10_second, 280188 background, 279869 flush

292334 1_second loops means that we have been running for around 81 hours. Are you running 5.5 or 5.1?

We also get the same number of sleeps. Given the following:
                srv_main_1_second_loops++;

                if (next_itr_time > cur_time
                    && srv_shutdown_state == SRV_SHUTDOWN_NONE) {

                        /* Get sleep interval in micro seconds. We use
                        ut_min() to avoid long sleep in case of
                        wrap around. */
                        os_thread_sleep(ut_min(1000000,
                                        (next_itr_time - cur_time)
                                         * 1000));
                        srv_main_sleeps++;
                }

It is hard to explain how we could possibly get so many iterations done in such little time.
[6 Jan 2011 20:57] Inaam Rana
James,

In issue #49787 the show innodb status tells:

-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 209633 1_second, 209632 sleeps, 20890 10_second, 955 background, 954 flush
srv_master_thread log flush and writes: 226103  log writes only: 346675

It appears we are not going to the background or flush loop that often. We are doing a 100% innodb_io_capacity flush every ten seconds, but given that the value is set to 200 (or is it 100?) it is hard to explain the write activity.
[7 Jan 2011 0:32] Sunny Bains
Dimitri you asked

"- kernel_mutex: I dont't think use of kernel_mutex is justified here
just to set or compare the activity count variable.. - we don't need
the exactly value, but just to verify if it was not become different
from the previously saved state.. - so why should we use a mutex
here??"

No one knows. I asked Heikki about this during the early days of the
kernel mutex split and he didn't remember either. It was removed early
on in the refactoring for 5.6.
[7 Jan 2011 11:12] Dimitri KRAVTCHUK
Inaam:

it's 5.5.8 for sure :-))

I don't really know what is wrong with these counters.. - is it possible that the Master thread is not sleeping?.. - to be honest, I did not monitor them when I discovered the problem, but just added extended log messages into the background loop to trace the activity count state..

I'll try to trace it down again..

Mark: 

yes, I think the issue *may* impact any workload with a low number of active sessions.. - but at the same time there may be some exceptions too, as Inaam, for example, did not reproduce the issue on a sysbench workload..

however, analyzing the code, you may see that potentially nothing is really protecting InnoDB from the described problem.. - so it should just be fixed, that's all :-)

Rgds,
-Dimitri
[7 Jan 2011 14:39] Inaam Rana
James,

OK, let's factor in the read activity as well. After staring at the code I realized that the tasks performed by the master thread during the active state, i.e. the 1_second and 10_second loops, are exactly the same as those performed during the background/flush loop. So by factoring in read-only activity we won't be blocking any task assigned to the master thread, we'll only change the frequency and quantum of work. I believe we should be fine as long as it is guaranteed that we are making progress on every task.

Sunny has come up with the idea of having an activity counter in mtr_commit(). I think it is a nice idea as it will capture both read and write activity. Let us try that. We'll need Dimitri's help to test the patch.
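To sketch the idea (hypothetical code, not the actual patch): since every page access, read or write, happens inside a mini-transaction, a counter bumped at the end of mtr_commit() would make read-only activity visible to the master thread as well. The name and placement below are illustrative:

/* Hypothetical counter and hook. */
static volatile unsigned long   srv_mtr_activity_count = 0;

static inline void
mtr_note_commit(void)   /* would be called at the end of mtr_commit() */
{
        /* A plain, possibly racy increment is enough: the master thread only
        checks whether the value changed since its last iteration. */
        srv_mtr_activity_count++;
}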
[12 Jan 2011 14:38] Dimitri KRAVTCHUK
Inaam,

I've found the real source of this bug...
it's hard to believe, but the Master thread simply never sleeps within its sleep loops while the load is low.. - and the problem comes from a buggy implementation of the "os_thread_sleep" function within InnoDB..

I don't know since when it has been like this (it seems it was previously based on the select() call), but now it is based on a call to the "nanosleep" function and does things wrongly..

let's take a look (sorry for the formatting):

-----------------------------------------------------------------------------
os_thread_sleep(
        ulint   tm)     /*!< in: time in microseconds */
{
  struct  timespec        rqtp; 

/* Ignore overflow etc.*/
    rqtp.tv_nsec = tm *1000;
    rqtp.tv_sec  = tm / 1000000000;

    return( (void)nanosleep(&rqtp, NULL) ); 
}
-----------------------------------------------------------------------------

what is wrong here:
  - the argument is a number in microseconds!
  - tv_nsec is not checked against its 999999999 maximum value..
  - tv_sec receives a value 1000 times lower than it should..

the result:
  - any sleep for 1sec or more will get an error from nanosleep()..
  - meaning that for any call to sleep one second or more, InnoDB will never sleep..

in my case:
  - on a low load the Master thread finishes its loop in less than 10ms (1 tick)
  - so the function giving the time in ms will return the same value again
  - the Master thread will call os_thread_sleep( 1000000 );
  - and will never sleep..
  - the activity count will not have changed yet..
  - so the Master thread goes to the background loop..

the fix is simple:
    rqtp.tv_nsec = tm *1000;
    rqtp.tv_sec  = tm / 1000000000;

change to:
    rqtp.tv_nsec = (tm *1000) % 1000000000;
    rqtp.tv_sec  = tm / 1000000;
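Putting those two corrected lines back into the quoted function gives roughly the following (still a sketch: overflow and EINTR are ignored, as in the original). For example, a request of tm = 1000000 (one second) now yields tv_sec = 1, tv_nsec = 0, instead of tv_sec = 0 and tv_nsec = 1000000000, which nanosleep() rejects with EINVAL without sleeping:

#include <time.h>

void
os_thread_sleep(
        unsigned long   tm)     /* in: time in microseconds */
{
        struct timespec rqtp;

        rqtp.tv_nsec = (tm * 1000) % 1000000000;
        rqtp.tv_sec  = tm / 1000000;

        (void) nanosleep(&rqtp, NULL);
}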

I don't know what the global impact of this bug is... - it may make many things work not as they are supposed to.. - so it should be fixed really urgently, I think..

Rgds,
-Dimitri
[12 Jan 2011 14:54] Mark Callaghan
os_thread_sleep uses select on 5.5.8. Where does it use nanosleep?

/*****************************************************************//**
The thread sleeps at least the time given in microseconds. */
UNIV_INTERN
void
os_thread_sleep(
/*============*/
        ulint   tm)     /*!< in: time in microseconds */
{
#ifdef __WIN__
        Sleep((DWORD) tm / 1000);
#else
        struct timeval  t;

        t.tv_sec = tm / 1000000;
        t.tv_usec = tm % 1000000;

        select(0, NULL, NULL, NULL, &t);
#endif
}
[12 Jan 2011 15:09] Dimitri KRAVTCHUK
Mark,

seems you're right.. 
have to check my source tree and then replay the test again..

Rgds,
-Dimitri
[20 Jan 2011 8:49] Dimitri KRAVTCHUK
Inaam,

retested now with the "right" 5.5.8 source tree; the problem is much less pronounced now as we're not going to the background loop, but it is still present due to the flush within the 10sec loop:
  - yes, within 10sec we're doing less than 200% of IO capacity
  - however, the real I/O load is 3x higher with 1-4 sessions compared to 16..

Why?..

the problem comes from the way the IO operations are accounted: we also count REDO log writes here, while when innodb_flush_log_at_trx_commit=2 is used REDO log writes involve far fewer physical writes than page flushing :-)  so with a higher load we have more "logical" REDO writes, which inflates the IO counter so much that over 10sec we appear to do more than 200% of IO capacity even though the real writes are page related.. - and skipping that 100%-of-IO-capacity flush every 10sec results in 3x lower writes..

Possible suggestions:
 - reduce the 10sec loop writes to only 10% of the IO capacity
 - skip the 10sec loop flushing if adaptive flushing is enabled (see the sketch below)
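A standalone sketch of the second suggestion (illustrative only; srv_adaptive_flushing stands in for the innodb_adaptive_flushing setting, the rest of the names are made up):

#include <stdbool.h>

static bool     srv_adaptive_flushing = true;   /* innodb_adaptive_flushing */

/* How many pages (as a share of innodb_io_capacity) the 10_second loop
   should still flush unconditionally. */
static unsigned long
ten_second_flush_quota(unsigned long io_capacity)
{
        if (srv_adaptive_flushing) {
                /* Adaptive flushing already sizes batches in the 1s loop. */
                return(0);
        }

        return(io_capacity);    /* old behaviour: a full 100% batch */
}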

Regarding the activity count: I think it should be improved anyway because, as you can see, the current implementation may have some issues.. (potential or not)

Rgds,
-Dimitri
[20 Jan 2011 15:25] Inaam Rana
Dimitri,

I don't think a full batch every 10 seconds is going to hurt performance in any serious way, but I think the idea of disabling the batch when adaptive_flushing is set does make sense. IMO this batch was meant to cover cases where we have idle capacity to utilize, but with the introduction of adaptive_flushing that part should actually be taken care of in the one-second loop.

I have posted a newer version of the patch internally at rb://562. Please try it and let me know if it works as advertised.
[5 Jan 2012 0:37] John Russell
Added to 5.6.3 changelog:

The mechanism that InnoDB uses to detect if the MySQL server is idle
was made more accurate, to avoid slowdown due to flush operations
that normally occur when no other activity is taking place. The
mechanism now considers that the server is not idle if there are
pending read requests for the InnoDB buffer pool.