MySQL Bugs: #74408: srv_conc_enter_innodb() is not optimal

Bug #74408	srv_conc_enter_innodb() is not optimal
Submitted:	15 Oct 2014 19:52	Modified:	4 Jul 2018 15:15
Reporter:	Inaam Rana (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	5.6, 5.7, 8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	performance

Description:
srv_conc_enter_innodb_with_atomics() has back off heuristics to control number of threads running inside InnoDB. Relevant code:

240 
241                                 if (srv_adaptive_max_sleep_delay > 0) {
242                                         if (srv_thread_sleep_delay > 20
243                                             && n_sleeps == 1) {
244 
245                                                 --srv_thread_sleep_delay;
246                                         }
247 
248                                         if (srv_conc.n_waiting == 0) {
249                                                 srv_thread_sleep_delay >>= 1;
250                                         }
251                                 }
...
...
...
281                 sleep_in_us = srv_thread_sleep_delay;
282 
283                 /* Guard against overflow when adaptive sleep delay is on. */
284 
285                 if (srv_adaptive_max_sleep_delay > 0
286                     && sleep_in_us > srv_adaptive_max_sleep_delay) {
287 
288                         sleep_in_us = srv_adaptive_max_sleep_delay;
289                         srv_thread_sleep_delay = static_cast<ulong>(sleep_in_us);
290                 }
291 
292                 os_thread_sleep(sleep_in_us);
293 
294                 trx->op_info = "";
295 
296                 ++n_sleeps;
297 
298                 if (srv_adaptive_max_sleep_delay > 0 && n_sleeps > 1) {
299                         ++srv_thread_sleep_delay;
300                 }

* The amount by which a thread sleeps is controlled by srv_thread_sleep_delay. We start with 10ms and move between 20us - 150ms (or configured value of srv_adaptive_max_sleep_delay). It is incremented if a thread has to sleep more than once (line 299). It is decremented if the thread has to sleep once (line 245). However, it is not decremented if the thread doesn't have to sleep at all. I think we should consider a 'no sleep' as a sign of well oiled system and decrement the count. 

* srv_thread_sleep_delay is reduced to half (line 249) if there are no waiting threads. Typically, in a loaded system we'll have a continuous inflow of requests meaning there always will be some waiting thread i.e.: we're unlikely to hit line 249.

If we consider the above two conditions and assume a loaded server forcing threads to sleep and always having some threads in waiting state we'll end up in a spiral where srv_thread_sleep_delay will keep getting incremented and will never get decremented (or even if decrement happens it will be only for the few cases where n_sleeps == 1). Eventually our sleep time will climb to 150ms. Typically our stay inside InnoDB is much less than 150ms. This can potentially cause serious performance regression when compared with 5.5.

How to repeat:
I've tried simple sysbench in following way:
* 2 instances of sysbench. One running 100 threads doing random inserts and other running 200 threads doing random point select.
* Start with an empty table
* Set innodb_thread_concurrency = 8
* Compare results with that of 5.5

Suggested fix:
I think few things need to be considered:
* How much we sleep should be a random selection from a range
* Currently sleep interval is only function of number of sleeps. I think it should be a function of:
  -- number of waiting threads
  -- srv_thread_concurrency
  -- some heuristic value on how much time we spend inside InnoDB during one dive

So, for example, if we have srv_thread_concurrency = 8 and we have 128 waiting threads and we have somehow calculated that threads are spending 500us inside InnoDB on avg. we can then reason that in worst case the thread we are putting to sleep has to wait (128/8) * 500 microseconds. We can then randomize sleep time between zero and this value to get some staggered middle value.

Hi Inaam, my friend,

How are you ??

Well, I have analysed your report and I have also analysed corresponding code in 5.7 and 8.0. There were lots of changes in the last two mentioned versions, but the logic that you described is still there.

I have, henceforth, decided that this is a performance bug, so I am verifying it as such.