MySQL Bugs: #36296: Falcon: commitNoUpdates is sleeping too often

Bug #36296	Falcon: commitNoUpdates is sleeping too often
Submitted:	23 Apr 2008 16:56	Modified:	8 Jan 2009 10:17
Reporter:	Kevin Lewis	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0.4	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Xiang, Kai [mailto:kai.xiang@intel.com]  has detected that Transaction::commitNoUpdate() calls the sleep(1) too often.  He writes

From: Xiang, Kai [mailto:kai.xiang@intel.com] 
Sent: Wednesday, April 23, 2008 3:42 AM
To: Kevin Lewis; Hakan Kuecuekyilmaz; Jim Starkey; Kelly Long
Cc: Duan, Jungle; Hu, Xuekun; Yu, Zhidong
Subject: Questions about Transaction::commitNoUpdates

Hi all:

When we are looking at why 8 threads can’t saturate CPU running Sysbench, we found the working threads often go to sleep.

(We use gdb to attached mysqld in the middle of running, and try this several times, which help us know what's happening inside, And see sleep in working threads pretty often)

Call stack looks like:

--STRIPPED--
#0  0x0000003bd320a567 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00000000008e840d in Synchronize::sleep (this=0x17496a88, milliseconds=1,
    callersMutex=0x0) at Synchronize.cpp:177
#2  0x00000000008e8203 in Synchronize::sleep (this=0x17496a88, milliseconds=1)
    at Synchronize.cpp:135
#3  0x000000000082422c in Transaction::commitNoUpdates (this=0x2aaac3de5320)
    at Transaction.cpp:362
#4  0x00000000008244fe in Transaction::commit (this=0x2aaac3de5320) at Transaction.cpp:243
#5  0x000000000084346b in Connection::commit (this=0x17499d00) at Connection.cpp:269
#6  0x00000000007fbe29 in StorageConnection::commit (this=0x17499c70)
    at StorageConnection.cpp:146
#7  0x00000000007fb8b3 in StorageConnection::endImplicitTransaction (this=0x17499c70)
    at StorageConnection.cpp:327
#8  0x00000000007f74eb in StorageInterface::external_lock (this=0x2aae08005160,
    thd=0x9a28390, lock_type=2) at ha_falcon.cpp:1844
--STRIPPED--

Refer to the source code (Transaction.cpp:362), I found there is a sleep inside dependency checking loop; we suspect this could result in the idle time in system.

Base on this suspect, we do experiments as below:
1) Comments the sleep functions, mysqld crash.
2) Change this sleep time from 1ms to 2ms. We see a system level idle time increase, see below charts.
   ------------------------------------------------------------------------
   |   8 threads |1ms (89608.11 trans/s) | 2ms (87231.69 trans/s)
   ------------------------------------------------------------------------ 
   | avg User%   |      70.33%           |        68.24%
   ------------------------------------------------------------------------
   | avg Kernel% |      18.96%           |        18.73%
   ------------------------------------------------------------------------
 =>| avg Idle%   |      10.71%           |        13.03%
   ------------------------------------------------------------------------
From this, we may say there is something related to the idle time and the sleep.

Code reference:
--STRIPPED--
for (int n = 0; dependencies && n < 10; ++n)
{
    transactionManager->expungeTransaction(this);

    if (!dependencies)
            break;

    // There is a tiny race condition between another thread releasing a dependency and
    // TransactionManager::expungeTransaction doing the same thing.  So spin.

    if (thread)
           {
            thread->sleep(wait);
            wait <<= 1;
            }
    else
           thread = Thread::getThread("Transaction::commitNoUpdates");
}
--STRIPPED--

How to repeat:
Sysbench according to Xiang, Kai [mailto:kai.xiang@intel.com] 

Suggested fix:
Reduce the need to call sleep(1) when another thread still has a dependency.

The affect of this bug is only performance. Thus the low Triage values.  However, Bug#34602 is more critical and is a result of this sleep occurring 10 times in a row.  Clearly, this non-deterministic way of handling this race condition is not the best.  The fix for this bug will probably also fix Bug#34602.

Given Intel has seen contention, the probability of race is not tiny. And spinning does not seem proper implemented here - without locks or atomic operations

Changing to Priority 4.  

>Jim Starkey writes;
>This isn't a bug. There shouldn't be an open bug against it.
>It's a hard problem. The code works. There are mixes where 
>it isn't optimal, but there hasn't been a demonstration 
>that any real-life transaction mix makes this a major 
>problem or even a minor problem.
>
>I'm going to spend a couple of days rethinking the problem, 
>but there isn't an guarantee or existence proof of a better 
>solution. I hope to find one, but it's only hope.
>
>Please put this idiocy in perspective. It isn't a bug.

This 'bug' is really an opportunity to improve performance.

But I would argue that the related Bug#34602 is more important since it is a crash.  If there is no way to make this deterministic without an overall performance cost, then we should close this, and reduce the chance of Bug#34602 occuring by increasing the loop count.  But I don't like that solution.

See bug 34602

This fix is in version 6.0.6  The for loop in Transaction commitnoUpdates was eliminated.

Internal change only. No documentation needed.