Bug #44276 NdbTick assert() in Ndbif.cpp
Submitted: 15 Apr 2009 0:52 Modified: 14 May 2009 13:32
Reporter: Stewart Smith Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version: OS:Windows
Assigned to: jack andrews CPU Architecture:Any

[15 Apr 2009 0:52] Stewart Smith
Description:
I hit an assert() in mysqld today:

From src/ndbapi/Ndbif.cpp:
void
Ndb::check_send_timeout()
{
  Uint32 timeout = theImpl->m_transporter_facade->m_waitfor_timeout;
  NDB_TICKS current_time = NdbTick_CurrentMillisecond();
  assert(current_time >= the_last_check_time);

This assert() can be hit on Windows.

This is because of our NdbTick_CurrentMillisecond() implementation there.

From storage/ndb/src/common/portlib/win32/NdbTick.c:

int
NdbTick_CurrentMicrosecond(NDB_TICKS * secs, Uint32 * micros)
{
  LARGE_INTEGER liCount, liFreq;
  QueryPerformanceCounter(&liCount);
  QueryPerformanceFrequency(&liFreq);
  *secs = liCount.QuadPart / liFreq.QuadPart;
  liCount.QuadPart -= *secs * liFreq.QuadPart;
  *micros = (Uint32)((liCount.QuadPart*1000000) / liFreq.QuadPart);
  return 0;
}

It turns out that QueryPerformance counter can:
- on AMD systems, vary between CPUs
- in VirtualBox (and wlad says VMWare), go backwards.
So this should be solved one way or the other. the IRC chat in #wtf
included below for completeness (with a bit of discussion):

Apr 15 01:07:05 <stewart>       anyone ever seen the
QueryPerformanceCounter go
backwards?
Apr 15 01:07:13 <stewart>       iggy, reggie: ^ ?
Apr 15 01:07:17 <stewart>       wlad: ?
Apr 15 01:07:39 <iggy>  stewart: Please explain more.
Apr 15 01:07:40 <wlad>  stewart, yes. I've seen that
Apr 15 01:07:45 <iggy>  I think it can happen.
Apr 15 01:07:46 <stewart>       wlad: ooerrr... cool :)
Apr 15 01:07:50 <wlad>  on AMD multicores
Apr 15 01:07:57 <stewart>       wlad: add "inside VirtualBox" to that
Apr 15 01:08:15 <wlad>  inside the VMWare :)
Apr 15 01:08:35 <stewart>       wlad: so... i think this means our
NdbTick imple
mentation is wrong...
Apr 15 01:08:55 <wlad>  stewart, rather QueryPerformanceCounter is
wrong..
Apr 15 01:09:01 <stewart>       wlad: or... that assert is bad.... but i
wouldn't put money on it not be relied on elsewhere...
Apr 15 01:09:24 <reggie>        stewart, docs say that with multiple
cores you can get different values when it hits different cores or cpus
Apr 15 01:09:31 <stewart>       darn...
Apr 15 01:09:35 <reggie>        this can make it seem to go backward
Apr 15 01:09:43 <stewart>       exactly the assert i hit
Apr 15 01:09:49 <reggie>        you need to use SetThreadAffinity to
force it
Apr 15 01:09:58 <stewart>       kind of ugly :)
Apr 15 01:10:12 <wlad>  MS has had some articles on that. Harware
vendors are tricky and sometimes use RDTSC as timer. to improve
performance
Apr 15 01:10:13 <reggie>        SetThreadAffinityMask rather
Apr 15 01:10:30 <stewart>       probably either need to remove assert,
or use more reliable tick....
Apr 15 01:10:30 <wlad>  there was a boot parameter that could prevent
it.
Apr 15 01:10:59 <reggie>        stewart, is this in a multi-cpu vm?
Apr 15 01:11:32 <stewart>       reggie: no.
Apr 15 01:11:50 <stewart>       afaik virtualbox only supports 1cpu
Apr 15 01:13:33 <wlad>  stewart: http://support.microsoft.com/kb/895980
is the KB article. The essence of their advice is to add /usepmtimer
parameter to the boot.ini
Apr 15 01:16:39 <wlad>  stewart: more reliable but less exact counter is
GetTickCount. The ugly thing about it , it wraps around in 49.7 days
Apr 15 01:17:29 <stewart>       hrrm...
Apr 15 01:17:37 <stewart>       perhaps will have to take to list
Apr 15 01:17:43 <stewart>       see what others thing
Apr 15 01:17:47 <stewart>       s/thing/think/
Apr 15 01:17:58 *       stewart doesn't too much mind it going backwards
a bit...
Apr 15 01:18:03 <stewart>       just have to audit some code
Apr 15 01:18:30 <wlad>  or GetSystemTimeAsFileTime , uses the same timer
as GetTickCount, does not wrap

How to repeat:
run AMD, VirtualBox, VMWare and be "lucky"

Suggested fix:
change to GetSystemTimeAsFileTime
[15 Apr 2009 10:42] jack andrews
<magnus>	yes, 'GetSystemTimeAsFileTime' - you can look in mysys/my_getsystime.c for how it's used there.
[15 Apr 2009 12:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72152

2868 jack andrews	2009-04-15
      bug#44276  NdbTick assert() in Ndbif.cpp
[16 Apr 2009 11:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72269

2868 jack andrews	2009-04-16
      bug#44276  NdbTick assert() in Ndbif.cpp
[16 Apr 2009 11:17] Bugs System
Pushed into 5.1.32-ndb-7.0.6 (revid:jack@sun.com-20090416111503-00o2025g2cav4s78) (version source revid:jack@sun.com-20090416103049-hn283mtm4t2ca7v3) (merge vers: 5.1.32-ndb-7.0.6) (pib:6)
[16 Apr 2009 11:40] Magnus Blåudd
Very good!
[16 Apr 2009 19:10] Jon Stephens
Need to know for documentation:

1. Do we support running Cluster inside a VM for production? (AFAIK we do not.)

2. Could a user encounter this issue without running Cluster in a VM? If so, how?

Thanks.
[11 May 2009 12:52] Magnus Blåudd
Changed to use a more stable function to retrieve current time
[14 May 2009 13:32] Jon Stephens
Documented bugfix in the NDB-7.0.6 changelog as follows:

        An NDB internal timing function did not work correctly on Windows and
        could cause mysqld to fail on some AMD processors or when running inside
        a virtual machine.