MySQL Bugs: #47768: pthread_cond_timedwait() is broken on windows

Bug #47768	pthread_cond_timedwait() is broken on windows
Submitted:	1 Oct 2009 17:15	Modified:	18 Dec 2009 23:45
Reporter:	Kristofer Pettersson
Status:	Closed
Category:	MySQL Server: General	Severity:	S2 (Serious)
Version:	5.0+	OS:	Windows
Assigned to:	Kristofer Pettersson	CPU Architecture:	Any

Description:
The pthread_cond_wait implementations for Windows can deadlock in rare circumstances:

1) One thread (I) enters a timed wait and ends up between the mutex
   unlock and the call to WaitForMultipleObjects(...).
2) Another thread (II) enters pthread_cond_broadcast. It grabs the mutex and
   discovers one waiter. It sets the broadcast event, closes the broadcast
   gate, then unlocks the mutex.
3) A third thread (III) issues a pthread_cond_signal. It grabs the mutex,
   discovers one waiter, sets the signal event, then unlocks the mutex.
4) The first thread (I) enters WaitForMultipleObjects, finds that
   the signal object is in a signalled state, and exits the wait.
5) Thread (I) grabs the mutex and checks the result status. The number of
   waiters is decremented and becomes 0. The event returned was a signal
   event, so the broadcast gate isn't opened. The mutex is released.
6) Thread (II) issues a new broadcast. The mutex is acquired, but the number
   of waiters is 0, hence the broadcast gate remains closed.
7) Thread (I) enters the wait again but is blocked by the broadcast gate.
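
The interleaving above can be replayed in a small single-threaded model. This is only a sketch for following the steps: the struct, its field names, and the model_* helpers are hypothetical and are not taken from mysys/my_wincond.c, and the Windows event objects are reduced to plain booleans. It assumes, as in step 4, that the signal event is the one reported when both events are set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the condition variable state (not the real struct). */
enum wake_reason { WOKE_BY_SIGNAL, WOKE_BY_BROADCAST };

struct cond_model {
  int  waiting;          /* number of registered waiters */
  bool signal_event;     /* consumed by the waiter that sees it */
  bool broadcast_event;  /* stays set until explicitly reset */
  bool gate_open;        /* new waiters may only enter while this is true */
};

/* Waiter, first half: pass the gate and register. Returns false if the
   caller would block on the broadcast gate. */
static bool model_enter_wait(struct cond_model *c)
{
  if (!c->gate_open)
    return false;
  c->waiting++;
  return true;
}

/* Waiter, second half: the wait returns and the bookkeeping runs with the
   original (pre-fix) condition. */
static enum wake_reason model_leave_wait(struct cond_model *c)
{
  enum wake_reason reason;
  assert(c->signal_event || c->broadcast_event);
  if (c->signal_event) {           /* step 4: the signal event is reported */
    c->signal_event = false;
    reason = WOKE_BY_SIGNAL;
  } else {
    reason = WOKE_BY_BROADCAST;
  }
  c->waiting--;
  if (c->waiting == 0 && reason == WOKE_BY_BROADCAST) {
    c->broadcast_event = false;
    c->gate_open = true;           /* only reached for a broadcast wake-up */
  }
  return reason;
}

static void model_broadcast(struct cond_model *c)
{
  if (c->waiting > 0) {            /* with no waiters, nothing is touched */
    c->gate_open = false;
    c->broadcast_event = true;
  }
}

static void model_signal(struct cond_model *c)
{
  if (c->waiting > 0)
    c->signal_event = true;
}

/* Replays steps 1-7. Returns true if thread (I) ends up blocked on the gate. */
static bool replay_report_trace(void)
{
  struct cond_model c = { 0, false, false, true };
  bool entered = model_enter_wait(&c);  /* step 1 */
  assert(entered);
  model_broadcast(&c);                  /* step 2: gate closed */
  model_signal(&c);                     /* step 3 */
  model_leave_wait(&c);                 /* steps 4-5: woken by signal */
  model_broadcast(&c);                  /* step 6: no waiters, gate untouched */
  return !model_enter_wait(&c);         /* step 7 */
}
```

In this model replay_report_trace() returns true: after step 5 the waiter count is 0 and the gate is still closed, and no later broadcast reopens it because broadcasts do nothing when there are no waiters.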

How to repeat:
Insert a sleep(2) just before WaitForMultipleObjects(..) in mysys/my_wincond.c:int pthread_cond_timedwait(..), then run the attached program.

Suggested fix:
The following change might be enough to resolve the issues:

mysys/my_wincond.c:int pthread_cond_timedwait(..)

- if (cond->waiting == 0 && result == (WAIT_OBJECT_0+BROADCAST))
+ if (cond->waiting == 0) 

It should be safe to reset the broadcast gate if there are no more waiters after the last exit even if the trigger event is a signal.
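
To make the effect of the one-line change easier to see, the two conditions can be compared side by side. The following is a minimal sketch, not the patched code itself: the function names are invented for illustration, and MODEL_WAIT_OBJECT_0 is a stand-in so the example compiles without windows.h. SIGNAL and BROADCAST are assumed to be the event indexes 0 and 1.

```c
#include <assert.h>
#include <stdbool.h>

enum { SIGNAL = 0, BROADCAST = 1 };  /* assumed event indexes */
#define MODEL_WAIT_OBJECT_0 0        /* stand-in for WAIT_OBJECT_0 */

/* Condition before the change: reopen the gate only when the last waiter
   was woken by the broadcast event. */
static bool reopen_gate_before_fix(unsigned waiting, int result)
{
  return waiting == 0 && result == (MODEL_WAIT_OBJECT_0 + BROADCAST);
}

/* Condition after the change: reopen the gate whenever the last waiter
   leaves, regardless of which event woke it. */
static bool reopen_gate_after_fix(unsigned waiting, int result)
{
  (void) result;
  return waiting == 0;
}

/* Counts the (waiting, result) combinations on which the two conditions
   disagree, for waiting in {0, 1} and result in {signal, broadcast}. */
static int count_disagreements(void)
{
  int n = 0;
  for (unsigned waiting = 0; waiting <= 1; waiting++) {
    for (int ev = SIGNAL; ev <= BROADCAST; ev++) {
      int result = MODEL_WAIT_OBJECT_0 + ev;
      /* Neither version reopens the gate while waiters remain. */
      assert(waiting == 0 || !reopen_gate_after_fix(waiting, result));
      if (reopen_gate_before_fix(waiting, result) !=
          reopen_gate_after_fix(waiting, result))
        n++;
    }
  }
  return n;
}
```

Over these four combinations the two conditions differ in exactly one case: the last waiter leaving after a signal wake-up, which is the situation in step 5 of the description.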

bug47768.cpp

Attachment: pthread_test2.cpp (text/plain), 3.55 KiB.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85423

3144 Kristofer Pettersson	2009-10-01
      Bug#47768 pthread_cond_timedwait() is broken on windows
      
      The pthread_cond_wait implementations for Windows can
      deadlock in rare circumstances.
      
      1) One thread (I) enters a timed wait and ends up
         between the mutex unlock and the call to
         WaitForMultipleObjects(...).
      2) Another thread (II) enters pthread_cond_broadcast.
         It grabs the mutex and discovers one waiter. It sets
         the broadcast event, closes the broadcast gate,
         then unlocks the mutex.
      3) A third thread (III) issues a pthread_cond_signal.
         It grabs the mutex, discovers one waiter, sets the
         signal event, then unlocks the mutex.
      4) The first thread (I) enters WaitForMultipleObjects,
         finds that the signal object is in a signalled
         state, and exits the wait.
      5) Thread (I) grabs the mutex and checks the result status.
         The number of waiters is decremented and becomes
         0. The event returned was a signal event, so the
         broadcast gate isn't opened. The mutex is released.
      6) Thread (II) issues a new broadcast. The mutex is
         acquired, but the number of waiters is 0, hence
         the broadcast gate remains closed.
      7) Thread (I) enters the wait again but is blocked by
         the broadcast gate.
      
      This fix resolves the above issue by always resetting the
      broadcast gate when there are no more waiters in the queue.
     @ mysys/my_wincond.c
        * Always reset the broadcast gate if there are no more waiters left.

Resolved stacktrace from bug_43758 (Ricardo Gomez) which should show the same issue. (Fedora Linux 2.6.27.5-117.fc10.x86_64)

Attachment: bug_43758_resolved_stacktrace_Ricardo_Gomez.txt (text/plain), 24.56 KiB.

Customer verified that they no longer see the hang when FLUSH TABLES is not executed.

Kristofer, Davi, please check the newly uploaded backtrace, which should show the same issue, this time not on Windows but on Fedora Linux...

Roel: This bug is very specific to the Windows implementation of pthreads; it has nothing to do with Linux. Unfortunately, the uploaded stack trace also gives us very little to go on, and I think the situation should be investigated further. Are the physical disks working as expected? Is there really a hang in fsync()? Please open a separate bug for the new, unknown issue.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85827

2813 Kristofer Pettersson	2009-10-06
      Bug#47768 pthread_cond_timedwait() is broken on windows
      
      This fix resolves the above issue by always resetting the
      broadcast gate when there are no more waiters in the queue.
     @ mysys/my_wincond.c
        * Always reset the broadcast gate if there are no more waiters left.

Pushed into 5.1.40 (revid:joro@sun.com-20091006095946-9vv2qal7rlot32r4) (version source revid:joro@sun.com-20091006095946-9vv2qal7rlot32r4) (merge vers: 5.1.40) (pib:11)

Hi, Roel, Kristofer.
First of all, thanks for your collaboration. I would like to know what I can do to help fix the problem. I don't understand what to do, or what the stack trace that Roel sent me means. I don't know whether I should open a new bug, or whether my problem may be fixed in this bug or in bug 43758.
Thanks for helping me.
Thank you very much.

Hi Kristofer,

> This bug is very specific to the Windows implementation of pthreads, it has nothing to do with Linux.

Understood. Interestingly, I see references to aio in the Fedora stack trace. I was previously under the impression that aio was Windows-only, but I see that there is a Linux implementation as well (http://lse.sourceforge.net/io/aio.html).

Hi Ricardo,

> I would like to know what I can do to help fix the problem.

As per the notes from Kristofer, this looks like a completely separate situation.

I have logged a new bug with some questions for you here:
http://bugs.mysql.com/bug.php?id=47768

Could you please follow up on this new bug/those questions?

Ricardo, correction, see bug #47876 instead.

Noted in 5.1.40 changelog.

The pthread_cond_wait() implementations for Windows could deadlock in
some rare circumstances. 

Setting report to NDI pending push into 5.5.x.

Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha) (pib:13)

Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version source revid:alik@sun.com-20091014071749-j0wmq9echal73tpe) (merge vers: 5.5.0-beta) (pib:13)

Noted in 5.5.0, 6.0.14 changelogs.

Summary Overview: 

This bug was fixed in: 5.1.40, 5.5.0, 6.0.14

Workarounds: none (except for upgrade)

This bug is also occurring on our server. The MySQL server version is 5.1.40 GPL Community, with MyISAM tables and about 1K queries per second.

http://bugs.mysql.com/bug.php?id=43758 describes the same issue as this bug report and is also referenced here.

Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)

Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)

Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)

Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)