Bug #47768 pthread_cond_timedwait() is broken on windows
Submitted: 1 Oct 2009 17:15  Modified: 18 Dec 2009 23:45
Reporter: Kristofer Pettersson
Status: Closed
Category: MySQL Server: General  Severity: S2 (Serious)
Version: 5.0+  OS: Windows
Assigned to: Kristofer Pettersson  CPU Architecture: Any

[1 Oct 2009 17:15] Kristofer Pettersson
Description:
The pthread_cond_wait() implementations for Windows might deadlock in some rare circumstances.

1) One thread (I) enters a timed wait and, at some point in time, ends up after
   the mutex unlock and before WaitForMultipleObjects(...).
2) Another thread (II) enters pthread_cond_broadcast. It grabs the mutex and
   discovers one waiter. It sets the broadcast event, closes the broadcast
   gate, and then unlocks the mutex.
3) A third thread (III) issues a pthread_cond_signal. It grabs the mutex,
   discovers one waiter, sets the signal event, and then unlocks the mutex.
4) The first thread (I) enters WaitForMultipleObjects, finds that
   the signal object is in a signalled state, and exits the wait.
5) Thread (I) grabs the mutex and checks the result status. The number of waiters
   is decreased and becomes 0. The event returned was a signal event,
   so the broadcast gate isn't opened. The mutex is released.
6) Thread (II) issues a new broadcast. The mutex is acquired, but the number
   of waiters is 0, hence the broadcast gate remains closed.
7) Thread (I) enters the wait again but is blocked by the broadcast gate.
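
The interleaving in steps 1-7 can be replayed deterministically with a small single-threaded model. This is only a sketch: the struct, its field names and the model_* helpers below are hypothetical and are not taken from mysys/my_wincond.c. The Win32 events are modelled as plain flags, and the model assumes that the wait reports the signal event in preference to the broadcast event, as step 4 describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the condition variable state (not the real struct). */
typedef struct {
    int  waiting;          /* threads currently counted as waiters */
    bool signal_event;     /* models the signal event */
    bool broadcast_event;  /* models the broadcast event */
    bool gate_open;        /* broadcast gate; new waiters pass only while open */
} cond_model;

typedef enum { WOKEN_BY_SIGNAL, WOKEN_BY_BROADCAST } wake_reason;

/* A thread tries to start waiting; returns false if the gate would block it. */
static bool model_wait_enter(cond_model *c)
{
    if (!c->gate_open)
        return false;
    c->waiting++;
    return true;
}

/* Broadcast: with at least one waiter, set the event and close the gate. */
static void model_broadcast(cond_model *c)
{
    if (c->waiting > 0) {
        c->gate_open = false;
        c->broadcast_event = true;
    }
}

/* Signal: with at least one waiter, set the signal event. */
static void model_signal(cond_model *c)
{
    if (c->waiting > 0)
        c->signal_event = true;
}

/* Wake-up path using the condition described in step 5 (before the fix). */
static wake_reason model_wait_exit(cond_model *c)
{
    wake_reason reason;

    assert(c->waiting > 0);
    assert(c->signal_event || c->broadcast_event);

    if (c->signal_event) {
        c->signal_event = false;   /* the signal is consumed by this waiter */
        reason = WOKEN_BY_SIGNAL;
    } else {
        reason = WOKEN_BY_BROADCAST;
    }

    c->waiting--;
    if (c->waiting == 0 && reason == WOKEN_BY_BROADCAST) {
        c->broadcast_event = false;
        c->gate_open = true;       /* only reached after a broadcast wake-up */
    }
    return reason;
}

/* Replays steps 1-7; returns true if thread (I) ends up blocked in step 7. */
static bool replay_deadlocks(void)
{
    cond_model c = { 0, false, false, true };

    bool entered = model_wait_enter(&c);    /* step 1 */
    assert(entered);
    (void)entered;
    model_broadcast(&c);                    /* step 2 */
    model_signal(&c);                       /* step 3 */
    wake_reason r = model_wait_exit(&c);    /* steps 4 and 5 */
    assert(r == WOKEN_BY_SIGNAL);
    (void)r;
    model_broadcast(&c);                    /* step 6: no waiters, no effect */
    return !model_wait_enter(&c);           /* step 7 */
}
```

In this model replay_deadlocks() returns true: after step 5 the gate is still closed and the waiter count is 0, so nothing is left that could reopen the gate.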

How to repeat:
Insert a sleep(2) just before WaitForMultipleObjects(..) in pthread_cond_timedwait(..) in mysys/my_wincond.c, then run the attached program.

Suggested fix:
The following change might be enough to resolve the issue:

mysys/my_wincond.c:int pthread_cond_timedwait(..)

- if (cond->waiting == 0 && result == (WAIT_OBJECT_0+BROADCAST))
+ if (cond->waiting == 0) 

It should be safe to reset the broadcast gate if there are no more waiters after the last exit even if the trigger event is a signal.
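
The effect of the one-line change can be seen by comparing the two conditions in isolation. In this sketch the two predicate functions and the WAKE_* constants are hypothetical stand-ins; in the real code the value compared is the WaitForMultipleObjects result.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the possible wait results. */
enum wake_result { WAKE_SIGNAL, WAKE_BROADCAST, WAKE_TIMEOUT };

/* Condition before the change: only a broadcast wake-up reopens the gate. */
static bool reopens_gate_before(int waiting, enum wake_result result)
{
    return waiting == 0 && result == WAKE_BROADCAST;
}

/* Condition after the change: the last waiter always reopens the gate. */
static bool reopens_gate_after(int waiting, enum wake_result result)
{
    (void)result;  /* the wake-up reason no longer matters */
    return waiting == 0;
}

/* Counts the (waiting, result) combinations where the two conditions disagree. */
static int count_differences(int max_waiting)
{
    int differences = 0;

    assert(max_waiting >= 0);
    for (int waiting = 0; waiting <= max_waiting; waiting++) {
        for (int r = WAKE_SIGNAL; r <= WAKE_TIMEOUT; r++) {
            enum wake_result result = (enum wake_result)r;
            if (reopens_gate_before(waiting, result) !=
                reopens_gate_after(waiting, result))
                differences++;
        }
    }
    return differences;
}
```

Under these assumptions the two conditions disagree only when the last waiter leaves for a reason other than a broadcast, which is the case hit in step 5 of the description.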
[1 Oct 2009 17:19] Kristofer Pettersson
bug47768.cpp

Attachment: pthread_test2.cpp (text/plain), 3.55 KiB.

[1 Oct 2009 17:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85423

3144 Kristofer Pettersson	2009-10-01
      Bug#47768 pthread_cond_timedwait() is broken on windows
      
      The pthread_cond_wait() implementations for Windows might
      deadlock in some rare circumstances.
      
      1) One thread (I) enters a timed wait and, at some point
         in time, ends up after the mutex unlock and before
         WaitForMultipleObjects(...).
      2) Another thread (II) enters pthread_cond_broadcast.
         It grabs the mutex and discovers one waiter. It sets
         the broadcast event, closes the broadcast gate, and
         then unlocks the mutex.
      3) A third thread (III) issues a pthread_cond_signal.
         It grabs the mutex, discovers one waiter, sets the
         signal event, and then unlocks the mutex.
      4) The first thread (I) enters WaitForMultipleObjects,
         finds that the signal object is in a signalled
         state, and exits the wait.
      5) Thread (I) grabs the mutex and checks the result status.
         The number of waiters is decreased and becomes 0.
         The event returned was a signal event, so the
         broadcast gate isn't opened. The mutex is released.
      6) Thread (II) issues a new broadcast. The mutex is
         acquired, but the number of waiters is 0, hence
         the broadcast gate remains closed.
      7) Thread (I) enters the wait again but is blocked by
         the broadcast gate.
      
      This fix resolves the above issue by always resetting the
      broadcast gate when there are no more waiters in the queue.
     @ mysys/my_wincond.c
        * Always reset the broadcast gate if there are no more waiters left.
[6 Oct 2009 2:17] Roel Van de Paar
Resolved stacktrace from bug_43758 (Ricardo Gomez) which should show the same issue. (Fedora Linux 2.6.27.5-117.fc10.x86_64)

Attachment: bug_43758_resolved_stacktrace_Ricardo_Gomez.txt (text/plain), 24.56 KiB.

[6 Oct 2009 2:21] Roel Van de Paar
Customer verified that they no longer see the hang when FLUSH TABLES is not executed.
[6 Oct 2009 2:23] Roel Van de Paar
Kristofer, Davi, please check the newly uploaded backtrace, which should show the same issue, but this time not on Windows but on Fedora Linux...
[6 Oct 2009 7:34] Kristofer Pettersson
Roel: This bug is very specific to the Windows implementation of pthreads, it has nothing to do with Linux. The uploaded stack trace also unfortunately gives us very little to go on and I think the situation should be investigated further. Are the physical disks working as expected? Is there really a hang in fsync()? Please open yet another bug for the new unknown issue.
[6 Oct 2009 7:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85827

2813 Kristofer Pettersson	2009-10-06
      Bug#47768 pthread_cond_timedwait() is broken on windows
      
     @ mysys/my_wincond.c
        * Always reset the broadcast gate if there are no more waiters left.
[6 Oct 2009 10:01] Bugs System
Pushed into 5.1.40 (revid:joro@sun.com-20091006095946-9vv2qal7rlot32r4) (version source revid:joro@sun.com-20091006095946-9vv2qal7rlot32r4) (merge vers: 5.1.40) (pib:11)
[6 Oct 2009 14:12] Ricardo Gomez
Hi Roel, Kristofer.
First of all, thanks for your collaboration. I would like to know what I can do to help fix the problem. I don't understand what to do, or what the stack trace that Roel sent me means. I don't know whether I have to open a new bug, or whether my problem may be fixed in this bug or in bug 43758.
Thanks for helping me.
Thank you very much.
[6 Oct 2009 23:57] Roel Van de Paar
Hi Kristofer,

> This bug is very specific to the Windows implementation of pthreads, it has nothing to do with Linux.

Understood. Interestingly, I see references to aio in the Fedora stack trace. I was previously under the impression that aio was only Windows related, but I see that there's a Linux implementation as well (http://lse.sourceforge.net/io/aio.html).

Hi Ricardo,

> I would like to know what I can do to help fix the problem.

As per the notes from Kristofer, this looks like a completely separate situation.

I have logged a new bug with some questions for you here:
http://bugs.mysql.com/bug.php?id=47768

Could you please follow up on this new bug/those questions?
[6 Oct 2009 23:59] Roel Van de Paar
Ricardo, correction, see bug #47876 instead.
[12 Oct 2009 15:55] Paul DuBois
Noted in 5.1.40 changelog.

The pthread_cond_wait() implementations for Windows could deadlock in
some rare circumstances. 

Setting report to NDI pending push into 5.5.x.
[22 Oct 2009 6:34] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha) (pib:13)
[22 Oct 2009 7:07] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version source revid:alik@sun.com-20091014071749-j0wmq9echal73tpe) (merge vers: 5.5.0-beta) (pib:13)
[22 Oct 2009 19:53] Paul DuBois
Noted in 5.5.0, 6.0.14 changelogs.
[22 Oct 2009 23:06] Roel Van de Paar
Summary Overview: 

This bug was fixed in: 5.1.40, 5.5.0, 6.0.14

Workarounds: none (except for upgrade)
[15 Nov 2009 17:55] Taylan Karaoglu
This bug is also occurring on our server. MySQL server version is 5.1.40 GPL community. MyISAM tables, 1K queries per second.

http://bugs.mysql.com/bug.php?id=43758 shows the same issues as this bug report and is also referenced here.
[18 Dec 2009 10:31] Bugs System
Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)
[18 Dec 2009 10:47] Bugs System
Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)
[18 Dec 2009 11:02] Bugs System
Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)
[18 Dec 2009 11:16] Bugs System
Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)