Bug #29560 InnoDB >= 5.0.30 hangs on adaptive hash rw-lock 'waiting for an X-lock'
Submitted: 5 Jul 2007 2:06 Modified: 24 Sep 2008 16:32
Reporter: Yasufumi Kinoshita Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.0.41 OS:Any (Multi CPU)
Assigned to: Timothy Smith CPU Architecture:Any
Tags: Contribution

[5 Jul 2007 2:06] Yasufumi Kinoshita
Description:
In the both of mutex and lock implementations, there is a little possibility that
"releaser thread" calls os_event_set() before "waiter thread" calls os_event_wait().
In the many of the cases, the other threads try to enter or lock the same mutex or lock
and relief the "missing situations".

But, if unfortunately, there are no "relief threads"
or the "waiter thread" owns btr_search_latch and the all of other threads are influenced and cannot relief,
this "waiter thread" remains paused.
In the worst case, InnoDB may crush with assertion failure..

In my experiences,
these phenomena seem to be caused often on x86_64 architecture,
and the following suggestion seems to be effective to prevent the pauses.

This may be associated with Bug#19081 and Bug#20358 ?

How to repeat:
execute heavy and very long benchmark and watching the throughputs

Suggested fix:
I think it is very difficult to prevent the "missing situations" absolutely.
So we should make master thread to relief this situations.

sync0arr.c:
         	cell = sync_array_get_nth_cell(sync_primary_wait_array, i);
 
 		if (cell->wait_object != NULL && cell->waiting
+		    && difftime(time(NULL), cell->reservation_time) > 10) {
+			if (sync_arr_cell_can_wake_up(cell)) {
+				if (cell->request_type == SYNC_MUTEX) {
+					mutex_t*        mutex;
+
+					mutex = cell->wait_object;
+					os_event_set(mutex->event);
+				} else {
+					rw_lock_t*      lock;
+
+					lock = cell->wait_object;
+					os_event_set(lock->event);
+				}
+			}
+		}
+
+		if (cell->wait_object != NULL && cell->waiting
 		    && difftime(time(NULL), cell->reservation_time) > 240) {
 			fputs("InnoDB: Warning: a long semaphore wait:\n",
 				stderr);
[20 Sep 2007 12:11] Heikki Tuuri
Yasufumi,

if we look at the mutex reservation in 5.0, it goes like this:

If no success at first, do:

  sync_array_reserve_cell(sync_primary_wait_array, mutex,
                          SYNC_MUTEX, file_name, line, &index);
...
  /* The memory order of the array reservation and the change in the
  waiters field is important: when we suspend a thread, we first
  reserve the cell and then set waiters field to 1. When threads are
  released in mutex_exit, the waiters field is first set to zero and
  then the event is set to the signaled state. */

  mutex_set_waiters(mutex, 1);
...
  /* Try to reserve still a few times */
  for (i = 0; i < 4; i++)
  {
    if (mutex_test_and_set(mutex) == 0)
  ...

  }

If no success, do:

  sync_array_wait_event(sync_primary_wait_array, index);

################
Mutex exit works this way:

mutex_exit(
/*=======*/
        mutex_t*        mutex)  /* in: pointer to mutex */
{
...
        if (mutex_get_waiters(mutex) != 0) {

                mutex_signal_object(mutex);
        }

############

sync_array_reserve_cell() does like this:

                        /* Make sure the event is reset */
                        sync_cell_event_reset(type, object);

and sync_array_wait_event() does:

        os_event_wait(event);

##########

It is no problem if the 'releaser thread' calls os_event_set() before our thread calls os_event_wait(). In that case, our thread sees the event as 'set', and proceeds.

Regards,

Heikki
[20 Sep 2007 12:14] Heikki Tuuri
Forgot to add that signal_object() does this:

mutex_signal_object(
/*================*/
        mutex_t*        mutex)  /* in: mutex */
{
        mutex_set_waiters(mutex, 0);

        /* The memory order of resetting the waiters field and
        signaling the object is important. See LEMMA 1 above. */
        os_event_set(mutex->event);
        sync_array_object_signalled(sync_primary_wait_array);
}
[20 Sep 2007 14:45] Yasufumi Kinoshita
Heikki,

Yes, I grasp these mechanisms.
But I'm afraid of the slight possibility
that context switch or the other unfortunate delay may occur 'just before' os_event_wait()
and mutex_signal_object() is executed during this moment.
The more we use CPUs, and the heavier the load is, I think this possibility become stronger.
We cannot assure the execution order of different threads at all without protecting by semaphores.

Acutually, this suggestion seems to prevent the pauses...

Regards,

Yasufumi
[20 Sep 2007 15:03] Heikki Tuuri
Yasufumi,

if the processor works correctly, then it should never happen that a thread gets hung. Even if there is a context switch, the OS should reschedule the thread at some time, and it should see the event in the 'set' state.

But I suspect that a multiprocessor Opteron does not work correctly... Then your patch, which wakes up threads that have been waiting for 10 seconds and which can be woken up, should help.

But some of the hangs in http://bugs.mysql.com/bug.php?id=20358 show some thread holding an S-latch on the adaptive hash index rw-lock, though no one should have it. Those cannot be explained by a thread lingering in the wait array.

Regards,

Heikki
[20 Sep 2007 16:38] Yasufumi Kinoshita
Heikki,

Thank you for your kindness.
I entirely understand the intention of InnoDB source.

However unfortunatery, I have encountered the problem on 64bit Xeon too (openSUSE 10.2)...
And I also experienced the such unexplainable situation.
This suggestion also seems to prevent the situation,
though I cannnot explain it...

Hmm.. Now I suspect pthread_cond_broadcast() in glibc at x86_64..

Regards,

Yasufumi
[21 Sep 2007 4:40] Yasufumi Kinoshita
Heikki,

This "missing situations" may be caused at kernel-level.
A context switch or the other unfortunate delay may occur 'just before' "futex_wait syscall",
and "futex_wake syscall" is executed during this moment....

Anyway, we should always care to the slight possibility that
"pthread_cond_broadcast() doesn't wake the all of waiting threads"

Regards,

Yasufumi
[28 Sep 2007 4:44] Yasufumi Kinoshita
Heikki,

I think it is the glibc's bug.
And it may cause on the both of IA-32 and x86_64 SMP Linux.

I made an easy test program by patchwork from InnoDB source.
In this program, pthread_cond_broadcast() is never called before pthread_cond_wait(), in the same way as InnoDB.
But it may certainly freeze on the most of SMP Linux systems...
(I tested on the both of [32bit Xeon x 2] and [64bit Xeon(quad core) x 2])
This shows that pthread_cond_broadcast() sometimes overtakes pthread_cond_wait() !

Indeed, this may not be the bug of InnoDB.
But I think we should care to the bug by like my suggestion..

Regards,

Yasufumi
[28 Sep 2007 4:46] Yasufumi Kinoshita
It test pthread_cond_wait() and pthread_cond_broadcast()

Attachment: pthread_cond_broadcast_test.c (application/octet-stream, text), 4.30 KiB.

[28 Sep 2007 7:22] Yasufumi Kinoshita
example.

kinoshita@dbhpsv03:~> gcc -g -O -o pthread_cond_broadcast_test pthread_cond_broadcast_test.c -lpthread
pthread_cond_broadcast_test.c: In function 'main':
pthread_cond_broadcast_test.c:58: warning: incompatible implicit declaration of built-in function 'malloc'
pthread_cond_broadcast_test.c:85: warning: incompatible implicit declaration of built-in function 'exit'
kinoshita@dbhpsv03:~> uname -a
Linux dbhpsv03 2.6.18.2-34-default #1 SMP Mon Nov 27 11:46:27 UTC 2006 i686 i686 i386 GNU/Linux
kinoshita@dbhpsv03:~> ./pthread_cond_broadcast_test
counter: 0
counter: 593323
counter: 1193051
counter: 1838001
counter: 2523016
counter: 3106217
counter: 3699310
counter: 3751617
counter: 3751617
signals: 1790305
freezed..
[28 Sep 2007 16:58] Heikki Tuuri
os0sync.c:

/**************************************************************
Sets an event semaphore to the signaled state: lets waiting threads
proceed. */

void
os_event_set(
/*=========*/
        os_event_t      event)  /* in: event to set */
{
#ifdef __WIN__
        ut_a(event);
        ut_a(SetEvent(event->handle));
#else
        ut_a(event);

        os_fast_mutex_lock(&(event->os_mutex));

        if (event->is_set) {
                /* Do nothing */
        } else {
                event->is_set = TRUE;
                event->signal_count += 1;
                ut_a(0 == pthread_cond_broadcast(&(event->cond_var)));
        }

        os_fast_mutex_unlock(&(event->os_mutex));
#endif
}

...

/**************************************************************
Waits for an event object until it is in the signaled state. If
srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS this also exits the
waiting thread when the event becomes signaled (or immediately if the
event is already in the signaled state). */

void
os_event_wait(
/*==========*/
        os_event_t      event)  /* in: event to wait */
{
#ifdef __WIN__
        DWORD   err;

        ut_a(event);

        /* Specify an infinite time limit for waiting */
        err = WaitForSingleObject(event->handle, INFINITE);

        ut_a(err == WAIT_OBJECT_0);

        if (srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) {
                os_thread_exit(NULL);
        }
#else
        ib_longlong     old_signal_count;

        os_fast_mutex_lock(&(event->os_mutex));

        old_signal_count = event->signal_count;

        for (;;) {
                if (event->is_set == TRUE
                    || event->signal_count != old_signal_count) {

                        os_fast_mutex_unlock(&(event->os_mutex));

                        if (srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) {

                                os_thread_exit(NULL);
                        }
                        /* Ok, we may return */

                        return;
                }

                pthread_cond_wait(&(event->cond_var), &(event->os_mutex));

                /* Solaris manual said that spurious wakeups may occur: we
                have to check if the event really has been signaled after
                we came here to wait */
        }
#endif
}

The operating system mutex event->os_mutex protects the code. Do you mean that pthread_cond_broadcast() sometimes fails to wake up a thread waiting for a condition variable? That would be a serious bug in the operating system.
[28 Sep 2007 17:11] Heikki Tuuri
Yasufumi, I looked at your test program. I could not find a 'memory barrier' machine instruction when you set the lock word 0 or 1. Then a hang is possible because the memory order is not deterministic.

A memory barrier, like XCHG in x86, flushes the processor cache to the memory bus.

InnoDB uses a pthread mutex as the 'lock word' in the Linux version. I think pthread_mutex_trylock() and pthread_mutex_release() call XCHG or some other memory barrier instruction.

Your comments?

Regards,

Heikki
[28 Sep 2007 17:12] Heikki Tuuri
Oops, mb() is in your program! Sorry for my oversight.
[28 Sep 2007 17:13] Heikki Tuuri
Have you looked with gdb how the compiler compiles the 'mfence'?
[29 Sep 2007 1:32] Yasufumi Kinoshita
Heikki,

I suspect the following situation.

A: os_event_wait()
B: os_event_set()

...

A:+    os_fast_mutex_lock(&(event->os_mutex))
A:|    pthread_cond_wait(&(event->cond_var), &(event->os_mutex))
A:+        (in glibc) mutex_exit(&(event->os_mutex))

A:(some delay occur?)

B:+    os_fast_mutex_lock(&(event->os_mutex))
B:|    pthread_cond_broadcast(&(event->cond_var))
B:|        (in glibc) syscall (futex_cmp_requeue)-------!
B:+    os_fast_mutex_unlock(&(event->os_mutex))

A:         (in glibc) syscall (futex_wait)--------------!
A:+        (in glibc) mutex_enter(&(event->os_mutex))
A:+    os_fast_mutex_unlock(&(event->os_mutex))

I think it is possible...
[29 Sep 2007 1:36] Yasufumi Kinoshita
And at another environment.

kinoyasu@gauntlet:~> gcc -g -O -o pthread_cond_broadcast_test pthread_cond_broadcast_test.c -lpthread
pthread_cond_broadcast_test.c: In function 'main':
pthread_cond_broadcast_test.c:58: warning: incompatible implicit declaration of built-in function 'malloc'
pthread_cond_broadcast_test.c:74: warning: cast to pointer from integer of different size
pthread_cond_broadcast_test.c:85: warning: incompatible implicit declaration of built-in function 'exit'
kinoyasu@gauntlet:~> uname -a
Linux gauntlet 2.6.18.2-34-default #1 SMP Mon Nov 27 11:46:27 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux
kinoyasu@gauntlet:~> ./pthread_cond_broadcast_test
counter: 0
counter: 444136
counter: 890726
counter: 1335666
counter: 1781416
counter: 2227780
counter: 2673240
counter: 3118804
counter: 3391092
counter: 3391092
signals: 3389865
freezed..
kinoyasu@gauntlet:~> objdump -d pthread_cond_broadcast_test
...
0000000000400b40 <thread_main>:
  400b40:       41 55                   push   %r13
  400b42:       41 54                   push   %r12
  400b44:       55                      push   %rbp
  400b45:       53                      push   %rbx
  400b46:       48 83 ec 08             sub    $0x8,%rsp
  400b4a:       41 bc 00 00 00 00       mov    $0x0,%r12d
  400b50:       41 bd 01 00 00 00       mov    $0x1,%r13d
  400b56:       85 ff                   test   %edi,%edi
  400b58:       74 0c                   je     400b66 <thread_main+0x26>
  400b5a:       41 bc 01 00 00 00       mov    $0x1,%r12d
  400b60:       41 bd 00 00 00 00       mov    $0x0,%r13d
  400b66:       0f ae f0                mfence
  400b69:       48 8b 05 40 15 20 00    mov    2102592(%rip),%rax        # 6020b0 <lock_word>
  400b70:       49 39 c4                cmp    %rax,%r12
  400b73:       74 5a                   je     400bcf <thread_main+0x8f>
  400b75:       bb 00 00 00 00          mov    $0x0,%ebx
  400b7a:       48 bd ab aa aa aa aa    mov    $0xaaaaaaaaaaaaaaab,%rbp
  400b81:       aa aa aa
  400b84:       48 69 3d 01 15 20 00    imul   $0x901dfa9,2102529(%rip),%rdi        # 602090 <ut_rnd_ulint_counter>
  400b8b:       a9 df 01 09
  400b8f:       48 81 c7 9d c7 23 07    add    $0x723c79d,%rdi
  400b96:       48 89 3d f3 14 20 00    mov    %rdi,2102515(%rip)        # 602090 <ut_rnd_ulint_counter>
  400b9d:       48 89 f8                mov    %rdi,%rax
  400ba0:       48 f7 e5                mul    %rbp
  400ba3:       48 c1 ea 02             shr    $0x2,%rdx
  400ba7:       48 8d 14 52             lea    (%rdx,%rdx,2),%rdx
  400bab:       48 01 d2                add    %rdx,%rdx
  400bae:       48 29 d7                sub    %rdx,%rdi
  400bb1:       e8 6a fe ff ff          callq  400a20 <ut_delay>
  400bb6:       0f ae f0                mfence
  400bb9:       48 8b 05 f0 14 20 00    mov    2102512(%rip),%rax        # 6020b0 <lock_word>
  400bc0:       4c 39 e0                cmp    %r12,%rax
  400bc3:       74 0a                   je     400bcf <thread_main+0x8f>
  400bc5:       48 83 c3 01             add    $0x1,%rbx
  400bc9:       48 83 fb 14             cmp    $0x14,%rbx
  400bcd:       75 b5                   jne    400b84 <thread_main+0x44>
  400bcf:       0f ae f0                mfence
  400bd2:       48 8b 05 d7 14 20 00    mov    2102487(%rip),%rax        # 6020b0 <lock_word>
  400bd9:       4c 39 e0                cmp    %r12,%rax
  400bdc:       75 2a                   jne    400c08 <thread_main+0xc8>
  400bde:       48 83 05 c2 14 20 00    addq   $0x1,2102466(%rip)        # 6020a8 <lock_counter>
  400be5:       01
  400be6:       48 8b 3d d3 14 20 00    mov    2102483(%rip),%rdi        # 6020c0 <some_wait>
  400bed:       e8 2e fe ff ff          callq  400a20 <ut_delay>
  400bf2:       4c 89 2d b7 14 20 00    mov    %r13,2102455(%rip)        # 6020b0 <lock_word>
  400bf9:       b8 00 00 00 00          mov    $0x0,%eax
  400bfe:       e8 66 fe ff ff          callq  400a69 <os_event_set>
  400c03:       e9 5e ff ff ff          jmpq   400b66 <thread_main+0x26>
  400c08:       b8 00 00 00 00          mov    $0x0,%eax
  400c0d:       e8 aa fe ff ff          callq  400abc <os_event_reset>
  400c12:       0f ae f0                mfence
  400c15:       48 8b 05 94 14 20 00    mov    2102420(%rip),%rax        # 6020b0 <lock_word>
  400c1c:       4c 39 e0                cmp    %r12,%rax
  400c1f:       75 2a                   jne    400c4b <thread_main+0x10b>
  400c21:       48 83 05 7f 14 20 00    addq   $0x1,2102399(%rip)        # 6020a8 <lock_counter>
  400c28:       01
  400c29:       48 8b 3d 90 14 20 00    mov    2102416(%rip),%rdi        # 6020c0 <some_wait>
  400c30:       e8 eb fd ff ff          callq  400a20 <ut_delay>
  400c35:       4c 89 2d 74 14 20 00    mov    %r13,2102388(%rip)        # 6020b0 <lock_word>
  400c3c:       b8 00 00 00 00          mov    $0x0,%eax
  400c41:       e8 23 fe ff ff          callq  400a69 <os_event_set>
  400c46:       e9 1b ff ff ff          jmpq   400b66 <thread_main+0x26>
  400c4b:       b8 00 00 00 00          mov    $0x0,%eax
  400c50:       e8 9e fe ff ff          callq  400af3 <os_event_wait>
  400c55:       e9 0c ff ff ff          jmpq   400b66 <thread_main+0x26>

...
[29 Sep 2007 1:48] Yasufumi Kinoshita
..Hmm?
We need mb() also after unlock?

...But same results..

kinoyasu@gauntlet:~> diff -u pthread_cond_broadcast_test.c pthread_cond_broadcast_test_.c
--- pthread_cond_broadcast_test.c       2007-09-29 10:25:52.000000000 +0900
+++ pthread_cond_broadcast_test_.c      2007-09-29 10:47:33.000000000 +0900
@@ -125,6 +125,7 @@

            /* unlock */
            lock_word = to_word;
+           mb();
            os_event_set();
            continue;
        }
@@ -141,6 +142,7 @@

            /* unlock */
            lock_word = to_word;
+           mb();
            os_event_set();
            continue;
        }
kinoyasu@gauntlet:~> ./pthread_cond_broadcast_test_
counter: 0
counter: 363615
counter: 722905
counter: 1079729
counter: 1438090
counter: 1568983
counter: 1568983
signals: 1567255
freezed..
[30 Sep 2007 11:13] Yasufumi Kinoshita
This may not freeze...

Attachment: pthread_cond_broadcast_test2.c (text/x-csrc), 4.93 KiB.

[30 Sep 2007 11:31] Yasufumi Kinoshita
Heikki,

I missed the another possibility!
If we keep the consistency of to check lock_word and to call pthread_cond_wait(),
this check program may not freeze..

Sorry, it may not be the glibc's bug...

Now, we should suspect the following situation.

A-- want x_lock
B-- have s_lock, next want x_lock

A: x_lock (set RW_LOCK_WAIT_EX)
A:     os_event_reset()

B: s_unlock
B:     os_event_set()
B: x_lock (but A's RW_LOCK_WAIT_EX)
B:     os_event_reset()  ------------- Oops...

A,B:   os_event_wait() (!!)

Regards,

Yasufumi
[30 Sep 2007 12:00] Yasufumi Kinoshita
And, I think the following is smarter fix.

kinoyasu@gauntlet:~> diff -u pthread_cond_broadcast_test.c pthread_cond_broadcast_test3.c
--- pthread_cond_broadcast_test.c       2007-09-29 10:25:52.000000000 +0900
+++ pthread_cond_broadcast_test3.c      2007-09-30 20:59:47.000000000 +0900
@@ -27,8 +27,8 @@

 unsigned long ut_delay( unsigned long delay );

-void os_event_reset();
-void os_event_wait();
+unsigned long os_event_reset();
+void os_event_wait(unsigned long old_signal_count);
 void os_event_set();

 unsigned long ut_rnd_ulint_counter = 65654363;
@@ -90,7 +90,7 @@
 int thread_main( int thread_num )
 {
     unsigned long from_word, to_word;
-    unsigned long rnd, i;
+    unsigned long rnd, i, old_signal_count;

     switch(thread_num) {
        case 0: /* thread_num == 0 */
@@ -129,7 +129,7 @@
            continue;
        }

-       os_event_reset();
+       old_signal_count = os_event_reset();

        mb(); /* by way of precaution against memory ordering */
        if (lock_word == from_word) {
@@ -146,14 +146,16 @@
        }
        else {
            /* failure */
-           os_event_wait();
+           os_event_wait(old_signal_count);
            goto loop;
        }
     }
 }

-void os_event_reset()
+unsigned long os_event_reset()
 {
+    unsigned long ret;
+
     pthread_mutex_lock(&(event->os_mutex));

     if (!event->is_set) {
@@ -161,17 +163,20 @@
     } else {
        event->is_set = FALSE;
     }
+    ret = event->signal_count;

     pthread_mutex_unlock(&(event->os_mutex));
+
+    return(ret);
 }

-void os_event_wait()
+void os_event_wait(unsigned long old_signal_count)
 {
-    unsigned long old_signal_count;
+    /* unsigned long old_signal_count; */

     pthread_mutex_lock(&(event->os_mutex));

-    old_signal_count = event->signal_count;
+    /* old_signal_count = event->signal_count; */

     for (;;) {
        if (event->is_set == TRUE
[2 Oct 2007 14:25] Yasufumi Kinoshita
Suggestion patch to fix os_event_xxx for rw_lock and mutex.

Attachment: mysql-5.0.41_os_event_fix.patch.gz (application/x-gzip, text), 1.26 KiB.

[2 Oct 2007 14:29] Yasufumi Kinoshita
Heikki,

It's quite an event!
This patch also shows great effect to Bug#26442 !
I'll measure the performances and report in these days.

Regards,

Yasufumi
[2 Oct 2007 16:19] Heikki Tuuri
Yasufumi,

right, this is a bug in Sunny's scalability fix:

"
A-- want x_lock
B-- have s_lock, next want x_lock

A: x_lock (set RW_LOCK_WAIT_EX)
A:     os_event_reset()

B: s_unlock
B:     os_event_set()
B: x_lock (but A's RW_LOCK_WAIT_EX)
B:     os_event_reset()  ------------- Oops...

A,B:   os_event_wait() (!!)
"

http://linux.die.net/man/3/pthread_cond_signal

"
The pthread_cond_broadcast() function shall unblock all threads currently blocked on the specified condition variable cond. 
"

The A, B scenario above can happen if pthread_cond_broadcast() fails to wake up the waiting A thread. There might be a bug in Linux.

The old code in versions < 5.0.30 did not have this potential bug because there each thread waited for its own event in the wait array.

Very intelligent work from you to find this bug! This probably explains many of the hangs on the adaptive hash latch.

Thank you,

Heikki
[2 Oct 2007 16:28] Heikki Tuuri
Oops, I misunderstood your example. Both A and B call os_event_wait() on the last line. Then there is a hang.

This is an InnoDB bug, the pthread library is innocent.

The old code in < 5.0.30 did not have this bug:

/**************************************************************************
Looks for the cells in the wait array which refer to the wait object
specified, and sets their corresponding events to the signaled state. In this
way releases the threads waiting for the object to contend for the object.
It is possible that no such cell is found, in which case does nothing. */

void
sync_array_signal_object(
/*=====================*/
        sync_array_t*   arr,    /* in: wait array */
        void*           object) /* in: wait object */
{
        sync_cell_t*    cell;
        ulint           count;
        ulint           i;

        sync_array_enter(arr);

        arr->sg_count++;

        i = 0;
        count = 0;

        while (count < arr->n_reserved) {

                cell = sync_array_get_nth_cell(arr, i);

                if (cell->wait_object != NULL) {

                        count++;
                        if (cell->wait_object == object) {

                                sync_cell_event_set(cell);
                        }
                }

                i++;
        }

        sync_array_exit(arr);
}
[2 Oct 2007 16:47] Heikki Tuuri
Updated the synopsis.
[17 Oct 2007 13:11] Heikki Tuuri
Inaam now has patch that utilizes the 'signal_count' of an event, like Yasufumi. He still has to put the fix also to the Windows version.
[17 Oct 2007 16:51] Heikki Tuuri
The event->signal_count solution only works on Unix.

On Windows, we can add a second event to an rw-lock: 'WAIT_EX-event'. Only the thread that puts the rw-lock to the state WAIT_EX will wait at this special event.
[24 Oct 2007 5:38] Inaam Rana
Yasufumi,

Can you help us test the following patch? It is built on the suggestions that you provided. It attempts to fix the scenario of os_event_set() getting missed because of a later os_event_reset() (as described by you).

And again, thanks for all the help!

regards,
inaam
[24 Oct 2007 5:39] Inaam Rana
patch based on Yasufumi's idea of fixing missed os_event_set() call in sync0arr.

Attachment: os_event_5.0_v2.patch (text/x-diff), 17.22 KiB.

[25 Oct 2007 3:25] Yasufumi Kinoshita
benchmark results of the patch

Attachment: os_event_5.0_v2.pdf (application/pdf, text), 17.87 KiB.

[25 Oct 2007 3:26] Yasufumi Kinoshita
Inaam,

Thank you.
I've tried the patch to the same benchmark as Bug#26442.
And there seems to be no problems for me.

I think that the patch will also improve the other performances
at various cases of semaphore contention.

Regards,

Yasufumi
[5 Nov 2007 21:51] Ed Lucero
What is the status of this patch?

Is it available?
[6 Nov 2007 19:17] Timothy Smith
Ed,

The fix for this bug hasn't yet been released from the final review process in the InnoDB team.  Once that's done, it will be included in the public MySQL source code and subsequent releases.

For now, it's required to manually apply the preliminary version of the patch, linked in an earlier comment on this bug report, and build MySQL from source.

Regards,

Timothy
[7 Nov 2007 13:43] Valeriy Kravchuk
Bug #32157 was marked as a duplicate of this one.
[8 Nov 2007 23:55] Ed Lucero
I added the patch re-compiled and installed. Im still have problems. I uploaded the log files. bug-data-1.29560.gz, bug-data-2.29560.gz, bug-data-3.29560.gz

Ed
[9 Nov 2007 22:13] Ed Lucero
Hi All

We have had 11 restarts in the the last 2 days.

What kind of SQL code should I be looking for that will cause this?

Ed

071108 03:47:57  mysqld restarted
071108 07:28:59  mysqld restarted
071108 08:08:00  mysqld restarted
071108 08:08:29  mysqld restarted
071108 11:13:26  mysqld sretarted
071109 01:46:18  mysqld restarted
071109 06:30:10  mysqld restarted
071109 06:51:04  mysqld restarted
071109 07:27:35  mysqld restarted
071109 09:30:36  mysqld restarted
071109 13:20:53  mysqld restarted
[10 Nov 2007 13:08] Inaam Rana
Ed,

Where did you upload the files? I don't see them in the uploaded file section.

Why problem you exactly encounter while running this and how do you reach the conclusion that you have hit this bug?

Please upload the entire log file or at least a significant relevant chunk of it.

regards,
inaam
[12 Nov 2007 17:47] Ed Lucero
From bug #20358

[2 Oct 18:40] Heikki Tuuri

Ed Lucero has probably hit this bug:
http://bugs.mysql.com/bug.php?id=29560

I FTP'd the files to ftp://ftp.mysql.com/pub/mysql/upload/
[13 Nov 2007 15:03] Heikki Tuuri
Ed,

if the files are < 3 MB, please attach them to this bug report, or email them to me.

Regards,

Heikki
[14 Nov 2007 13:46] Heikki Tuuri
Ed,

what exact FreeBSD version you are running?

Regards,

Heikki
[14 Nov 2007 14:05] Heikki Tuuri
Ed Lucero's latest .err logs suggest that the WAIT_EX thread does not wake up in pthread_cond_broadcast(), though the waiting S threads do wake up. This is with a patched InnoDB version on FreeBSD.
[15 Nov 2007 15:53] Ed Lucero
FreeBSD 6.0
[15 Nov 2007 17:08] MySQL Verification Team
Heikki,

Sometimes some problems of this type are resolved on FBSD 6.0 by changing a current thread lib. There is a config file in FBSD for choosing a thread lib.
[15 Nov 2007 17:33] Ed Lucero
Currently running

# /etc/libmap.conf
#
# candidate             mapping
#

[mysqld]
libpthread.so.2 libthr.so.2
libpthread.so libthr.so
[16 Nov 2007 8:10] Yasufumi Kinoshita
Test code for signal implementation of pthread.

Attachment: pthread_cond_broadcast_test3.c (text/x-csrc), 4.51 KiB.

[16 Nov 2007 8:28] Yasufumi Kinoshita
Ed,

Could you try this program at your environment?
This test program never freeze at my environments (Linux-SMP).
If the program would freeze in a hour, the pthread implementation might have a problem...
[16 Nov 2007 13:31] Marko Mäkelä
Sent innodb-5.0-ss2095.tar.gz to MySQL
[16 Nov 2007 22:36] Ed Lucero
How do I compile and run this?
[19 Nov 2007 6:05] Yasufumi Kinoshita
It's simple.

(ex.)> gcc -O -o pthread_cond_broadcast_test3 pthread_cond_broadcast_test3.c [ -lthr | -lpthread | (other pthread library) ]
(ex.)> ./pthread_cond_broadcast_test3

Weren't you able to compile it?
[20 Nov 2007 0:51] Ed Lucero
No I was not.
[20 Nov 2007 17:53] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/38158

ChangeSet@1.2582, 2007-11-20 10:53:19-07:00, tsmith@ramayana.hindu.god +12 -0
  Applied InnoDB snapshot innodb-5.0-ss2095
  
  Fixes the following bugs:
  
  - Bug #29560: InnoDB >= 5.0.30 hangs on adaptive hash rw-lock 'waiting for an X-lock'
  
    Fixed a race condition in the rw_lock where an os_event_reset()
    can overwrite an earlier os_event_set() triggering an indefinite
    wait.
    NOTE: This fix for windows is different from that for other platforms.
    NOTE2: This bug is introduced in the scalability fix to the
    sync0arr which was applied to 5.0 only. Therefore, it need not be
    applied to the 5.1 tree. If we decide to port the scalability fix
    to 5.1 then this fix should be ported as well.
  
  - Bug #32125: Database crash due to ha_innodb.cc:3896: ulint convert_search_mode_to_innobase
  
    When unknown find_flag is encountered in convert_search_mode_to_innobase()
    do not call assert(0); instead queue a MySQL error using my_error() and
    return the error code PAGE_CUR_UNSUPP. Change the functions that call
    convert_search_mode_to_innobase() to handle that error code by "canceling"
    execution and returning appropriate error code further upstream.
[20 Nov 2007 18:23] Timothy Smith
Patch queued to the mysql-5.0-build tree (it is NULL MERGED to 5.1 and above)
[21 Nov 2007 18:52] Bugs System
Pushed into 5.0.54
[21 Nov 2007 18:53] Bugs System
Pushed into 5.1.23-rc
[21 Nov 2007 18:53] Bugs System
Pushed into 6.0.4-alpha
[26 Nov 2007 7:02] Ed Lucero
How do I patch my 5.0.45 version from the FreeBSD ports?

Ed
[30 Nov 2007 1:15] Paul DuBois
Noted in 5.0.54 changelog.

InnoDB had a race condition for an adaptive hash rw-lock waiting for
an X-lock.
[5 Dec 2007 4:15] Mark Callaghan
We have begun perf test for this on servers with 4 CPU cores and 20 concurrent queries. The results have been great. Our bug was reported in http://bugs.mysql.com/bug.php?id=30738
[20 Feb 2008 23:45] Timothy Smith
The fix for this bug has been queued for 5.1 (but NOT 6.0) as part of snapshot 5.1-ss2298.  It is null-merged up to 6.0 at this point, but should have a separate 6.0 snapshot from InnoDB developers to handle it there.
[22 Feb 2008 9:40] James Day
Paul,

Please change the release note to read:

InnoDB had a race condition for an adaptive hash rw-lock waiting for
an X-lock that could in theory crash the server. The fix may also give significant speed improvement on systems with a lot of contention for the adaptive hash index.
[22 Feb 2008 12:47] Heikki Tuuri
Paul,

there was no rw-lock bug in 5.1. The bug was ONLY in 5.0.

The big patch queued here ports the event-per-semaphore scalability fix from 5.0 to 5.1 (this time without any bugs, I hope).

Regards,

Heikki
[22 Feb 2008 12:48] Heikki Tuuri
I would again like to thank Yasufumi, who found this bug in 5.0.30 and provided the way to fix it. This was one of the worst bugs in InnoDB in recent years.
[25 Feb 2008 15:58] Bugs System
Pushed into 5.1.24-rc
[25 Feb 2008 16:04] Bugs System
Pushed into 6.0.5-alpha
[25 Feb 2008 18:59] Paul DuBois
Noted in 5.1.24 changelog.

Resetting to Patch Pending waiting for push into 6.0.x.
[25 Feb 2008 19:23] Paul DuBois
Correction: Patch does not apply to 5.1 or 6.0. Closing.
[24 Sep 2008 16:36] Kyle Joiner
Bug was reopened in error.  Closed.

My Apologies.  

Kyle Joiner
[7 Nov 2008 20:57] James Day
To assist in identifying affected servers, one symptom that suggests that this bug is the cause is the following sequence of messages in the error log file:

S-lock on RW-latch at 5166de8 created in file btr0sea.c line 139
a writer (thread id 1010) has reserved it in mode wait exclusive
number of readers 0, waiters flag 1

Generally accompanied by errors like this before the server is shut down:

InnoDB: Warning: a long semaphore wait:
--Thread 6412 has waited at .\btr\btr0sea.c line 916 for 785.00 seconds the semaphore:

If you see this pattern in affected versions of the server, consider this bug as the likely cause and upgrade. The latch number 5166de8 will be different.
[9 Feb 2010 1:44] Wei Zhang
Hello,

When I was looking at the bug fix, and I saw such comments:
"Waits for an event object until it is in the signaled state. If
srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS this also exits the
waiting thread when the event becomes signaled (or immediately if the
event is already in the signaled state).

Typically, if the event has been signalled after the os_event_reset()
we'll return immediately because event->is_set == TRUE.
There are, however, situations (e.g.: sync_array code) where we may
lose this information. For example:

thread A calls os_event_reset()
thread B calls os_event_set()   [event->is_set == TRUE]
thread C calls os_event_reset() [event->is_set == FALSE]
thread A calls os_event_wait()  [infinite wait!]
thread C calls os_event_wait()  [infinite wait!]

Where such a scenario is possible, to avoid infinite wait, the
value returned by os_event_reset() should be passed in as
reset_sig_count. *"

I am wondering if there are some particular queries can trigger this bug in mysql 5.0.41, or just to  make sure that the code paths mentioned above will be covered.(i.e. 3 threads-- A,B,C, and A calls os_event_reset, B calls os_event_set, C calls os_event_reset(), A calls os_event_wait(), and C calls os_event_wait(), while the event object touched in these 5 call sites are the same one ) ?

I am also curious what the situation "(e.g.: sync_array code)" is it about ?

Thank you very much,

Wei
[9 Feb 2010 19:25] Timothy Smith
Wei,

I'm sorry, at this point I can't provide any high-level description (from the perspective of the application) for the scenario where this bug might be in effect.  I recommend that you upgrade to the latest MySQL release if you are concerned about this; also see the comment from James Day which lists the common symptoms (error log messages) associated with this problem.
[9 Feb 2010 20:11] Inaam Rana
Wei,

The fix that you looked at deals with fairly low level datastructures and synchronization primitives inside InnoDB engine. These low level latches should not be confused with high level locks on tables/indexes. From application perspective there is no direct mapping between what the query is doing and in what order these latches are acquired. As Tim mentioned in his note, the best solution is to upgrade to the latest version of 5.0.
[10 Feb 2010 5:53] Wei Zhang
Hi Timothy and Inaam,

Thank you very much for your responses. I am doing some concurrency bug detection research. So I am collecting some benchmark to expose some currency bugs.This bug looks very interesting! So I am not trying to find a more robust "MySQL" but rather I am trying to find a "buggy one". I realized that os_event_set / reset / wait is probably the lowest synchronization MySQL goes to, and mutexes , latches are all based on this implementation. So I am just curious whether there are many ways(or inputs) to potentially trigger this bug ?

My question is about that are there any queries that could potentially trigger this bug? It doesn't need to really trigger the bug, as long as all the code paths that contribute to this bug are touched(i.e., os_event_wait, os_event_set, os_event_reset using the same os_event_t structure by different threads). And then I can do more study to figure what kind of timing or scheduling I need to trigger this bug . I am not sure if I have made myself understood ?

Thank you very much for the attention!

Wei
[11 Feb 2010 16:39] Inaam Rana
Wei,

I think I understand what you mean. This bug has manifested itself mostly on adaptive hash index latch (a.k.a. btr_search_latch). This latch comes under heavy battering when the queries are doing point searches on PK. So probably some in memory type workload (so that contention is not reduced because of I/O) involving extensive PK searches is likely to sensitize this issue.
[12 Feb 2010 21:39] Wei Zhang
Hi Inaam,

Thank you very much for the input! I think I have been able to repeat this bug by sending some queries from two threads, and adding couple of sleeps at places around s_lock, x_lock on btr_search_tree and also where the x_lock is implemented.

It seems to me that there is always a master thread implemented at srv_lock_timeout_and_monitor_thread() that does the deadlock detection work, and can resolve the issue. And it seems that deadlock detection thread monitors every 1 second. I am wondering is it true that if the workload is not heavy, all the deadlocks introduced by this bug will be resolved. But when the workload is heavy, the deadlock are generated faster than they are resolved, then it will lead to serious performance issues ?

Thanks a lot!

Wei