MySQL Bugs: #26442: Adaptive Hash Index's CPU scalability is low.

Bug #26442	Adaptive Hash Index's CPU scalability is low.
Submitted:	16 Feb 2007 6:57	Modified:	12 Aug 2009 17:18
Reporter:	Yasufumi Kinoshita	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.0.54	OS:	Any (All)
Assigned to:	Inaam Rana	CPU Architecture:	Any
Tags:	Contribution

Description:
Even if the number of CPU is increased,
there are no benefits for the certain kind of benchmark that uses the Adaptive Hash Index heavily.

How to repeat:
Executing the such kind of benchmarks on 4 or more CPU-machine.
(e.g. DBT-1)

Suggested fix:
## The following experimental patch is for x86 or x86_64 gcc only, because the assembly languages are used.

-- Optimizing rw_lock
      Making mutex-free.
      The events are divided with for-x_lock and for-s_lock.
      etc..

-- Splitting btr_search_latch
      "btr_cell_search_latch_arr[]" are splitted out from "btr_search_latch" for each groups of the hash cells.

Results using DBT-1 (my modifying version)

Attachment: DBT-1_no_fullscan_20070209.png (image/png, text), 8.47 KiB.

Experimental patch

Attachment: mysql-5.0.30_yasufumi_20070209.patch.gz (application/x-gzip, text), 17.63 KiB.

Another benchmark example of the "low" utilization ratio of Adaptive Hash.

Attachment: TPC-C_20070209.png (image/png, text), 8.06 KiB.

Yasufumi,

thank you very much for this patch!

The adaptive hash latch is indeed a point of contention under various workloads. One real-world user had it as the 'thrashing semaphore' in his workload.

Best regards,

Heikki

Heikki,

There are anxious points.
In my patch, the original "btr_search_latch" has not been used.
So, the following consistencies might not be secured.

- block->[curr_n_fields, curr_n_bytes, curr_side] and Adaptive Hash.

- rec_set_deleted_flag(rec,..), rec_get_deleted_flag(rec,..) and Adaptive Hash.

Best regards,

Yasufumi

Experimental patch (for also x86_64)

Attachment: mysql-5.0.30_yasufumi_20070405_x86_64.patch.gz (application/x-gzip, text), 17.50 KiB.

Heikki,

Sorry. I have not noticed up to now. 
The former patch may not be built on x86_64 systems..

This is a "first aid".
But it may be enough for experimental performance benchmarks.

Best regards,

Yasufumi

Yasufumi,

thank you for the new patch! See you at the conference!

Heikki

Yasufumi Kinoshita,

Thank you for your patch.

We've also seen scalability problems with adaptive hash indexes. in some extreme cases even disabling it showed better performance on high concurrencies.

I have been rechecking this experimental patch,
and I have found some bugs in the patch...

[at Optimizing rw_lock]
-- It is slightly inexact for heavy contentions of x_lock.
    # I can fix.

-- It still doesn't work with x86_64. (!! Sorry,)
    # I can fix.

[at Splitting btr_search_latch]
-- It doesn't fulfill the intentional lock of "consistent_read"(row0sel.c) .
    # It is very difficult to solve...
    # We may need a new lock mechanism, I think.
    #  (multiple x_lock to use for intentional locks of splitted x_locks)

    # Does anyone have better suggestions?

I am going to recreate a patch privately and to attach to this report one of these days.

And,

This problem may not be solved by pthread_rwlock_xxx().
Because pthread_rwlock_xxx() in glibc is depend on pthread_mutex_xxx()...

Patches for 5.0.41

Attachment: mysql-5.0.41_scale_patches.tar.gz (application/x-gzip, text), 15.42 KiB.

I discovered that we cannot disable all uses of Adaptive Hash Index and btr_search_latch by only the setting "use_adaptive_hash_index = FALSE".
And I made the patch.

Peter should be right.
This patch may be effective for the situations that the serever has many CPUs and the workload has high concurrencies.

Sure enough, to split the btr_search_latch is very difficult..
I think that we have the following two choices.

1. Optimizing rw_lock (x86 x86_64)
  [mysql-5.0.41_optimize_rwlock.patch]

OR

2. To disable Adaptive Hash Index entirely
  [mysql-5.0.41_disable_hash_search_entirely.patch]

-----------------

The following patches also may be effective depending on the situation.

- Page hash cell mutexes (splitted from buf_pool->mutex)
  [mysql-5.0.41_page_hash_cell_mutex.patch]
        This avoid contentions of buf_pool->mutex. (using page hash)

- Improvements of block->mutex and buf_pool->mutex (partially for x86 x86_64)
  [mysql-5.0.41_optimize_block_mutex_etc.patch]
        This may soften contentions of block->mutex and buf_pool->mutex.

- Online control parameter of flushing modified buffer and merging insert buffer
  [mysql-5.0.41_control_flush_and_merge.patch]
        When the "Hot Spot" of modifying activity is larger than the buffer pool,
        we should tune them for the better performance.

I will benchmark these patches one of those days..

Results (Xeon E5310 x2 : openSUSE 10.2 (kernel-2.6.18))

Attachment: scale_patch_combination_test_5.0.41.pdf (application/pdf, text), 27.35 KiB.

"RWlock"          : mysql-5.0.41_optimize_rwlock.patch
"DisableAdaptive" : mysql-5.0.41_disable_hash_search_entirely.patch
"Others"          : mysql-5.0.41_page_hash_cell_mutex.patch & mysql-5.0.41_optimize_block_mutex_etc.patch

Now, we can understand from these results as follows.

- The effects of "RWlock" are great.

- "Others" assists the effects of the both "DisableAdaptive(required!)" and "RWlock(a little)".

- "DisableAdaptive" causes bad effects for 4 or lower cores even with "Others".
  So we should implement this feature as switchable for users selection.

- Bland-new hardware and OS cause a better scale :-)

Then, I think we have two choices.

(best)
   Applying "RWlock" (Maybe, we should also apply "Others" for a little better result).

(secondly)
   Applying switchable-"DisableAdaptive" and "Others" (especially mysql-5.0.41_page_hash_cell_mutex.patch).

Yasufumi,

thank you for these measurements and observations!

Is the rw-lock patch now correct, or is there still some problem with it?

Best regards,

Heikki

Heikki,

There are no problems at the TPC-W(DBT-1)-based benchmarks.
So, I'm going to test the endurance of the patch with low buffer hit rate OLTP-benchmark on ramfs one of these days.

Best regards,

Yasufumi

fixed patch (page hash mutex)

Attachment: mysql-5.0.41_page_hash_cell_mutex_fixed.patch.gz (application/x-gzip, text), 3.20 KiB.

Heikki,

I have found only one problem after all and fixed.
(page_hash_cell_mutex patch)

So, there are at least no problem in the rw-lock patch on my environment!

Best regards,

Yasufumi

much more stable rwlock patch

Attachment: mysql-5.0.41_optimize_rwlock_stable.patch.gz (application/x-gzip, text), 7.10 KiB.

Heikki,

This patch is much more stable than previous one!
I have rechecked it with the pipeline conscious and found one mistake.

Best reagrds,

Yasyfumi

fixed a little

Attachment: mysql-5.0.41_optimize_rwlock_stable2.patch.gz (application/octet-stream, text), 7.08 KiB.

I encounter this problem with as few as 4 concurrent queries on a server with 4 CPU cores -- http://bugs.mysql.com/bug.php?id=30738

To fix Bug#29560 also may solve this problem for the time being.

Reagrds,

Yasufumi

The effects of Bug#29560 fix.

Attachment: os_event_fix_effects.pdf (application/pdf, text), 30.75 KiB.

Yasufumi,

I wonder how fixing the event bug in Bug #29560 improves the scalability with 8 cores? Could it be that the operating system quite often switches a task out just before the task calls os_event_wait()? But if that is the case, there should be many more hangs due to the Bug #29560.

Regards,

Heikki

Heikki,

Bug#29560 seems not to occur when we use single CPU.
So, a task switching may also be innocent.

And the true hangup model may be as follows.

"
A-- want x_lock
B-- have last s_lock
C-- want s_lock after "A"
(these tasks are executed at the same time by each CPU)

A: x_lock (set RW_LOCK_WAIT_EX)
A:     os_event_reset()

B: s_unlock
B:     os_event_set()

C: s_lock (but A's RW_LOCK_WAIT_EX)
C:     os_event_reset()  ------------- Oops...

A:     os_event_wait() (hangup!)
C:     os_event_wait() (hangup!)
"

I think this hangup is a very very rare case.

(* But, the hungup phenomena troubled me when made the patch [mysql-5.0.41_optimize_rwlock_stable2.patch.gz].
  If Bug#29560 is fixed, I may be able to rewrite the patch in more simply...)

However, the most possible problem may be
=================================================
when x_unlock,
sometimes several s_lock waiting threads are left
by the os_event_reset() interrupting.
and these threads are not wake up
until the next "last s_unlock".
=================================================

This is one of the main factor of the bad scalability, I think.
This phenomena may be caused by many CPUs and many sessions and heavy loads...

Regards,

Yasufumi

Yasufumi,

Excellent analysis! Thanks for the effort.

Whereas I agree with most of what you have said, I don't understand the following part:

[cut]
However, the most possible problem may be
=================================================
when x_unlock,
sometimes several s_lock waiting threads are left
by the os_event_reset() interrupting.
and these threads are not wake up
until the next "last s_unlock".
=================================================
[cut]

Whom do you think will call os_event_reset() in this case?

If after x_unlock, the lock is currently held shared then only a writer can call os_event_reset(). If that happens, then waiting of other readers is by design i.e.: we want the writer to go in before the readers.

Can you explain this scenario in a little more detail?

regards,
inaam

Inaam,

The processing speed of CPU may not be always constant.
(Because of influences by buses, caches, pipelines etc..)

So, a CPU can overtake the other CPUs.

I think that the following scenario is possible.

A-- have x_lock
B,C... -- want s_lock
(these tasks are executed at the almost same time by each CPU)

B: try s_lock... and fail
B: os_event_reset()
B: os_event_wait() --- can wake-up later

C: try s_lock... and fail
C: os_event_reset()

D: try s_lock... and fail
D: os_event_reset()

E: try s_lock... and fail

A: x_unlock
A: os_event_set() --- "B" wakes up

B: s_lock

C: os_event_wait() --- "C" can through
C: s_lock

E: os_event_reset() --------- !!

D: os_event_wait()
E: os_event_wait()

....

B: s_unlock
C: s_unlock
C: os_event_set() ---- "D","E" wake up

Ragards,

Yasufumi

Yasufumi,

I think the scenario above is not possible. Please see below. But there might be some other temporary hangs... We would need a mathematical proof to be sure.

Regards,

Heikki

"
A-- have x_lock
B,C... -- want s_lock
(these tasks are executed at the almost same time by each CPU)

B: try s_lock... and fail
B: os_event_reset()
B: os_event_wait() --- can wake-up later

###### HEIKKI:    rw-lock.waiters_exist == TRUE now!

C: try s_lock... and fail
C: os_event_reset()

D: try s_lock... and fail
D: os_event_reset()

E: try s_lock... and fail

A: x_unlock
A: os_event_set() --- "B" wakes up

###### HEIKKI:    rw-lock.waiters_exist == FALSE now!

B: s_lock

C: os_event_wait() --- "C" can through
C: s_lock

##### HEIKKI: Before E comes here, he calls mutex_enter(rw_lock_get_mutex(lock)) and checks if there is a (waiting) X-lock once again. There is no such X-request => E can proceed. The scenario below is not possible.

E: os_event_reset() --------- !!

D: os_event_wait()
E: os_event_wait()

....

B: s_unlock
C: s_unlock
C: os_event_set() ---- "D","E" wake up
"

sync0rw.c, function rw_lock_s_lock_spin():

...
        mutex_enter(rw_lock_get_mutex(lock));

        /* We try once again to obtain the lock */

        if (TRUE == rw_lock_s_lock_low(lock, pass, file_name, line)) {
                mutex_exit(rw_lock_get_mutex(lock));

                return; /* Success */
        } else {
                /* If we get here, locking did not succeed, we may
                suspend the thread to wait in the wait array */

                rw_s_system_call_count++;

                sync_array_reserve_cell(sync_primary_wait_array,
                                lock, RW_LOCK_SHARED,
                                file_name, line,
                                &index);

                rw_lock_set_waiters(lock, 1);

                mutex_exit(rw_lock_get_mutex(lock));

                if (srv_print_latch_waits) {
                        fprintf(stderr,
                "Thread %lu OS wait rw-s-lock at %p cfile %s cline %lu\n",
                                os_thread_pf(os_thread_get_curr_id()),
                        lock, lock->cfile_name, (ulong) lock->cline);
                }

                rw_s_system_call_count++;
                rw_s_os_wait_count++;

                sync_array_wait_event(sync_primary_wait_array, index);

                goto lock_loop;
        }
}

Heikki,

You are right. Sorry..

s_lock's os_event_reset() cannnot interrupt after x_unlock's os_event_set()...

The patch (mysql-5.0.41_os_event_fix.patch.gz) only prevents unlock event handling misses, caused by os_event_reset interrupting.

<when s_unlock>---------------------------------------------
x_lock   : sets RW_LOCK_WAIT_EX, os_event_reset()
s_unlock : os_event_set()
s_lock   : os_event_reset()
... 

This scenario is possible and always causes hang-up.
But, the benchmark results don't show such hang-up periods...
------------------------------------------------------------

<when x_unlock>---------------------------------------------
As you say, no waiter can interrupt.
Because they can lock (both x_lock and s_lock) after os_event_set().
------------------------------------------------------------

<when mutex_exit>-------------------------------------------
There are no mutex protection like rw_lock.
So, the interruption is possible.
But it never cause hang-up.
Because someone always lock that mutex after the os_event_reset().

In this case, the only problem is that
the interrupted waiters call needless pthread_cond_wait().

I think that the needless pthread_cond_wait() callings are next suspects..
------------------------------------------------------------

Regards,

Yasufumi

Yasufumi,

We have been talking about different things in this report. To recap we have touched four different aspects of improving scalability of InnoDB:

1) btr_search_latch split patch. As you mentioned, you have realized it is very difficult to split the btr_search_latch and perhaps the most easiest option, at least for now, is to be able to disable adaptive hash index in such workloads where it hurts. This has already been addressed and Vasil submitted a patch for this which is part of 5.0.52 and 5.1.23.

2) rw_lock tmeporary/permanent hang scenario, which was introduced in 5.0.30. A fix for that has already been committed as ttp://bugs.mysql.com/bug.php?id=29560.

3) Improving rw_lock implementation patch (using assembly). I have put that on my TODO and will probably look into it in 6.x time frame.

4) Split page hash mutex (i.e.: improving on buffer pool mutex). Again that is on the TODO list and will be taken up shortly.

I, for now, as part of cleaning up our backlog of bugs, set the status of this to 'not a bug' (as the only bug of hang reported here has been fixed already).

If you have something to add to your earlier findings/contributions please feel free to reopen this bug.

And thanks for your efforts and insight.

regards,
inaam

The new patch improves rw_lock without assembly (with Atomic Builtin of GCC,ICC)

Attachment: mysql-5.0.54_optimize_rwlock_atomic_builtin.patch (text/x-patch), 36.53 KiB.

Could I reopen this bug?

Then, I'll focus on the topic 3) only, but not using assembly (the above patch).
I've rewritten the patch using the atomic builtin of GCC,ICC without assembly.
(http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html)
It is simpler than the before (because based on Bug#29560 fixed version).
And more portable (there are no influences to environments not support the builtin).

Though Bug#29560 has been fixed, it might be needed for more extreme workload.
The following graphs are the results of the more extreme workload and shows the effect of the patch.
The difference from the above workloads in this bug is allowing JOIN with LIKE clause (hits many rows).
As you see, "Disable Adaptive Hash" shows bad effects for this workload.
(I guess that to disable adaptive hash has no good effects in any cases now, because Bug#29560 has fixed.)

Regards,
Yasufumi

Performance results

Attachment: extreme_workload_results.pdf (application/pdf, text), 20.86 KiB.

The main factor of the less scalability in the extreme case is clarified.
(Bug#35136)

Scanning secondary index seems to cause Adaptive Hash Index contentions.

Yasufumi, Inaam,

thank you. MySQL AB may improve the optimizer to avoid the unnecessary search on i_id in the join.

But we InnoDB engineers need to fix the adaptive hash so that its scalability is not spoiled by a join involving two adaptive hash searches per row examination.

Regards,

Heikki

Related bug discussing when the optimiser prefers secondary index scans to clustered index scans is bug #35850. There seems to be a conflict of optimisations: secondary index scans are faster than clustered index scans quite often but the secondary index scans can cause contention for the adaptive hash index on busy systems.

I don't feel so conflicted.

This problem caused by the "unnecessary" adaptive hash search.
We don't say to avoid also necessary adaptive hash search.
Of course, secondary index scans without adaptive hash searches should be always faster than clustered index scans.

The adaptive hash search is caused by the secondary index serach,
only if the request from mysqld contains the columns other than
the columns of the index and primary key columns.

In this case, because of the like clause, mysqld request the all rows by the secondary index serach.
So, InnoDB must search adaptive hash even for the all rows don't satisfy the condition.

I think...

Ideally, if mysqld can push the condition down to InnoDB,
using the secondary index serach may become the best plan.

Otherwise, when mysqld request secondary index scan,
picking up the columns not included in the index or primary key should be executed later
to save the adaptive hash searches.

I agree that the conflict can be eliminated. It's just a problem today, where different use cases benefit from different choices.

We have had scalability problems with a query running on an 2 x Quad-Core Debian etch machine.

We did have specific query which was our worst offender but we also had a number of others with the same issue.

We ran some tests before and after applying the complete set of scale patches of a number of clients running the query with SQL_NO_CACHE. These gave us the following times per query:

5.0.41 Original
5 Clients - 3s to 3.5s
10 Clients - 14.9s to 17.6s
20 Clients - 31.8s to 34.0s
30 Clients - 39.5s to 43.5s

After Scalability Patches
5 Clients - 1.9s to 2.0s
10 Clients - 1.3s to 2.6s
20 Clients - 5.6s to 6.6s
30 Clients - 9.1s to 10.5s

In the original 5.0.41 version the query did not scale at all as the number of clients increased, this made certain functions in our application unusable. However after the patches it can be seen that this query now scales very well.

Are there a set of these scale patches for later versions (such as 5.0.67)? Is there any issue with applying these patches to later versions? 
Are there any known stability issues with the patches?

5.0 is frozen from major changes.

We are looking at the Google SMP scalability patches, as well as some other patches. It is unlikely that we will put them to 5.1, because MySQL-5.1's GA status is so close.

In 5.1, you can remove the contention on the adaptive hash with the my.cnf option --skip-innodb-adaptive-hash-index

On the application side, you could throttle the number of concurrent queries your application can post to the mysqld server. Some degradation of performance is inevitable if 10 or 100 queries all execute in parallel.

As you may know, I have fixed my patch.
The performance for x-lock contentions may be improved.

http://www.mysqlperformanceblog.com/2008/10/20/improved-innodb-rw_lock-patch/

With the introduction of atomics for locking in plugin 1.0.3 and with fixing of race conditions in the rw-lock code this is fixed. Surely there may be room for further improvement but as of now adaptive hash index has not been reported as a pain point in innodb plugin.

source/binaries and documentation of innodb plugin is available at innodb.com