MySQL Bugs: #49169: read_view_open_now is inefficient with many concurrent sessions

Bug #49169	read_view_open_now is inefficient with many concurrent sessions
Submitted:	27 Nov 2009 16:09	Modified:	28 Feb 2013 13:27
Reporter:	Mark Callaghan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB Plugin storage engine	Severity:	S5 (Performance)
Version:	5.0,5.1,5.1 plugin	OS:	Any
Assigned to:	Sunny Bains	CPU Architecture:	Any
Tags:	innodb, kernel_mutex, performance, read_view_open_now

Description:
read_view_open_now limits InnoDB performance when there are many concurrent sessions. We have oprofile samples with 10% of CPU time in this function. kernel_mutex is held when this runs so that many other threads will be blocked on that instead of doing work.

I think there are two problems:
* the loop over trx_sys->trx_list
* possible call to malloc from read_view_create_low.

The overhead for both of these is higher when trx_sys->trx_list grows.

This is the loop
        while (trx) {
                if (trx != cr_trx && (trx->conc_state == TRX_ACTIVE ||
                                        trx->conc_state == TRX_PREPARED)) {

                        read_view_set_nth_trx_id(view, n, trx->id);

                        n++;

                        /* NOTE that a transaction whose trx number is <
                        trx_sys->max_trx_id can still be active, if it is
                        in the middle of its commit! Note that when a
                        transaction starts, we initialize trx->no to
                        ut_dulint_max. */

                        if (ut_dulint_cmp(view->low_limit_no, trx->no) > 0) {

                                view->low_limit_no = trx->no;
                        }
                }

                trx = UT_LIST_GET_NEXT(trx_list, trx);
        }

And this is the allocation from read_view_create_low
        view->trx_ids = mem_heap_alloc(heap, n * sizeof(dulint));

How to repeat:
Use oprofile on a busy server (8-cores, 1000+ sessions)

SHOW INNODB STATUS prints the the length of trx_sys->view_list but not of trx_sys->trx_list. It would help to know that length.

Above the loop there is a comment:

/* No active transaction should be visible, except cr_trx */

So assuming this comment can be believed, the loop could abort as soon as it found cr_trx, rather than how it goes through the trx_list entirely now. This could help somewhat.

I don't think that can stop the loop. The loop computes the min trx ID for an uncommitted transaction and creates an array of IDs for active transactions into a buffer specific to this transaction.

Pardon the hand waving as I am new to this code.

Can we fix this with a new cursor isolation level -- cursor-stability as implemented by Berkeley DB. With that level the list of transactions open at statement/transaction start are not needed as the cursor can read any data that was committed at the time of the read.

sysbench readonly QPS on an 8-core server for 1 .. 1024 threads with the loop in read_view_open_now skipped:

17857	31340	63910	135126	126846	125522	124231	123989	122999	117773	96123

and with the code unchanged 
18367	33506	62695	132566	125273	120714	109603	79058	59758	39833	17501

QPS drops significantly because of that loop.

One change slated for 5.6 is to get rid of dulints from the InnoDB code.
This should mitigate the problem somewhat.

reassigning to Sunny as kernel mutex is his area

A fix is pushed and bug is fixed in 5.6.

It's strange to have this closed without the fix documented in the manual (and explained here, or at least URL provided) and specific version with the fix, 5.6.x, explicitly listed. Had the rules for bugs processing changed in February, 2013?

There is nothing to document w.r.t the fix. It was fixed as part of the kernel mutex split in 5.6. Given the number of additional improvements that were made after the kernel mutex was eliminated it is difficult to pinpoint the exact release. Suffice to say the 5.6GA release can be considered as the fixed version. Given the complexity of the task, improvements around MVCC (that's what the read views are used for) are still in progress but the nature of the fix has shifted to another part of the code. I'm not sure how this can be all put in the documentation.

For 5.6 the fix was to split the trx_sys_t::trx_list into RW, RO and auto commit-read-only-non-locking. The rules are:

 1. If a transaction is started with "START TRANSACTION READ ONLY"; it will be put on the RO trx list.
 
 2. If a transaction is start with auto-commit enabled, and we can determine that it is a read-only and  non-locking select. It is not put on any list.

 3. All other transaction are put on the RW transaction list.

#3 is part of the ongoing effort to improve things further. Preliminary results are very encouraging.

Let me reiterate the issues reported in this bug:

> * the loop over trx_sys->trx_list

So in 5.6 there's still a loop over RW list, it's just that RO transactions are put into a separate list. But I don't think this can be claimed "fixed" in a general case.

> * possible call to malloc from read_view_create_low.

malloc is still there.

I appreciate all the already done and ongoing work, just wondering if the Closed status actually reflects the current state of things.

The issue is about the overhead of the read view creation. The loop and malloc being under the mutex is incidental to that fact. The function no longer has the overhead as claimed in the bug report.  This reduction in overhead is a direct result of the kernel mutex split and other related changes. Therefore this bug  is considered fixed and closed.

As mentioned in the previous comment, any future fixes around this will be done from a different perspective. e.g., sharing views is one option.

Isn't kernel_mutex just replaced with trx_sys->mutex in this specific case mentioned in the original report (i.e. read view creation).

In our benchmarks all the 5.6 improvements only apply to the case when the read-only transaction optimization kicks in (i.e. when prerequisites mentioned here and in the manual hold). However, any sysbench read-only transactional workload still suffers from mutex contention with the only difference of contending for trx_sys->mutex instead of kernel_mutex.

Yes, the read view creation is now covered by the trx_sys_t::mutex. Yes. there are optimisations around its handling for read-only transactions. However, you cannot equate the kernel_mutex with the trx_sys_t::mutex. Even in read-write loads it doesn't show up in any of my tests. Besides there are other ways of solving this problem around read view creation that I'm afraid I can't share in this bug report yet.

The contention around RW Sysbench is not uniform across all thread levels. It starts with the trx_sys_t::mutex, then switches to the dict_index_t::lock and finally once you cross 512 threads or so the contention is around the lock_sys_t::mutex.

As far as this bug report is concerned we will not be moving the memory allocation out of the mutex and will not be getting rid of the loop. 

Without "fixing" any of the issues mentioned in this bug report I was able to take the TPS up to 7.5K from 3.7K.

Sysbench 4G buffer pool, 4 instances 24 core HW OLTP_RW at 1K threads

LD_PRELOAD=/usr/lib64/libjemalloc.so.1 sysbench-linux --test=oltp --init-rng=on --oltp-read-only=off --oltp-dist-type=uniform --mysql-socket=/tmp/mysql-sunny.sock --mysql-user=root --max-requests=0 --mysql-table-engine=innodb --max-time=60 --oltp-table-size=1000000 --num-threads=1024  --oltp-skip-trx=off  run
    transactions:                        225725 (3756.75 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 4288775 (71378.29 per sec.)
    other operations:                    451450 (7513.50 per sec.)

To:

LD_PRELOAD=/usr/lib64/libjemalloc.so.1 sysbench-linux --test=oltp --init-rng=on --oltp-read-only=off --oltp-dist-type=uniform --mysql-socket=/tmp/mysql-sunny.sock --mysql-user=root --max-requests=0 --mysql-table-engine=innodb --max-time=60 --oltp-table-size=1000000 --num-threads=1024  --oltp-skip-trx=off  run
    transactions:                        456073 (7592.15 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 8665387 (144250.79 per sec.)
    other operations:                    912146 (15184.29 per sec.)

Therefore lets give this issue a rest because there is a good reason why we have decided to close this bug.