MySQL Bugs: #53825: Removing locks from queue is very CPU intensive with many locks

Bug #53825	Removing locks from queue is very CPU intensive with many locks
Submitted:	19 May 2010 20:17	Modified:	19 May 2010 20:25
Reporter:	Harrison Fisk	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	5.0, 5.1, plugin	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	lock queueing

Description:
Under high load with many transactions pending on a few rows, it can be possible to run into very high CPU usage due to removing locks from queue.

The problem is that removing a lock from queue is a O(N^2) operation.  InnoDB loops through all locks that were blocking on the removed lock and then for each has to verify if each can be granted or not. 

Another big problem with this is that the kernel_mutex is held during the time, which restricts other operations when they conflict on the kernel_mutex.

Here is example oprofile output:

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
257671 91.1329 mysqld mysqld lock_rec_has_to_wait_in_queue
18913 6.6891 no-vmlinux no-vmlinux /no-vmlinux
1214 0.4294 mysqld mysqld lock_rec_dequeue_from_page
793 0.2805 libc-2.5.so libc-2.5.so fgetc

How to repeat:
Run many threads doing many updates on a single row.  

Once the deadlock detection is faster (or disabled) from BUG #49047, then the dequeueing becomes the prominent bottleneck.  It is possible to hit this with regular binaries as well, but it is relatively rare since normally deadlock detection is hit first.

Suggested fix:
Find a new algorithm to use for removing locks from the queue or new data structure which allows it to be more efficient.

Harrison,

How much memory does the InnoDB record lock bitmap actually save in practice ?
If for large sites it doesn't really save much then one solution is to get rid of the bitmap
and make it easier to traverse the <space, pageno,heapno> list. This speeds up the 
lock traversal code hugely. Heikki is of the opinion that the record lock bitmap saves
users a lot of memory and should be preserved. I'm curious as to whether it has as
big an impact as is claimed. 

Regards,
-sunny

Hi Sunny,

What is the alternative to the bitmap?  What sort of space multiplier would it be with a tree or something else?  Assuming it isn't horrible, then it could certainly be useful if it really is much faster.

Generally speaking on most production systems there are thousands of locks at any given time.  However, there are some common scenarios which can set an enormous amount of locks, so those need to be thought of as well.  The ones I can think of off hand:

1.  Updating every row in a table
2.  INSERT...SELECT setting shared locks on the SELECT clause
3.  Serializable transaction level

I've seen error messages from customers regarding the running out of memory for the lock table, such as:

InnoDB: WARNING: over 4 / 5 of the buffer pool is occupied by
InnoDB: lock heaps or the adaptive hash index! Check that your
InnoDB: transactions do not set too many row locks.
InnoDB: Your buffer pool size is 8 MB. Maybe you should make
InnoDB: the buffer pool bigger?

Generally this has only occurred with people with very small buffer pools, but if the space required multiplier is large, it could certainly be more common, which isn't good.

Counteracting that is the fact that the default buffer pool is now 128M, so this would be less common with the later versions as well.

Harrison,

>What is the alternative to the bitmap?  What sort of space multiplier would it be with a
>tree or something else?  Assuming it isn't horrible, then it could certainly be useful if
>it really is much faster.

There are several alternatives here:

  1. Keep the bitmap - This change is a minor change, all we do is hash on <space, pageno>
      as we do currently but the double linked list is anchored in the hash cell allowing cheap
     forward and backward traversal. I tried this and it improves things somewhat but there is
     still the expense of traversing all the locks on a page when we are only interested in the
    locks on <space, pageno, heapno>. Current design only allows cheap forward traversal.

2. Remove the bitmap altogether and use <space, pageno, heapno> as the key and anchor
    the bitmap list there. This is the fastest in my tests. There are further optimisations possible
    here by making the locking more fine grained and/or using lock free list for the actual locks.

3. Haven't thought this through but some type of hybrid of the above two.

>Generally speaking on most production systems there are thousands of locks at any given
>time.  However, there are some common scenarios which can set an enormous amount of
>locks, so those need to be thought of as well.  The ones I can think of off hand:

>1.  Updating every row in a table
>2.  INSERT...SELECT setting shared locks on the SELECT clause
>3.  Serializable transaction level

The downside is that the <space, pageno, heapno> idea will use more memory for at least
1 and 2 use cases above. It is the usual space vs speed tradeoff. 

>I've seen error messages from customers regarding the running out of memory for the lock
>table, such as:
>
>InnoDB: WARNING: over 4 / 5 of the buffer pool is occupied by
>InnoDB: lock heaps or the adaptive hash index! Check that your
>InnoDB: transactions do not set too many row locks.
>InnoDB: Your buffer pool size is 8 MB. Maybe you should make
>InnoDB: the buffer pool bigger?

>Generally this has only occurred with people with very small buffer pools, but if the
>space required multiplier is large, it could certainly be more common, which isn't good.

There are several causes for this too, I've debugged a case reported by internal testing
where 256MB was used by single transaction for record locks. However, the problem
was not that there were that many active locks but lots of stale locks on that transactions
memory heap. It is this that makes me suspicious of claims about the bitmap being the
knight in shining armour :-). Perhaps the space lost by ditching the bitmap could be
reclaimed from the memory heap of stale locks (just a thought).

>Counteracting that is the fact that the default buffer pool is now 128M, so this would be
>less common with the later versions as well.

I think so too. Also, I can see that the bitmap was useful 10-15 years ago when memory
was expensive and limited.

Hi Sunny,

> 1. Keep the bitmap - This change is a minor change, all we do is hash on <space, pageno>
>     as we do currently but the double linked list is anchored in the hash cell allowing cheap
>    forward and backward traversal. I tried this and it improves
things somewhat but there is
>    still the expense of traversing all the locks on a page when we are only interested in the
>   locks on <space, pageno, heapno>. Current design only allows cheap
forward traversal.

What do we need to do cheap reverse traversal for now?  

It used to be used in deadlock detection but is removed via the fix in bug #49047.  

Looking at the plugin code, it only looks like lock_rec_get_prev() is used in lock_queue_iterator_get_prev() which is only used in trx0i_s.c which should only be used for the information schema.

> 2. Remove the bitmap altogether and use <space, pageno, heapno> as the key and anchor
>   the bitmap list there. This is the fastest in my tests. There are further optimisations possible
>   here by making the locking more fine grained and/or using lock free list for the actual locks.

What would be the expected worse case memory increase for this?  

We'd have more hash cells instead of bitmap elements, so it looks like it would be a pointer instead of the bit in the bitmap?  So if there were lots of row locks on the same page, it could be up roughly 1 pointer size of memory wasted per lock (ie. 64 times worse for 64-bit)?

> What do we need to do cheap reverse traversal for now?  

Some background: With the kernel mutex split the inefficiencies in the
locking code result in all sorts of problems: e.g., at >= 128 threads on
an 8 core 32G machine the TPS drops by half and the functions are all
in lock0lock.c. The get_nth_bit() is the most expensive.

Cheap reverse traversal helps in deadlock detection and will be required
when we have a more intelligent lock release strategy compared to the
rather simple version that we have now. 

> It used to be used in deadlock detection but is removed via the fix in bug #49047.  

Only partially, it is still very expensive and can be made a lot more faster. I see this
as a bottleneck in my tests.

> Looking at the plugin code, it only looks like lock_rec_get_prev() is used in
>lock_queue_iterator_get_prev() which is only used in trx0i_s.c which should only be used
>for the information schema.

This use case can be ignored.

> What would be the expected worse case memory increase for this?  
> We'd have more hash cells instead of bitmap elements, so it looks like it would be a
> pointer instead of the bit in the bitmap?  So if there were lots of row locks on the same
> page, it could be up roughly 1 pointer size of memory wasted per lock (ie. 64 times worse
> for 64-bit)?

Well, it's not an oranges vs oranges comparison, The bitmap by default is always up to the
heapno + an 8 byte safety margin. So if you have <space, pageno, 512> the size of the
bitnap is (512 + 64) / 8 + 1. If you remove the bitmap then that is 72 bytes less for the
non-bitmap version. Also, note that the bitmap doesn't expand, the old record lock is
discarded and a new record lock with the bigger bitmap is created to replace it. The old'
lock(s) are the stale locks that I referred to earlier on the transaction's lock heap. It is
most likely these locks that the customers are seeing. This waste of memory is for some
reason ignored in any discussion about record locks :-). I've very keen to see some
numbers where this is also factored in so that we can conclude whether the bitmap
really is as useful as it is cranked up to be.

From lock0lock.c, the code referred to in my earlier comment:

   /* Make lock bitmap bigger by a safety margin */
        n_bits = page_dir_get_n_heap(page) + LOCK_PAGE_BITMAP_MARGIN;
        n_bytes = 1 + n_bits / 8;
                        
        lock = mem_heap_alloc(trx->lock_heap, sizeof(lock_t) + n_bytes);

page_dir_get_n_heap(page) returns the total number of records in the heap ie.
the number of bits required.

> Some background: With the kernel mutex split the inefficiencies in the locking 
> code result in all sorts of problems: e.g., at >= 128 threads on an 8 core 32G 
> machine the TPS drops by half and the functions are all in lock0lock.c. The 
> get_nth_bit() is the most expensive.
>
> Cheap reverse traversal helps in deadlock detection and will be required when 
> we have a more intelligent lock release strategy compared to the rather simple 
> version that we have now.
>
>> It used to be used in deadlock detection but is removed via the fix in bug #49047.
>
> Only partially, it is still very expensive and can be made a lot more faster. 
> I see this as a bottleneck in my tests.

Okay, we would certainly like to see even faster deadlock detection.  This is often a bottleneck in our real production servers as well (so it isn't just a benchmark problem).  Disabling deadlock detection is still something that is often required on our servers.

> Well, it's not an oranges vs oranges comparison, The bitmap by default is 
> always up to the heapno + an 8 byte safety margin. So if you have <space, 
> pageno, 512>the size of the bitnap is (512 + 64) / 8 + 1. If you remove the 
> bitmap then that is 72 bytes less for the non-bitmap version. Also, note that 
> the bitmap doesn't expand, the old record lock is discarded and a new record 
> lock with the bigger bitmap is created to replace it. The old' lock(s) are 
> the stale locks that I referred to earlier on the transaction's lock heap. It 
> is most likely these locks that the customers are seeing. This waste of 
> memory is for some reason ignored in any discussion about record locks :-). 
> I've very keen to see some numbers where this is also factored in so that we 
> can conclude whether the bitmap really is as useful as it is cranked up to be.

Okay, what exact information would you like from a production system?  

I'm not sure how to get the amount of memory used for the locks or how to calculate what is saved by the bitmap vs. other methods?  

I know Percona has a patch which shows the amount of memory used by the locking subsystem, not sure if that is available on the systems or not.

I have some GDB functions that I use to look at the memory used by locking.
These functions can even report memory wasted by the stale locks. However,
this technique is very intrusive :-). 

One other option discussed with Heikki was to use the performance schema
mechanism to report the memory usage and other metrics. I need to add this
code, once I have something we can work on the modalities around getting
it to you etc.

Which one do you prefer ?

Performance schema is a non-starter since there isn't any 5.5 here yet.  GDB might be possible, but I'll need to check on how we can get it done (probably on a shadow system).

Can you attach the GDB functions?

GDB functions to print lock stats

Attachment: gdb-funcs.txt (text/plain), 2.31 KiB.

These are the functions that I've been using for my own testing/debugging. I think
I should refine them further to print the relevant metrics to make the discussion
more meaningful. The numbers required would be:

  1. Average bits set per lock
  2. Average memory used by all rec locks
  3. Average memory in stale locks

(The average is over  samples taken every N seconds)

Feel free to add to this list. I think the first should give us a feel for the bitmap
utilisation, higher the utilisation the more useful the bitmap is in practice.
 2 & 3 should give us the % of memory overhead used by stale locks.

I've added the GDB functions so that you can check that they actually work with
your code base.

What do you think ?

> These are the functions that I've been using for my own testing/debugging. 
>
> I think I should refine them further to print the relevant metrics to make 
> the discussion more meaningful. The numbers required would be:
>
>  1. Average bits set per lock
>  2. Average memory used by all rec locks
>  3. Average memory in stale locks
>
> (The average is over  samples taken every N seconds)
>
> Feel free to add to this list. I think the first should give us a feel for 
> the bitmap utilisation, higher the utilisation the more useful the bitmap is 
> in practice.  2 & 3 should give us the % of memory overhead used by stale 
> locks.
> 
> I've added the GDB functions so that you can check that they actually work 
> with your code base.

The print_all_rec_locks function is extremely slow since there is a large n_cell size, so I will avoid using that one.

print_trx_lock_heap_size is failing for me, saying "There is no member named lock."  I think that should be:

  printf "0x%016lx %8ld\n", $trx, $trx->lock_heap->total_size

Instead of:

  printf "0x%016lx %8ld\n", $trx, $trx->lock.heap->total_size

It seems to work with that change.

The additional information is fine by me.  

I just want to see the locking speed improvements, so if those will get you what you need to justify it, send me some gdb code and I'll try to get it done.  If you have any patches available with your new locking code, I could potentially try those as well.

Regarding trx_t::lock, that is part of my kernel mutex split code. Your
change is correct for existing code. I don't really need to print the locks,
I just need the aggregated result, I think that should be fast enough.

I will try and get the GDB functions mentioned earlier to you ASAP. I
would really like the kernel mutex split code along with the other lock
changes to you but that requires management approval. Need to check
on that one.