Bug #53825 | Removing locks from queue is very CPU intensive with many locks | |
---|---|---|---
Submitted: | 19 May 2010 20:17 | Modified: | 19 May 2010 20:25 |
Reporter: | Harrison Fisk | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 5.0, 5.1, plugin | OS: | Any |
Assigned to: | Assigned Account | CPU Architecture: | Any |
Tags: | lock queueing |
[19 May 2010 20:17]
Harrison Fisk
[24 Jun 2010 21:18]
Sunny Bains
Harrison,

How much memory does the InnoDB record lock bitmap actually save in practice? If for large sites it doesn't really save much, then one solution is to get rid of the bitmap and make it easier to traverse the <space, pageno, heapno> list. This speeds up the lock traversal code hugely.

Heikki is of the opinion that the record lock bitmap saves users a lot of memory and should be preserved. I'm curious as to whether it has as big an impact as is claimed.

Regards,
-sunny
[5 Jul 2010 22:34]
Harrison Fisk
Hi Sunny,

What is the alternative to the bitmap? What sort of space multiplier would it be with a tree or something else? Assuming it isn't horrible, it could certainly be useful if it really is much faster.

Generally speaking, on most production systems there are thousands of locks at any given time. However, there are some common scenarios which can set an enormous number of locks, so those need to be thought of as well. The ones I can think of offhand:

1. Updating every row in a table
2. INSERT...SELECT setting shared locks on the SELECT clause
3. The SERIALIZABLE transaction isolation level

I've seen error messages from customers about running out of memory for the lock table, such as:

InnoDB: WARNING: over 4 / 5 of the buffer pool is occupied by
InnoDB: lock heaps or the adaptive hash index! Check that your
InnoDB: transactions do not set too many row locks.
InnoDB: Your buffer pool size is 8 MB. Maybe you should make
InnoDB: the buffer pool bigger?

Generally this has only occurred with people running very small buffer pools, but if the required space multiplier is large, it could certainly become more common, which isn't good. Counteracting that is the fact that the default buffer pool is now 128M, so this would be less common with later versions as well.
[5 Jul 2010 23:28]
Sunny Bains
Harrison,

> What is the alternative to the bitmap? What sort of space multiplier would it be with a
> tree or something else? Assuming it isn't horrible, it could certainly be useful if it
> really is much faster.

There are several alternatives here:

1. Keep the bitmap. This is a minor change: we still hash on <space, pageno> as we do currently, but the doubly linked list is anchored in the hash cell, allowing cheap forward and backward traversal. I tried this and it improves things somewhat, but there is still the expense of traversing all the locks on a page when we are only interested in the locks on <space, pageno, heapno>. The current design only allows cheap forward traversal.

2. Remove the bitmap altogether, use <space, pageno, heapno> as the key, and anchor the lock list there. This is the fastest in my tests. Further optimisations are possible here by making the locking more fine-grained and/or using a lock-free list for the actual locks.

3. I haven't thought this through, but some type of hybrid of the above two.

> Generally speaking, on most production systems there are thousands of locks at any given
> time. However, there are some common scenarios which can set an enormous number of
> locks, so those need to be thought of as well. The ones I can think of offhand:
> 1. Updating every row in a table
> 2. INSERT...SELECT setting shared locks on the SELECT clause
> 3. The SERIALIZABLE transaction isolation level

The downside is that the <space, pageno, heapno> idea will use more memory for at least use cases 1 and 2 above. It is the usual space vs. speed tradeoff.

> I've seen error messages from customers about running out of memory for the lock
> table, such as:
>
> InnoDB: WARNING: over 4 / 5 of the buffer pool is occupied by
> InnoDB: lock heaps or the adaptive hash index! Check that your
> InnoDB: transactions do not set too many row locks.
> InnoDB: Your buffer pool size is 8 MB. Maybe you should make
> InnoDB: the buffer pool bigger?
>
> Generally this has only occurred with people running very small buffer pools, but if the
> required space multiplier is large, it could certainly become more common, which isn't good.

There are several causes for this too. I've debugged a case reported by internal testing where 256MB was used by a single transaction for record locks. However, the problem was not that there were that many active locks, but that there were lots of stale locks on that transaction's memory heap. It is this that makes me suspicious of claims about the bitmap being the knight in shining armour :-). Perhaps the space lost by ditching the bitmap could be reclaimed from the memory heap of stale locks (just a thought).

> Counteracting that is the fact that the default buffer pool is now 128M, so this would be
> less common with later versions as well.

I think so too. Also, I can see that the bitmap was useful 10-15 years ago when memory was expensive and limited.
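A minimal sketch of the two keying schemes being compared (illustrative struct and field names only, not the actual InnoDB definitions):

    struct trx;     /* owning transaction, opaque here */

    /* Alternative 1: one lock object per <space, pageno>, hashed on
       <space, pageno>, with a bitmap indexed by heapno.  Finding the
       queue for a single record means scanning every lock on the page
       and testing that record's bit in each. */
    struct rec_lock_page {
        unsigned long   space;
        unsigned long   page_no;
        struct trx     *trx;
        unsigned char   bitmap[];   /* bit i set => heapno i is locked */
    };

    /* Alternative 2: one lock object per <space, pageno, heapno>, hashed
       on all three values and linked both ways, so the queue for a single
       record can be walked directly, forwards or backwards. */
    struct rec_lock_rec {
        unsigned long        space;
        unsigned long        page_no;
        unsigned long        heap_no;
        struct trx          *trx;
        struct rec_lock_rec *prev;  /* cheap backward traversal */
        struct rec_lock_rec *next;
    };

The space/speed tradeoff discussed in the comments that follow comes directly from this difference: one bit per locked record in the first layout versus a full struct, including pointers, per locked record in the second.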
[5 Jul 2010 23:56]
Harrison Fisk
Hi Sunny,

> 1. Keep the bitmap. This is a minor change: we still hash on <space, pageno> as we do
> currently, but the doubly linked list is anchored in the hash cell, allowing cheap forward
> and backward traversal. I tried this and it improves things somewhat, but there is still
> the expense of traversing all the locks on a page when we are only interested in the
> locks on <space, pageno, heapno>. The current design only allows cheap forward traversal.

What do we need cheap reverse traversal for now? It used to be used in deadlock detection, but that was removed by the fix for bug #49047. Looking at the plugin code, lock_rec_get_prev() only appears to be used in lock_queue_iterator_get_prev(), which is only used in trx0i_s.c, which should only be used for the information schema.

> 2. Remove the bitmap altogether, use <space, pageno, heapno> as the key, and anchor the
> lock list there. This is the fastest in my tests. Further optimisations are possible here
> by making the locking more fine-grained and/or using a lock-free list for the actual locks.

What would be the expected worst-case memory increase for this? We'd have more hash cells instead of bitmap elements, so it looks like it would be a pointer instead of a bit in the bitmap? So if there were lots of row locks on the same page, it could waste roughly one pointer's worth of memory per lock (i.e. 64 times worse for 64-bit)?
[6 Jul 2010 0:20]
Sunny Bains
> What do we need cheap reverse traversal for now?

Some background: with the kernel mutex split, the inefficiencies in the locking code result in all sorts of problems: e.g., at >= 128 threads on an 8-core, 32G machine the TPS drops by half, and the hot functions are all in lock0lock.c. get_nth_bit() is the most expensive.

Cheap reverse traversal helps in deadlock detection and will be required when we have a more intelligent lock release strategy compared to the rather simple version that we have now.

> It used to be used in deadlock detection, but that was removed by the fix for bug #49047.

Only partially; it is still very expensive and can be made a lot faster. I see this as a bottleneck in my tests.

> Looking at the plugin code, lock_rec_get_prev() only appears to be used in
> lock_queue_iterator_get_prev(), which is only used in trx0i_s.c, which should only be used
> for the information schema.

This use case can be ignored.

> What would be the expected worst-case memory increase for this? We'd have more hash cells
> instead of bitmap elements, so it looks like it would be a pointer instead of a bit in the
> bitmap? So if there were lots of row locks on the same page, it could waste roughly one
> pointer's worth of memory per lock (i.e. 64 times worse for 64-bit)?

Well, it's not an apples-to-apples comparison. The bitmap by default is always sized up to the heapno plus an 8-byte safety margin. So if you have <space, pageno, 512>, the size of the bitmap is (512 + 64) / 8 + 1 bytes. If you remove the bitmap, that is 72 bytes less for the non-bitmap version.

Also, note that the bitmap doesn't expand: the old record lock is discarded and a new record lock with a bigger bitmap is created to replace it. The old lock(s) are the stale locks that I referred to earlier on the transaction's lock heap. It is most likely these locks that the customers are seeing. This waste of memory is for some reason ignored in any discussion about record locks :-). I'm very keen to see some numbers where this is also factored in, so that we can conclude whether the bitmap really is as useful as it is cracked up to be.
[6 Jul 2010 0:37]
Sunny Bains
From lock0lock.c, the code referred to in my earlier comment:

    /* Make lock bitmap bigger by a safety margin */
    n_bits = page_dir_get_n_heap(page) + LOCK_PAGE_BITMAP_MARGIN;
    n_bytes = 1 + n_bits / 8;

    lock = mem_heap_alloc(trx->lock_heap, sizeof(lock_t) + n_bytes);

page_dir_get_n_heap(page) returns the total number of records in the heap, i.e. the number of bits required.
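A worked example of the allocation above, for a page whose heap holds 512 records and assuming LOCK_PAGE_BITMAP_MARGIN is 64 bits (the value implied by the (512 + 64) / 8 + 1 arithmetic in the earlier comment):

    n_bits  = 512 + 64;      /* page_dir_get_n_heap(page) + margin = 576 */
    n_bytes = 1 + 576 / 8;   /* = 73 bytes of bitmap */

    /* The allocation is sizeof(lock_t) + 73 bytes on trx->lock_heap.  If the
       page heap later grows beyond what the bitmap covers, a new, larger lock
       is allocated and the old one is left behind as a stale lock. */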
[6 Jul 2010 0:46]
Harrison Fisk
> Some background: with the kernel mutex split, the inefficiencies in the locking
> code result in all sorts of problems: e.g., at >= 128 threads on an 8-core, 32G
> machine the TPS drops by half, and the hot functions are all in lock0lock.c.
> get_nth_bit() is the most expensive.
>
> Cheap reverse traversal helps in deadlock detection and will be required when
> we have a more intelligent lock release strategy compared to the rather simple
> version that we have now.
>
>> It used to be used in deadlock detection, but that was removed by the fix for bug #49047.
>
> Only partially; it is still very expensive and can be made a lot faster.
> I see this as a bottleneck in my tests.

Okay, we would certainly like to see even faster deadlock detection. This is often a bottleneck on our real production servers as well (so it isn't just a benchmark problem). Disabling deadlock detection is still something that is often required on our servers.

> Well, it's not an apples-to-apples comparison. The bitmap by default is always sized up
> to the heapno plus an 8-byte safety margin. So if you have <space, pageno, 512>, the size
> of the bitmap is (512 + 64) / 8 + 1 bytes. If you remove the bitmap, that is 72 bytes less
> for the non-bitmap version. Also, note that the bitmap doesn't expand: the old record lock
> is discarded and a new record lock with a bigger bitmap is created to replace it. The old
> lock(s) are the stale locks that I referred to earlier on the transaction's lock heap. It
> is most likely these locks that the customers are seeing. This waste of memory is for some
> reason ignored in any discussion about record locks :-). I'm very keen to see some numbers
> where this is also factored in, so that we can conclude whether the bitmap really is as
> useful as it is cracked up to be.

Okay, what exact information would you like from a production system? I'm not sure how to get the amount of memory used for the locks, or how to calculate what is saved by the bitmap vs. other methods. I know Percona has a patch which shows the amount of memory used by the locking subsystem, but I'm not sure if that is available on those systems or not.
[6 Jul 2010 0:53]
Sunny Bains
I have some GDB functions that I use to look at the memory used by locking. These functions can even report the memory wasted by stale locks. However, this technique is very intrusive :-).

One other option discussed with Heikki was to use the performance schema mechanism to report the memory usage and other metrics. I need to add this code; once I have something, we can work out the modalities of getting it to you, etc.

Which one do you prefer?
[6 Jul 2010 1:02]
Harrison Fisk
Performance schema is a non-starter since there isn't any 5.5 here yet. GDB might be possible, but I'll need to check on how we can get it done (probably on a shadow system). Can you attach the GDB functions?
[6 Jul 2010 1:41]
Sunny Bains
GDB functions to print lock stats
Attachment: gdb-funcs.txt (text/plain), 2.31 KiB.
[6 Jul 2010 1:52]
Sunny Bains
These are the functions that I've been using for my own testing/debugging. I think I should refine them further to print the relevant metrics to make the discussion more meaningful. The numbers required would be:

1. Average bits set per lock
2. Average memory used by all rec locks
3. Average memory in stale locks

(The averages are over samples taken every N seconds.)

Feel free to add to this list. I think the first should give us a feel for the bitmap utilisation: the higher the utilisation, the more useful the bitmap is in practice. 2 & 3 should give us the percentage of memory overhead used by stale locks.

I've attached the GDB functions so that you can check that they actually work with your code base. What do you think?
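For reference, metric 1 boils down to a population count over each lock's bitmap bytes; a minimal sketch in C of a hypothetical helper (not InnoDB code and not part of the attached GDB functions):

    #include <stddef.h>

    /* Count the bits set in one record lock's bitmap.  Summing this over
       all record locks and dividing by the number of locks gives metric 1
       above (average bits set per lock). */
    static unsigned int
    rec_lock_bits_set(const unsigned char *bitmap, size_t n_bytes)
    {
        unsigned int count = 0;

        for (size_t i = 0; i < n_bytes; i++) {
            unsigned char b = bitmap[i];

            while (b) {                         /* clear the lowest set bit */
                b &= (unsigned char) (b - 1);
                count++;
            }
        }

        return count;
    }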
[6 Jul 2010 2:35]
Harrison Fisk
> These are the functions that I've been using for my own testing/debugging.
>
> I think I should refine them further to print the relevant metrics to make
> the discussion more meaningful. The numbers required would be:
>
> 1. Average bits set per lock
> 2. Average memory used by all rec locks
> 3. Average memory in stale locks
>
> (The averages are over samples taken every N seconds.)
>
> Feel free to add to this list. I think the first should give us a feel for
> the bitmap utilisation: the higher the utilisation, the more useful the bitmap
> is in practice. 2 & 3 should give us the percentage of memory overhead used by
> stale locks.
>
> I've attached the GDB functions so that you can check that they actually work
> with your code base.

The print_all_rec_locks function is extremely slow since there is a large n_cell size, so I will avoid using that one.

print_trx_lock_heap_size is failing for me, saying "There is no member named lock." I think it should be:

    printf "0x%016lx %8ld\n", $trx, $trx->lock_heap->total_size

instead of:

    printf "0x%016lx %8ld\n", $trx, $trx->lock.heap->total_size

It seems to work with that change.

The additional information is fine by me. I just want to see the locking speed improvements, so if those will get you what you need to justify it, send me some GDB code and I'll try to get it done. If you have any patches available with your new locking code, I could potentially try those as well.
[6 Jul 2010 2:44]
Sunny Bains
Regarding trx_t::lock, that is part of my kernel mutex split code. Your change is correct for the existing code.

I don't really need to print the locks, I just need the aggregated result; I think that should be fast enough. I will try to get the GDB functions mentioned earlier to you ASAP.

I would also really like to get the kernel mutex split code, along with the other lock changes, to you, but that requires management approval. I need to check on that one.