Bug #71727  Follow up work on Bug#16249481- INNODB DOES NOT SCALE WELL ON 12 CORE ...
Submitted: 14 Feb 2014 23:30    Modified: 28 Feb 2014 10:24
Reporter: Inaam Rana (OCA)    Status: Analyzing
Category: MySQL Server: InnoDB storage engine    Severity: S5 (Performance)
Version: 5.6.16    OS: Any
Assigned to: Assigned Account    CPU Architecture: Any

[14 Feb 2014 23:30] Inaam Rana
Description:
The following is suggested follow-up work for the revision:

revno: 5677
committer: Sunny Bains <Sunny.Bains@Oracle.Com>
branch nick: 5.6
timestamp: Tue 2013-12-10 14:30:34 +0530
message:
  Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  
  Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  
  If this option is enabled then use atomic reference counting to track
  block use. If it is off then use the old way.
  
  Approved by Yasufumi Kinoshita rb#3958.

1) In buf_flush_LRU_list_batch():

1475                 if (evict) {
1476                         if (buf_LRU_free_page(bpage, true)) {
1477                                 /* buf_pool->mutex was potentially
1478                                 released and reacquired. */
1479                                 bpage = UT_LIST_GET_LAST(buf_pool->LRU);
1480                         } else {
1481                                 bpage = UT_LIST_GET_PREV(LRU, bpage);
1482                         }

We should do a ++count when buf_LRU_free_page() succeeds. Otherwise we may spend much more time in the LRU batch than is necessary.
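
A minimal sketch of the suggested change against the excerpt above (untested, for illustration only):

                if (evict) {
                        if (buf_LRU_free_page(bpage, true)) {
                                /* buf_pool->mutex was potentially
                                released and reacquired. */

                                /* Count the evicted page towards the
                                batch so that the batch does not run
                                longer than requested. */
                                ++count;

                                bpage = UT_LIST_GET_LAST(buf_pool->LRU);
                        } else {
                                bpage = UT_LIST_GET_PREV(LRU, bpage);
                        }
                }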

2)
1505                                 /* buf_pool->mutex was released.
1506                                 reposition the iterator. Note: the
1507                                 prev block could have been repositioned
1508                                 too but that should be rare. */
1509 
1510                                 if (prev_bpage != NULL) {
1511 
1512                                         ut_ad(space != ULINT_UNDEFINED);
1513                                         ut_ad(offset != ULINT_UNDEFINED);
1514 
1515                                         prev_bpage = buf_page_hash_get(
1516                                                 buf_pool, space, offset);
1517                                 }
1518 

The idea of using a page_hash lookup to check the integrity of the prev_bpage pointer is smart. However, the underlying assumption that this happens rarely is not necessarily true. In particular, when neighbor flushing is enabled it is quite common for prev_bpage to be one of the neighboring pages in the same extent, which may have been flushed and removed from the LRU by the time we reach here.

3) If I understand it correctly, in this patch we have relaxed the condition around LRU and single-page flushing in buf_flush_ready_for_flush() by not checking buf_fix_count, and have added that constraint in buf_flush_page() instead. Given this, what is the need for buf_flush_ready_for_flush() at all? Wouldn't it make the code much simpler to get rid of this function, since it is always followed by buf_flush_page()?
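
The call pattern in question, and the proposed collapse, look roughly like this (call shapes approximated, not the exact 5.6 signatures; for illustration only):

        /* Current pattern: the caller first asks whether the page can
        be flushed, then flushes it. */
        if (buf_flush_ready_for_flush(bpage, flush_type)) {
                buf_flush_page(buf_pool, bpage, flush_type, sync);
        }

        /* Proposed simplification: fold the readiness check into
        buf_flush_page() itself and have it report whether it actually
        flushed, so the separate predicate can go away. */
        flushed = buf_flush_page(buf_pool, bpage, flush_type, sync);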

4) In buf_buddy_relocate():

595         if (buf_page_can_relocate(bpage)) {
596                 /* Relocate the compressed page. */
597                 ullint  usec = ut_time_us(NULL);
598 
599                 ut_a(bpage->zip.data == src);
600 
601                 /* Note: This is potentially expensive, we need a better
602                 solution here. We go with correctness for now. */
603                 ::memcpy(dst, src, size);
604 
605                 bpage->zip.data = reinterpret_cast<page_zip_t*>(dst);
606 
607                 rw_lock_x_unlock(hash_lock);
608 
609                 mutex_exit(block_mutex);

As the note mentions, it can be expensive to hold hash_lock for the duration of the memcpy. Can we do something like flipping the pointers, setting the IO fix and then releasing the hash_lock? Or, if that is not safe, maybe we could have an option to avoid coalescing compressed blocks until the list reaches a certain threshold? Come to think of it, keeping a reasonably sized free list of compressed blocks may help us in other ways too: most people use an 8K block size for compression, so we could avoid a lot of the relocate -> coalesce -> release to the free list -> split -> get an 8K block sequence.
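
As a self-contained illustration of the general idea under discussion (keep the expensive copy outside the hash lock and publish the new location afterwards), here is a toy model in C++. std::shared_mutex stands in for hash_lock and a per-page mutex for block_mutex; this is not InnoDB code, just the locking pattern:

        #include <mutex>
        #include <shared_mutex>
        #include <unordered_map>
        #include <vector>

        struct Page {
                std::mutex         block_mutex;  /* stands in for the block mutex */
                std::vector<char>* zip_data;     /* stands in for bpage->zip.data */
        };

        std::shared_mutex              hash_lock;  /* stands in for the page-hash lock */
        std::unordered_map<int, Page*> page_hash;

        /* Relocate page_no's compressed data into *dst. hash_lock is held
        only long enough to look the page up and pin it; the copy itself runs
        under the per-page mutex alone. Readers are assumed to hold the
        page's block_mutex before dereferencing zip_data, which serializes
        them against the relocation. */
        bool relocate(int page_no, std::vector<char>* dst)
        {
                Page*   page;

                {
                        std::unique_lock<std::shared_mutex> hl(hash_lock);

                        auto it = page_hash.find(page_no);

                        if (it == page_hash.end()) {
                                return false;
                        }

                        page = it->second;
                        page->block_mutex.lock();   /* pin before dropping hash_lock */
                }       /* hash_lock released here, before the copy */

                *dst = *page->zip_data;     /* the potentially expensive copy */
                page->zip_data = dst;       /* publish the new location; freeing
                                            the old buffer is omitted in this toy */

                page->block_mutex.unlock();

                return true;
        }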

How to repeat:
see description.
[18 Feb 2014 11:57] Laurynas Biveinis
Inaam -

I also have some concerns about the prev_bpage pointer, but they are different from yours. First, I think the scenario you describe (a neighbor flush flushing and removing the prev_bpage page from the LRU list) should indeed be very rare, since LRU flushes do not remove pages from the LRU list, only evictions do; so for it to occur we would need a single-page LRU flush going on in parallel.

Second, I am worried that on an overloaded server the prev_bpage page may become young in the meantime. Then the page cleaner resumes and flushes it, when it shouldn't have. If more pages manage to become young at the same time, the page cleaner ends up flushing the wrong end of the LRU list in that iteration, unless I'm missing something.
[18 Feb 2014 13:45] Inaam Rana
Laurynas,

You are right. I confused 5.6 with 5.7, where flushing and eviction happen in a single step.

Your concern about prev_bpage moving to the MRU end seems valid, but given how unlikely that is, my guess is that we can live with it.
[28 Feb 2014 10:24] Arnaud Adant
Thanks, Inaam, for this suggestion. Do you have a test case showing that it is an issue on some workloads?
[5 Mar 2014 0:31] Sunny Bains
Inaam, as an experiment I tried this patch on 5.7:

=== modified file 'storage/innobase/buf/buf0buddy.cc'
--- storage/innobase/buf/buf0buddy.cc	revid:bjorn.munch@oracle.com-20140303171910-l7kxafmxk1xb3mno
+++ storage/innobase/buf/buf0buddy.cc	2014-03-05 00:23:32 +0000
@@ -590,6 +590,8 @@ buf_buddy_relocate(
 		return(false);
 	}
 
+	rw_lock_x_unlock(hash_lock);
+
 	/* The block must have been allocated, but it may
 	contain uninitialized data. */
 	UNIV_MEM_ASSERT_W(src, size);
@@ -605,9 +607,8 @@ buf_buddy_relocate(
 		ut_a(bpage->zip.data == src);
 
 		memcpy(dst, src, size);
-		bpage->zip.data = reinterpret_cast<page_zip_t*>(dst);
 
-		rw_lock_x_unlock(hash_lock);
+		bpage->zip.data = reinterpret_cast<page_zip_t*>(dst);
 
 		mutex_exit(block_mutex);
 
It made some difference in the test: an in-memory Sysbench OLTP RW run at 64 threads with an 8K key block size. The server had 16 physical cores (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz).

The hash_lock is the bottleneck in the test both with and without this patch. I think there is some room to improve this code.

Regarding the other points, it looks like you two have cancelled out each other's concerns :-)
[30 Jan 2022 21:54] Roel Van de Paar
Analyzing since 2014?