Bug #67808: in the InnoDB engine, doublewrite and multiple buffer pool instances reduce concurrency
Submitted: 5 Dec 2012 2:45    Modified: 26 Dec 2012 2:34
Reporter: wang huai    Status: Verified
Category: MySQL Server: InnoDB storage engine    Severity: S2 (Serious)
Version: 5.5.*    OS: Any
Assigned to:    CPU Architecture: Any
Tags: concurrency, double write, innodb, multi-buffer pool instance

[5 Dec 2012 2:45] wang huai
Description:
In the InnoDB engine, if the doublewrite buffer and multiple buffer pool instances are configured and redo log is generated at high speed, the QPS becomes very low.
For every modifying statement, InnoDB checks whether a checkpoint is needed by calling the function "log_free_check".
If it is time to make a checkpoint, many user threads enter a long wait (waiting for one thread to finish the checkpoint and the doublewrite), because in the function buf_flush_buffered_writes there is only one mutex, "trx_doublewrite->mutex", while there are multiple buffer pool instances and therefore many user threads. If one thread holds trx_doublewrite->mutex, the others have to wait. So after running for a short time, the concurrency of InnoDB becomes very low; it decreases almost linearly.
In fact, the wait is not necessary. If one thread is already making the checkpoint, the others do not need to do it at all; they can return from log_free_check and continue executing, because the checkpoint only needs to be done by one thread.
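To illustrate the effect described above, here is a standalone sketch (not InnoDB source code; the thread function merely stands in for buf_flush_buffered_writes) showing how a single global mutex serializes the flush phase of 8 buffer pool instances:

/* Standalone illustration (not InnoDB code): with one doublewrite mutex,
 * 8 "buffer pool instance" threads flush one after another, so the flush
 * phase degenerates to single-threaded execution. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define N_INSTANCES 8

static pthread_mutex_t doublewrite_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *flush_instance(void *arg)
{
    long id = (long) arg;

    /* stands in for buf_flush_buffered_writes(): every instance funnels
     * through the same server-wide doublewrite buffer */
    pthread_mutex_lock(&doublewrite_mutex);
    printf("instance %ld flushing through the doublewrite buffer\n", id);
    usleep(100 * 1000);         /* simulate doublewrite + data file writes */
    pthread_mutex_unlock(&doublewrite_mutex);
    return NULL;
}

int main(void)
{
    pthread_t threads[N_INSTANCES];
    long i;

    for (i = 0; i < N_INSTANCES; i++)
        pthread_create(&threads[i], NULL, flush_instance, (void *) i);
    for (i = 0; i < N_INSTANCES; i++)
        pthread_join(threads[i], NULL);
    return 0;                   /* total time ~ 8 x 100 ms, not ~100 ms */
}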

How to repeat:
In the InnoDB engine, configure the doublewrite buffer and multiple buffer pool instances, generate redo log at high speed, and have many user threads doing inserts, deletes and updates. After a short time (when there is not enough log space and InnoDB must flush buffer pool pages and make a checkpoint), the QPS gets lower and lower.

Suggested fix:
The wait is not necessary. If one thread is already making the checkpoint, the others do not need to do it at all; they can return from log_free_check and continue executing, because the checkpoint only needs to be done by one thread.
All that is needed is the following: add a member to log_sys, named checkpoint_doing, and check it in the function log_check_margins. If it is TRUE, do nothing and return; if it is FALSE, set it to TRUE and make the checkpoint. Of course, these operations need a mutex. Done this way, the concurrency is fine.
But if only this is done, the problem is that log->lsn may run too far ahead of log->last_checkpoint_lsn, with the result that some log is overwritten incorrectly. So one more thing is needed, in the function log_write_up_to: if "log_sys->lsn - log_sys->last_checkpoint_lsn > log_sys->max_checkpoint_age" (this is like the if statement in the function log_checkpoint_margin, "if (checkpoint_age > log->max_checkpoint_age)"), then execute the function log_checkpoint_margin and re-execute log_write_up_to (goto loop directly).
Done this way, the concurrency has no problem; everything works.
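Below is a standalone sketch of this idea (the names log_sys_sketch and checkpoint_doing are illustrative only, not actual InnoDB code): the first thread that finds the flag clear sets it and performs the checkpoint, and every other thread returns from the margin check immediately instead of waiting.

/* Standalone sketch of the proposed checkpoint_doing flag (illustrative
 * names, not actual InnoDB code). */
#include <pthread.h>
#include <stdbool.h>

struct log_sys_sketch {
    pthread_mutex_t mutex;
    bool            checkpoint_doing;   /* proposed new member */
};

static struct log_sys_sketch log_sys_sketch = {
    PTHREAD_MUTEX_INITIALIZER, false
};

static void do_checkpoint_sketch(void)
{
    /* flush dirty pages, write the checkpoint record, ... */
}

void log_check_margins_sketch(void)
{
    pthread_mutex_lock(&log_sys_sketch.mutex);
    if (log_sys_sketch.checkpoint_doing) {
        /* another thread is already checkpointing: return, do not wait */
        pthread_mutex_unlock(&log_sys_sketch.mutex);
        return;
    }
    log_sys_sketch.checkpoint_doing = true;
    pthread_mutex_unlock(&log_sys_sketch.mutex);

    do_checkpoint_sketch();

    pthread_mutex_lock(&log_sys_sketch.mutex);
    log_sys_sketch.checkpoint_doing = false;
    pthread_mutex_unlock(&log_sys_sketch.mutex);
}

As the previous paragraph notes, skipping the wait is only safe together with the extra age check in log_write_up_to; otherwise the log could wrap over records that still describe unflushed pages.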
[5 Dec 2012 22:00] Sveta Smirnova
Thank you for the report.

Please send us your configuration file and information about your operating system: how many cores you have, etc.
[7 Dec 2012 2:14] wang huai
I'm sorry that I forgot to reply to this message earlier. I am attaching the configuration options and machine information below.
#MySQL 5.5.8
[client]
port            = 3306
socket          = /data/var/mysql.sock

[mysqld]
port            = 3306
socket          = /data/var/mysql.sock
user=mysql

skip-slave-start
skip-name-resolve

key_buffer_size=2M
read_buffer_size=2M
sort_buffer_size=2M

#5.1.3 table_cache deprecated
table_open_cache=16384

max_allowed_packet = 64M
max_connect_errors = 1000000
#net_buffer_length = 8K
#read_rnd_buffer_size = 512K

#default-storage-engine=InnoDB
character-set-server=utf8
collation-server=utf8_bin
init-connect='SET NAMES utf8'

max_connections=32000
binlog_cache_size=2M

query_cache_size = 256M
query_cache_limit = 2M

slave_apply_threads=8

#READ-UNCOMMITTED, READ-COMMITTED, REPEATABLE-READ, SERIALIZABLE
transaction_isolation = REPEATABLE-READ
tmp_table_size=256M

#--------log---------
slow_query_log=1
slow_query_log_file=slow.log
long_query_time = 1
log-error = err.log
log_slave_updates=0
expire_logs_days=99
binlog_format=statement
log-bin=fb-bin
#last intranet ip
server-id = 162051
datadir=/data/var

tmpdir=/dev/shm
#--------innodb--------------
#innodb_data_home_dir = /data/ibdata
innodb_autoextend_increment = 256
innodb_data_file_path=ibdata1:10M;ibdata2:10M:autoextend
innodb_table_locks = 0
innodb_lock_wait_timeout = 10
innodb_old_blocks_time=1000
#innodb_change_buffering=none
#(50-80)% * total memory ( 5G )
innodb_buffer_pool_size=10G
innodb_buffer_pool_instances=8
innodb_additional_mem_pool_size=128M
innodb_max_dirty_pages_pct=80
innodb_read_io_threads = 8
innodb_write_io_threads =8
innodb_io_capacity=2000
#innodb_doublewrite=0
#ssd
#innodb_flush_method = O_DSYNC

#A recommended value is 2 times the number of CPUs plus the number of disks.
#5.5.8 default 0
innodb_thread_concurrency = 32
#innodb_log_group_home_dir = /data/iblogs
innodb_log_files_in_group =3
innodb_flush_log_at_trx_commit =0
sync_binlog=3000
innodb_support_xa=0
innodb_file_per_table=1

innodb_log_file_size = 1G #25% *buffer pool size (1G)
innodb_log_buffer_size = 96M

relay-log=fb-relay

## the machine information is as follows:
8 CPU cores
2.4 GHz
256 GB memory

Is there anything else you need?
[7 Dec 2012 21:31] Sveta Smirnova
Thank you for the feedback.

I cannot repeat the described behavior with the sysbench OLTP test.

With --innodb_doublewrite=1 --innodb_buffer_pool_instances=4 (I have 4 cores, so there is no sense in having more):

OLTP test statistics:
    queries performed:
        read:                            3463586
        write:                           989595
        other:                           494797
        total:                           4947978
    transactions:                        247398 (266.02 per sec.)
    deadlocks:                           1      (0.00 per sec.)
    read/write requests:                 4453181 (4788.32 per sec.)
    other operations:                    494797 (532.03 per sec.)

With --innodb_doublewrite=0 --innodb_buffer_pool_instances=4:

OLTP test statistics:
    queries performed:
        read:                            2332330
        write:                           666380
        other:                           333190
        total:                           3331900
    transactions:                        166595 (179.13 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 2998710 (3224.38 per sec.)
    other operations:                    333190 (358.26 per sec.)

With --innodb_doublewrite=1 --innodb_buffer_pool_instances=4 --innodb_read_io_threads=4 --innodb_write_io_threads=4 --innodb_thread_concurrency=8 it is even slower:

OLTP test statistics:
    queries performed:
        read:                            1745408
        write:                           498687
        other:                           249343
        total:                           2493438
    transactions:                        124671 (134.05 per sec.)
    deadlocks:                           1      (0.00 per sec.)
    read/write requests:                 2244095 (2412.97 per sec.)
    other operations:                    249343 (268.11 per sec.)

Please try with smaller innodb_thread_concurrency (I suggest 8 for your environment) and inform us if this solves the problem.
[8 Dec 2012 2:47] wang huai
I'm very happy to receive your reply. This problem cannot be reproduced by the usual method; there are several points to bear in mind, as follows:
1. The amount of redo log is very large and it is generated very fast.
2. There are several SQL threads, roughly as many as the number of buffer pool instances.
3. These SQL threads execute in parallel, like multiple sessions running at the same time.
4. You have to watch the progress of the run, not just the final result, because at the start of execution there is little log, so the speed is high; but after a short while (several minutes) there is a lot of log, InnoDB has to make a checkpoint, and from that point on the QPS gets lower and lower.
5. The sessions autocommit every SQL statement, and every statement modifies the tables; there are no read statements.

The QPS mentioned above is the number of transactions that have been completed per minute.
The method is as follows:
use information_schema;
select variable_value into @num_com from global_status where variable_name ='COM_COMMIT';
select variable_value into @num_roll from global_status where variable_name ='COM_ROLLBACK';
select variable_value into @uptime from global_status where variable_name ='uptime';
select @num_com, @num_roll, @uptime, (@num_com+@num_roll)/@uptime;
The QPS is the result of the last SQL statement.

Done as above, it should show the problem. If it still does not, please check the functions log_free_check and log_check_margins, and check the doublewrite mutex: page flushing is multi-threaded, but the mutex is single, so the other threads must wait, because many user threads can enter the functions log_check_margins and log_checkpoint_margin at the same time.

thank you very much
[12 Dec 2012 18:03] Sveta Smirnova
Thank you for the feedback.

I still cannot repeat the described behavior.

Actually, the sysbench OLTP test case does something like what you described here. Could you please create a repeatable test case?

It would be better if you take SysBench from Launchpad: https://launchpad.net/sysbench and write a test in Lua format (an example is oltp.lua in tests/db/oltp.lua).
[13 Dec 2012 4:26] wang huai
I want to say that testing this problem with sysbench is not suitable, because it issues SQL statements against only one table, so the concurrency of MySQL cannot be very high.
So, to raise the concurrency of MySQL, there must be 8 sessions like mine, and every session works on its own table, different from the others; every session executes insert statements.
Please switch to a machine with 8 CPU cores, with the configuration I gave you earlier.
You should check the QPS using the calculation method I gave you last time.
Once MySQL starts and you begin the test, make sure the Uptime status starts from 0; from then on, check the QPS once every few minutes and you should see the problem.

I hope your success.
thank you.
[13 Dec 2012 13:33] Sveta Smirnova
Thank you for the feedback.

But SysBench from Launchpad is scriptable and allows as many tables as you want. I used 8 tables. This is why I asked you about a test case.
[13 Dec 2012 14:04] wang huai
Thank you for your feedback.
The test case is multi-threaded slave replication, distributed by table name. I configured 8 threads; the SQL thread dispatches each event to one of the worker threads by a hash of the table name. There is a lot of binlog coming from the master, so the slave is very busy. The events are applied by the function apply_event.
That is the environment in which this problem occurs.
I do not know why the problem is not reproduced on your side. Here, my classmate can also repeat it; his method is similar to sysbench.
Do you have 8 buffer pool instances?
[13 Dec 2012 14:29] Sveta Smirnova
Thank you for the feedback.

Let me explain why I ask you about test case first.

There are known performance impacts caused by the number of concurrent threads, the doublewrite buffer and so on. Related results are described at:

http://dimitrik.free.fr/blog/archives/2012/10/mysql-performance-innodb-buffer-pool-instanc...

http://dimitrik.free.fr/blog/archives/2011/01/mysql-performance-innodb-double-write-buffer...

http://dimitrik.free.fr/blog/archives/2009/08/mysql-performance-innodb-doublewrite-buffer-...

In the past (prior to 5.5) doublewrite was expected to slow down performance a bit, but, according to the last article, it even increases performance now.

Also, the number of buffer pool instances and concurrent threads should depend on the number of cores.

For example,

> Do you have 8 buffer pool instances?

If I set the number of buffer pool instances to 8 on my 4-core machine I would get a performance decrease, and this would not be a bug, but a hardware limitation. So I need to use 4 (or find a better machine). Generally, if I repeat the sequence of actions properly, the performance improvement/degradation should scale here.

And this is yet another reason why I want a test case from you before I start trying to find a better machine: I want to be sure I am doing the right things.

Actually, whether it is SysBench or not does not matter: just something which we can compile and run.
[13 Dec 2012 14:45] wang huai
Thank you for your feedback.
In my last message I described the test case; it is not usable in your environment. Now the only difference between us is the number of CPU cores and the CPU speed, which results in different concurrency and pressure on MySQL. This problem depends on exactly that point: high-speed log generation and frequent checkpoints, so that the bottleneck can show.
I think 4 concurrent threads are not enough.
Thank you for your review.
[13 Dec 2012 15:58] Sveta Smirnova
Thank you for the feedback.

OK, I will try to find a better machine and run the same test. I also see one more difference between what I tested and your last comment: I ignored the fact that it is a slave.

Could you please confirm whether the slave receives all this load from the master?
[13 Dec 2012 23:48] wang huai
I think it is not important that it is a slave; it is only simple concurrency, and all the SQL comes from the master. But you cannot test with a slave, because a stock slave has only a single SQL thread. The reason I can is that I have modified the source to support simple SQL concurrency, which satisfies our application, and it was while testing this project that I found the problem.
In your test, make sure it runs for a long time, and watch how the QPS changes. You can also check the utilization in 'iostat -x 1'; it is very high when InnoDB does the first checkpoint, around 90% all the time.
My disk is not SSD, it is SATA; you can configure the buffer pool to be very large.
[14 Dec 2012 11:31] Sveta Smirnova
Thank you for the feedback.

So you don't use our multi-threaded slave, but your own patch. Yes, I see: slave_apply_threads=8 is not slave_parallel_workers=8. Is this problem repeatable with unmodified MySQL from Oracle?
[14 Dec 2012 11:45] wang huai
Yes, the problem exists there as well. In order to verify it, my classmate also tested it with another method: he wrote a program with concurrency like the multi-threaded slave's, and the problem also appeared.
The multi-threaded slave is my own patch, distributing events by table name; only this results in high concurrency.
So I think you should find a better computer and set up more threads inserting into different tables; done like this, it should show the problem.

Thank you for your feedback. My English is poor and some sentences may not be well phrased; please forgive me. Thank you very much.
[19 Dec 2012 3:40] Mark Callaghan
Your English and performance debugging efforts are good. I learned something by reading this. I think that storage HW will make it easy or hard to reproduce this. Writes are much slower without HW RAID or SSD storage. Are you using either?
[19 Dec 2012 4:23] wang huai
thank you for your praise.
The disk I used is RAID, but I do not know what "storage HW" means as you used it; I want to know whether the problem has been reproduced.
But I think it is very simple to reproduce it again.
[19 Dec 2012 6:26] Justin Swanhart
Why can't you provide your classmate's test program?
[20 Dec 2012 6:37] wang huai
the test case to reproduce the problem

Attachment: 1 (text/plain), 10.59 KiB.

[20 Dec 2012 17:57] Sveta Smirnova
Thank you for the test case.

You can attach your test_server program to the bug report or upload it to the Oracle FTP server as described in the "Files" tab.
[21 Dec 2012 4:19] wang huai
OK, I have uploaded the program to the Oracle FTP server; the file name is test_server.tar.gz. I do not know whether you can find it. Thank you and good luck. It can be run directly.
[25 Dec 2012 13:49] Sveta Smirnova
Thank you for the test case.

With the test case I can repeat a slight slowdown with innodb_doublewrite=1 and innodb_buffer_pool_instances=8, but I didn't get stalls like in your case:

69171                 92478 
3071                  87945 
3                     96386 
2                     96027 
4                     92605 
24                    87716 
91                    86999 
121                   88956 
137                   81008 
279                   91037 
447                   95036 
1786                  8941

Nothing like this.

In my case this looks like:

$cat dw1t8.log | sort | head -n 50

18425
19132
20581
20981
21217
...
26442
26493

and with innodb_doublewrite=1 and innodb_buffer_pool_instances=1 results are slightly better:

$cat dw1t1.log | sort | head -n 20

19108
21302
23201
...
26210
26372
27051

Note: the first output is the lowest 50 values, while the second is only the lowest 20.
[26 Dec 2012 2:34] wang huai
I'm very glad to receive your reply.
But why didn't you reproduce the result? Did you follow my last message step by step? Is it 8 concurrent threads? Is it 8 tables? Is it 200000 rows per table? Does the computer have 8 cores? Did you use my test case completely? Did you use exactly the parameters I gave you?
If so, I think you can reproduce it easily.
Note: the version I used is 5.5.27, but as far as I know, all 5.5.* versions have this problem.
Thank you for your patience.
[26 Dec 2012 10:52] Sveta Smirnova
Thank you for the feedback.

> I'm very glad to receive your reply. But why didn't you reproduce the result?

Because this is not repeatable on my machine with all the settings as you provided them.
[13 Jun 2013 18:01] Inaam Rana
Wang Huai,

There are two separate issues. Let's not mix them up.

We have one doublewrite buffer system-wide. This buffer has 128 pages. All page writes to the database pass through this doublewrite buffer. As we scale better with multiple buffer pool instances, it is possible that this doublewrite buffer becomes a bottleneck. We are aware of this and are working on it.

log_free_check() is not related to the doublewrite buffer directly. It is about the redo log and when and how we can reuse it. The pain point is when we have to flush pages from the buffer pool, not the actual writing of the checkpoint record to the redo log. When we make all threads wait there is a good reason for it: we cannot reuse the redo log until all the dirty pages belonging to that part of it have been flushed. We make all threads wait on purpose. It is working as designed.
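Roughly, the constraint is as follows (an illustrative sketch only, not the actual server code; the field names mirror the ones discussed earlier in this report):

/* Illustrative sketch of the redo-log reuse constraint (not actual
 * InnoDB code): the circular redo log cannot be overwritten while it
 * still describes unflushed dirty pages, so a thread about to generate
 * more redo has to wait until the checkpoint has advanced far enough. */
#include <stdint.h>

typedef uint64_t lsn_t;

struct log_state_sketch {
    lsn_t lsn;                  /* current end of the redo log          */
    lsn_t last_checkpoint_lsn;  /* pages older than this are on disk    */
    lsn_t max_checkpoint_age;   /* derived from the total redo capacity */
};

/* Returns nonzero if the caller must wait for page flushing and a new
 * checkpoint before it may safely write more redo. */
int log_free_check_sketch(const struct log_state_sketch *log)
{
    lsn_t checkpoint_age = log->lsn - log->last_checkpoint_lsn;

    return checkpoint_age > log->max_checkpoint_age;
}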

If you see log_free_check() causing stalls, then the configuration workaround is to increase the redo log size and enable adaptive flushing. You have to set the relevant parameters to make adaptive flushing more aggressive, such as innodb_io_capacity, innodb_io_capacity_max (in 5.6), etc.