Bug #73202 write/sync redo log before flush thread cache to binlog
Submitted: 5 Jul 2014 0:31 Modified: 24 Nov 2014 2:01
Reporter: zhai weixiang (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.6, 5.7.5 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[5 Jul 2014 0:31] zhai weixiang
Description:
According to the crash recovery logic of mysql server, we only need to write/sync prepared transaction before writing to binlog. And currently transactions are not grouped during innodb prepare stage.  

I changed the related code described as bellow:

1. don’t sync redo log during innodb prepare stage if binlog enabled.

2. during flush stage of group commit (In function process_flush_stage_queue)
a) collect  group commit queue

b) call ha_flush_logs to sync redo log (make sure all redo log of prepared transaction in current 
queue were synced before writing to binlog)
--innobase_flush_logs have to be changed.

c) flush thread cache to binlog file for every thd in queue

a simple test:
sysbench , 20 tables with 100,000 records, update_index.lua

binlog_enhanced_flush : 0 means default logic, 1 means new logic described above (based on MySQL5.6.16)

threads binlog_enhanced_flush = 0 binlog_enhanced_flush = 1
8 2740 3000
16 4350 5400
32 7300 9800
64 11700 15800
128 18800 24000
256 24000 30000
512 30000 39000

How to repeat:
read the code 

Suggested fix:
I will attach a simple patch later to prove the performance improvement
[7 Jul 2014 7:07] zhai weixiang
patch based on 5.7.4, add new option binlog_enhanced_sync to control the new behavior

Attachment: bug73202.diff (application/octet-stream, text), 13.24 KiB.

[15 Jul 2014 14:29] Sveta Smirnova
Thank you for the report and contribution.

But of us to accept your contribution you need to sign OCA and contribute through "Contribution" tab. Please do it.
[16 Jul 2014 8:45] zhai weixiang
updated patch, code cleanup

Attachment: bug73202.diff (application/octet-stream, text), 12.96 KiB.

[18 Jul 2014 14:41] zhai weixiang
re-attach the patch

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug73202.diff (application/octet-stream, text), 10.94 KiB.

[4 Aug 2014 5:54] zhai weixiang
is there any comment ?
[5 Aug 2014 12:15] Sveta Smirnova
Thank you for the feedback and re-attaching the patch.

We still need to run the tests in our environment to confirm that the bug exists. Please be patient: it will take some time.
[5 Aug 2014 18:52] Inaam Rana
This is an interesting idea. Couple of comments about the patch:
* I am not clear why you need to make all the changes to MYSQL_BIN_LOG::process_flush_stage_queue(). I'd say something like following at the start of the function should do the trick:

if (enhanced_sync)
  ha_flush_logs(NULL, 1);

* There is a bit of a race where opt_binlog_enhanced_sync can change between the time it is read in innodb prepare code path and later in process_flush_stage_queue().
[6 Aug 2014 0:20] zhai weixiang
hi, Inaam

Thanks for your reply.

First:

The main logic of changes in function process_flush_stage_queue :
step 1: collect group commit queue
step 2: write/sync redo log
step 3: flush binlog cache.

Suppose we swap step 1 and step2:
step 1: write/sync redo log
step 2: collect group commit queue
step 3: flush binlog cache
a transaction may haven't prepared (so it's redo log is not copied to log buffer after step1) but quickly entered into ordered_commit and marked itself as a follower before step 2.

Second:
Yes, you're right. this option should remain unchanged between innodb prepare and binlog commit. But I don't know how to solve this problem gracefully :( . Since this option is rarely changed, I decide to tolerate this race currently. But of course, this should be fixed.
[6 Aug 2014 1:59] Inaam Rana
I don't fully understand this part of code but won't having LOCK_log prevent incoming requests to be queued in the flush queue?
[6 Aug 2014 2:10] zhai weixiang
Only the leader of group has to acquire LOCK_log. Check the code in function MYSQL_BIN_LOG::change_stage -> Stage_manager::enroll_for
[11 Aug 2014 23:51] Sveta Smirnova
Thank you for the report.

Verified as described.

My results on 8-cores machine:

	no patch	patch
4	1916.32		2330.66
8	2840.54		3200.32 
16	3506.79		4369.22
32	1994.24		2775.53
[13 Aug 2014 14:42] Inaam Rana
Weixiang,

Thanks for the pointer. I'll look into that code.
One suggestion is that though the current approach should work but I think we should leave the write during innodb prepare as is. That is, we only attempt to group the sync operation which is the real heavy duty call.
[14 Aug 2014 0:09] Sunny Bains
I agree with Inaam, we should split the fix into two and do the binlog changes first and then decide on the InnoDB changes.
[14 Aug 2014 2:59] zhai weixiang
Maybe we can add some option to let user choose: group redo sync or group redo write. :-) because we can also reduce log_sys mutex contention besides IO overhead if log_write_up_to is not called during innodb prepare.
[19 Nov 2014 13:26] David Moss
Posted by developer:
 
Thanks for your feedback. The following was added to the 5.6.23 and 5.7.6 changelog with commit 4672:
When using SHOW SLAVE STATUS to monitor replication performance, Seconds_Behind_Master sometimes displayed unexpected lag behind the master. This was caused by Previous_gtids_log_events being written to the slave's relay log with a timestamp behind the master, and then being used to calculate the Seconds_Behind_Master. This fix ensures that events generated on the slave that are added to the relay log and not used when calculating Seconds_Behind_Master.
[19 Nov 2014 14:12] zhai weixiang
Wait..David, the changelog entry you posted seems not related to this bug report..
[20 Nov 2014 7:36] zhai weixiang
let me re-open this bug :)
[20 Nov 2014 10:23] David Moss
Hello Zhai,
yes you are right, sorry that was my mistake. Let me double check the status of this and get back to you.
[20 Nov 2014 11:08] David Moss
OK so I have spoken to the Replication team and they will follow up in regards to your contribution personally. I will close the bug again :)
[24 Nov 2014 2:01] zhai weixiang
Hi, David, So what is the state of this bug ? Is it fixed in the next release ?
[4 Dec 2014 16:12] David Moss
Hello Zhai, this is the content that has been added to the 5.7.6 changelog:
When using InnoDB with binary logging enabled, concurrent transactions written in the InnoDB redo log are now grouped together before synchronizing to disk when innodb_flush_log_at_trx_commit is set to 1, which reduces the amount of synchronization operations. This can lead to improved performance.

The replication team will follow up to let you know how they worked with your contribution. Thanks!
[4 Dec 2014 16:20] Luis Soares
Zhai,

Thank you.

Your contributed patch inspired us and served as a foundation for
writing a fix for the issue at hand.

Regards,
Luís