Bug #45819 Semisynchronous replication: unstable operation with row-based binlog
Submitted: 29 Jun 2009 8:20 Modified: 12 Nov 2009 12:23
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.4 OS:Any
Assigned to: Zhenxing He CPU Architecture:Any

[29 Jun 2009 8:20] Philip Stoev
Description:
When executing a very simple, non-concurrent workload against the master with row-based binlog, semisynchronous replication repeatedly switches ON and OFF, resulting in numerous transactions that were replicated asynchronously.

This is not observed when using statement-based binlogging. The patch from

http://lists.mysql.com/commits/77401

was used to get the test to proceed that far.

There is how SHOW STATUS looks like after a short test:

+--------------------------------------------+---------+
| Variable_name                              | Value   |
+--------------------------------------------+---------+
| Rpl_semi_sync_master_clients               | 1       |
| Rpl_semi_sync_master_net_avg_wait_time     | 0       |
| Rpl_semi_sync_master_net_wait_time         | 0       |
| Rpl_semi_sync_master_net_waits             | 0       |
| Rpl_semi_sync_master_no_times              | 5       |
| Rpl_semi_sync_master_no_tx                 | 317     |
| Rpl_semi_sync_master_status                | ON      |
| Rpl_semi_sync_master_timefunc_failures     | 0       |
| Rpl_semi_sync_master_tx_avg_wait_time      | 4047    |
| Rpl_semi_sync_master_tx_wait_time          | 5374672 |
| Rpl_semi_sync_master_tx_waits              | 1328    |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0       |
| Rpl_semi_sync_master_wait_sessions         | 0       |
| Rpl_semi_sync_master_yes_tx                | 1328    |
| Rpl_semi_sync_slave_status                 | OFF     |
+--------------------------------------------+---------+

Approximately 25% of the transactions were replicated asynchronously, and there is no indication in the logs as to why this happened. Better debug output in the logs is a must for testing and troubleshooting semisynchronous replication.

The machine is a four-core Xeon, fully capable of supporting a single client on the master and the slave with plenty of memory and CPU left over.

How to repeat:
Branch the mysql-test-extra-6.0 tree, cd to mysql-test/gentest and then run:

$ perl runall.pl --basedir=/build/bzr/azalea-bugfixing --grammar=conf/replication_simple.yy --rpl_mode=row --mem --mysqld=--plugin-dir=/build/bzr/azalea-bugfixing/plugin/semisync/.libs --mysqld=--plugin-load=rpl_semi_sync_master=libsemisync_master.so:rpl_semi_sync_slave=libsemisync_slave.so --validators= --mysqld=--rpl_semi_sync_master_enabled=1 --mysqld=--rpl_semi_sync_slave_enabled=1 --duration=180 --threads=1

This is going to start a very simple workload. If you run SHOW STATUS in a separate console, you will observe changes to the Rpl_semi_sync_master_status variable and and increase Rpl_semi_sync_master_no_tx
[29 Jun 2009 10:10] Zhenxing He
Hi, Philip

Semi-sync can be off if the slave does not send reply in time, you can specify a longer timeout value (default is 2s) to reduce this. 

If you're testing with debug build, please consider using release build with optimization turned on, which can improve the performance and reduce timeouts a lot according to my test.
[29 Jun 2009 11:52] Philip Stoev
The cause of this bug may be in the confusion as to the dimensions of the rpl_semi_sync_master_timeout variable. It is supposed to be in msec, but the default value as defined in semisync_master_plugin.cc is 10, meaning 10 mesec, or 0.01 sec, which causes numerous silent timeouts.
[30 Jun 2009 8:26] Zhenxing He
Right, I think the default should be at least 1000 (1s).
[7 Jul 2009 4:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78074

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 2009 7:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78076

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 2009 9:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78091

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 2009 10:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78104

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the default timeout to 10000ms (10 seconds),
      and also add log information when semi-sync on master switches
      on or off.
[16 Jul 2009 9:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78816

2827 He Zhenxing	2009-07-16
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the default timeout to 10000ms (10 seconds),
      and also add log information when semi-sync on master switches
      on or off.
[23 Jul 2009 10:24] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version source revid:zhenxing.he@sun.com-20090716085959-qtdbjusms4zo136c) (merge vers: 5.4.4-alpha) (pib:11)
[3 Aug 2009 18:43] Paul DuBois
No changelog entry needed. Does not appear in any released version.
[26 Sep 2009 4:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84703

3108 He Zhenxing	2009-09-26
      Backporting WL#4398 WL#1720
      Backporting BUG#44058 BUG#42244 BUG#45672 BUG#45673
      Backporting BUG#45819 BUG#45973 BUG#39012
[27 Oct 2009 9:49] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091027094604-9p7kplu1vd2cvcju) (version source revid:zhenxing.he@sun.com-20091026140226-uhnqejkyqx1aeilc) (merge vers: 6.0.14-alpha) (pib:13)
[27 Oct 2009 23:14] Paul DuBois
Noted in 6.0.14 changelog.

The default value of rpl_semi_sync_master_timeout was changed from
10ms to 10000ms (10 seconds) to reduce the incidence of timeouts. In
addition, the master now logs when semisynchronous replication
switches on or off.
[27 Oct 2009 23:19] Paul DuBois
Setting report to NDI pending push into 5.5.x.
[12 Nov 2009 8:22] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091110093229-0bh5hix780cyeicl) (version source revid:alik@sun.com-20091027095744-rf45u3x3q5d1f5y0) (merge vers: 5.5.0-beta) (pib:13)
[12 Nov 2009 12:23] Jon Stephens
Also documented in the 5.5.0 changelog; closed.
[18 Dec 2009 15:45] Paul DuBois
Removed 5.5.0 changelog entry. In 5.5, semisync replication first appears in 5.5.0, so this bug affects no 5.5.x releases.