Bug #45819 Semisynchronous replication: unstable operation with row-based binlog
Submitted: 29 Jun 10:20 Modified: 12 Nov 13:23
Reporter: Philip Stoev
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:5.4 OS:Any
Assigned to: Zhenxing He Target Version:5.4+
Triage: Triaged: D3 (Medium)

[29 Jun 10:20] Philip Stoev
Description:
When executing a very simple, non-concurrent workload against the master with row-based
binlog, semisynchronous replication repeatedly switches ON and OFF, resulting in numerous
transactions that were replicated asynchronously.

This is not observed when using statement-based binlogging. The patch from

http://lists.mysql.com/commits/77401

was used to get the test to proceed that far.

There is how SHOW STATUS looks like after a short test:

+--------------------------------------------+---------+
| Variable_name                              | Value   |
+--------------------------------------------+---------+
| Rpl_semi_sync_master_clients               | 1       |
| Rpl_semi_sync_master_net_avg_wait_time     | 0       |
| Rpl_semi_sync_master_net_wait_time         | 0       |
| Rpl_semi_sync_master_net_waits             | 0       |
| Rpl_semi_sync_master_no_times              | 5       |
| Rpl_semi_sync_master_no_tx                 | 317     |
| Rpl_semi_sync_master_status                | ON      |
| Rpl_semi_sync_master_timefunc_failures     | 0       |
| Rpl_semi_sync_master_tx_avg_wait_time      | 4047    |
| Rpl_semi_sync_master_tx_wait_time          | 5374672 |
| Rpl_semi_sync_master_tx_waits              | 1328    |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0       |
| Rpl_semi_sync_master_wait_sessions         | 0       |
| Rpl_semi_sync_master_yes_tx                | 1328    |
| Rpl_semi_sync_slave_status                 | OFF     |
+--------------------------------------------+---------+

Approximately 25% of the transactions were replicated asynchronously, and there is no
indication in the logs as to why this happened. Better debug output in the logs is a must
for testing and troubleshooting semisynchronous replication.

The machine is a four-core Xeon, fully capable of supporting a single client on the
master and the slave with plenty of memory and CPU left over.

How to repeat:
Branch the mysql-test-extra-6.0 tree, cd to mysql-test/gentest and then run:

$ perl runall.pl --basedir=/build/bzr/azalea-bugfixing
--grammar=conf/replication_simple.yy --rpl_mode=row --mem
--mysqld=--plugin-dir=/build/bzr/azalea-bugfixing/plugin/semisync/.libs
--mysqld=--plugin-load=rpl_semi_sync_master=libsemisync_master.so:rpl_semi_sync_slave=libsemisync_slave.so
--validators= --mysqld=--rpl_semi_sync_master_enabled=1
--mysqld=--rpl_semi_sync_slave_enabled=1 --duration=180 --threads=1

This is going to start a very simple workload. If you run SHOW STATUS in a separate
console, you will observe changes to the Rpl_semi_sync_master_status variable and and
increase Rpl_semi_sync_master_no_tx
[29 Jun 12:10] Zhenxing He
Hi, Philip

Semi-sync can be off if the slave does not send reply in time, you can specify a longer
timeout value (default is 2s) to reduce this. 

If you're testing with debug build, please consider using release build with optimization
turned on, which can improve the performance and reduce timeouts a lot according to my
test.
[29 Jun 13:52] Philip Stoev
The cause of this bug may be in the confusion as to the dimensions of the
rpl_semi_sync_master_timeout variable. It is supposed to be in msec, but the default
value as defined in semisync_master_plugin.cc is 10, meaning 10 mesec, or 0.01 sec, which
causes numerous silent timeouts.
[30 Jun 10:26] Zhenxing He
Right, I think the default should be at least 1000 (1s).
[7 Jul 6:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78074

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 9:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78076

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 11:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78091

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the timeout precision to seconds, and also add
      log information when semi-sync on master switches on or off.
[7 Jul 12:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78104

2804 He Zhenxing	2009-07-07
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the default timeout to 10000ms (10 seconds),
      and also add log information when semi-sync on master switches
      on or off.
[16 Jul 11:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/78816

2827 He Zhenxing	2009-07-16
      BUG#45819 Semisynchronous replication: unstable operation with row-based binlog
      
      The master semi-sync wait timeout values was in milliseconds, and
      the defualt was 10ms, which caused a lot timeouts, and there were
      no log information about the semi-sync status changes because of
      timeout.
      
      This patch change the default timeout to 10000ms (10 seconds),
      and also add log information when semi-sync on master switches
      on or off.
[23 Jul 12:24] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090723102221-ps4uaphwbxzj8p0q) (version
source revid:zhenxing.he@sun.com-20090716085959-qtdbjusms4zo136c) (merge vers:
5.4.4-alpha) (pib:11)
[3 Aug 20:43] Paul DuBois
No changelog entry needed. Does not appear in any released version.
[26 Sep 6:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84703

3108 He Zhenxing	2009-09-26
      Backporting WL#4398 WL#1720
      Backporting BUG#44058 BUG#42244 BUG#45672 BUG#45673
      Backporting BUG#45819 BUG#45973 BUG#39012
[27 Oct 10:49] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091027094604-9p7kplu1vd2cvcju) (version
source revid:zhenxing.he@sun.com-20091026140226-uhnqejkyqx1aeilc) (merge vers:
6.0.14-alpha) (pib:13)
[28 Oct 0:14] Paul DuBois
Noted in 6.0.14 changelog.

The default value of rpl_semi_sync_master_timeout was changed from
10ms to 10000ms (10 seconds) to reduce the incidence of timeouts. In
addition, the master now logs when semisynchronous replication
switches on or off.
[28 Oct 0:19] Paul DuBois
Setting report to NDI pending push into 5.5.x.
[12 Nov 9:22] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091110093229-0bh5hix780cyeicl) (version
source revid:alik@sun.com-20091027095744-rf45u3x3q5d1f5y0) (merge vers: 5.5.0-beta)
(pib:13)
[12 Nov 13:23] Jon Stephens
Also documented in the 5.5.0 changelog; closed.