Bug #66626 user defined variables replicated to same server and stop slave sql_thread hangs
Submitted: 30 Aug 2012 20:02 Modified: 4 Feb 2013 5:08
Reporter: Dan Ragle Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.1.61-log OS:Linux (RHEL 6.3)
Assigned to: CPU Architecture:Any
Tags: user-defined variables wrong server-id stop slave sql_thread hangs

[30 Aug 2012 20:02] Dan Ragle
Description:
In a circular master/slave environment (two servers that replicate back and forth to one another) setting a user-defined variable on one server will replicate to the other and then BACK to the original with the SLAVE's server-id (and thus the var is set in the slave thread on the original server as well). In certain situations this appears to cause a STOP SLAVE SQL_THREAD statement on the original server to hang until another statement is received from the second server; and in the case of a general STOP SLAVE statement may cause the statement to hang indefinitely (since the IO slave is stopped and thus no further IO statements are received from the second server). 

The STOP SLAVE SQL_THREAD part of this bug sounds a lot like bug 53985, I wasn't sure if it was the same beast or not. Also I'm requesting if that particular bug fix can be back ported to 5.1. It's serious for us since it potentially causes STOP SLAVE as well as a regular server shutdown to hang, requiring us to force-kill the mysql thread to stop the server.

In the example I give below, the replication mode is MIXED. 

How to repeat:
1.) Setup two servers, both replicating to one another. For this example I'll call them "A" and "B". "A" has served-id 1 and "B" has server-id 2.

2.) Create TESTTABLE with an int (primary key) and a varchar as a MyISAM

CREATE TABLE `TESTTABLE` (
    `testthis` int(3) unsigned NOT NULL,
    `testthat` varchar(10) NOT NULL,
    PRIMARY KEY (`testthis`)) 
ENGINE=MyISAM DEFAULT CHARSET=latin1

3.) On "B", insert two new records into the table, using two separate statements (i.e.):

mysql> insert into TESTTABLE (testthis,testthat) values (1,'first');
Query OK, 1 row affected (0.00 sec)

mysql> insert into TESTTABLE (testthis,testthat) values (2,'second');
Query OK, 1 row affected (0.00 sec)

4.) On "A", set a user-defined variable and then use it to insert a third record into the test table:

mysql> set @mine = 'third';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into TESTTABLE (testthis,testthat) values (3,@mine);
Query OK, 1 row affected (0.00 sec)

5.) At this point, if you view the relay log on "A" you see something like this:

# at 3179
#120830 15:39:17 server id 2  end_log_pos 5644  Query   thread_id=12    exec_time=0     error_code=0
SET TIMESTAMP=1346355557/*!*/;
insert into TESTTABLE (testthis,testthat) values (1,'first')
/*!*/;
# at 3304
#120830 15:39:26 server id 2  end_log_pos 5770  Query   thread_id=12    exec_time=0     error_code=0
SET TIMESTAMP=1346355566/*!*/;
insert into TESTTABLE (testthis,testthat) values (2,'second')
/*!*/;
# at 3430
#120830 15:40:04 server id 2  end_log_pos 5812  User_var
SET @`mine`:=_latin1 0x7468697264 COLLATE `latin1_swedish_ci`/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Notice that the set @mine statement replicated to "B" which processed it (and the subsequent insert). It ("B") then wrote its own set @mine statement into its binlog, identified with its own server-id (2) and then sent it back to "A" to be processed, now identified as server-id 2 (so "A" did in fact process it, as is noted in "A"'s relay log above). Is this desired behavior? 

6.) If you now issue a STOP SLAVE SQL_THREAD on "A", it will hang. You can force it to clear by sending through ANOTHER statement from "B"; for example, if you submit this on "B":

mysql> insert into TESTTABLE (testthis,testthat) values (4,'fourth');
Query OK, 1 row affected (0.00 sec)

the STOP SLAVE SQL_THREAD on "A" will complete and return as soon as the slave executes this new statement. 

Suggested fix:
Don't know what the right fix is, as I'm not sure if the reset of the server-id on the user-defined variables is desired behavior or not. 

But whatever the fix is it would be great if it could be backported to 5.1. :) We don't have viable path to upgrade to 5.5 anytime soon. 

Thanks!
[31 Aug 2012 19:54] Sveta Smirnova
Thank you for the report.

I can not repeat described behavior with current version 5.1.65. Please try this version and inform us if this problem still repeatable in your environment.
[1 Sep 2012 19:25] Dan Ragle
Hi, thanks for the reply. I updated my test servers to 5.1.65 from the mysql.com site, and the original problem appears to have gone away. HOWEVER, in its place I noticed this:

After step 4 in the above test, the RELAY log on "B" has this:

# at 488
#120901 14:59:56 server id 1  end_log_pos 636   User_var
SET @`mine`:=_latin1 0x7468697264 COLLATE `latin1_swedish_ci`/*!*/;
# at 530
#120901 14:59:56 server id 1  end_log_pos 759   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525996/*!*/;
insert into TESTTABLE (testthis,testthat) values (3,@mine)
/*!*/;

but the BINARY log on "B" has this:

# at 616
#120901 14:59:07 server id 2  end_log_pos 741   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525947/*!*/;
insert into TESTTABLE (testthis,testthat) values (1,'first')
/*!*/;
# at 741
#120901 14:59:14 server id 2  end_log_pos 867   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525954/*!*/;
insert into TESTTABLE (testthis,testthat) values (2,'second')
/*!*/;
# at 867
#120901 14:59:56 server id 1  end_log_pos 990   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525996/*!*/;
insert into TESTTABLE (testthis,testthat) values (3,@mine)
/*!*/;

I.E., notice that now the setting of @mine doesn't actually get into the binary log of "B" at all. Thus the original problem is avoided, since the stray set @mine statement never makes it back to the "A" server. But wouldn't that cause a problem for any other slaves of "B" (e.g., a "C" server that is a slave of "B")? In my test environment I don't have a third server that I can setup as a slave of "B", but that is exactly what we do in our production environment and I'd be concerned that "C" would not replicate properly in that scenario.
[3 Sep 2012 18:52] Sveta Smirnova
Thank you for the feedback.

Not, slave C would not be affected: it replicates those changes which are in binary log of B. Relay logs are for B only, they are not used by C.

Closing as "Can't repeat" since bug is not repeatable anymore.
[3 Sep 2012 22:45] Dan Ragle
I'm not following that. As you stated, a "C" slave server would only replicate those changes which are in binary log of B. And since there is no statement in the "B" binary log to set the value for @mine, how could the third statement (which relied on @mine) possibly replicate properly? I.E., again, here are the three statements, in order, that were in the binary log of "B" following the above test with 5.1.65:

# at 616
#120901 14:59:07 server id 2  end_log_pos 741   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525947/*!*/;
insert into TESTTABLE (testthis,testthat) values (1,'first')
/*!*/;
# at 741
#120901 14:59:14 server id 2  end_log_pos 867   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525954/*!*/;
insert into TESTTABLE (testthis,testthat) values (2,'second')
/*!*/;
# at 867
#120901 14:59:56 server id 1  end_log_pos 990   Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1346525996/*!*/;
insert into TESTTABLE (testthis,testthat) values (3,@mine)
/*!*/;

Cheers!
[4 Sep 2012 17:47] Sveta Smirnova
Thank you for the feedback.

Sorry, I forgot this is statement format. I re-tested this with 5.1.65 again and now I can repeat original problem, but not issue with lack of variable in the binary log.

So verifying original problem.
[28 Jan 2013 12:04] Sujatha Sivakumar
This issue was fixed as part BUG#13596613.
Fix is available from 5.1.66 and 5.5.28. (Issue does not occur in 5.6.)