MySQL Bugs: #36826: rpl_slave_status fails sporadically in pushbuild

Bug #36826	rpl_slave_status fails sporadically in pushbuild
Submitted:	20 May 2008 17:48	Modified:	29 Jul 2008 10:10
Reporter:	Sven Sandberg	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Any
Assigned to:	Sven Sandberg	CPU Architecture:	Any
Tags:	pushbuild, rpl_slave_status, sporadic, test failure

Description:
rpl_slave_status fails sporadically in pushbuild. The slave has stopped because of the following error in the SQL thread:

"Could not execute Delete_rows event on table mysql.user; Can't find record in 'user', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 986"

How to repeat:
Run rpl_slave_status a few times or see pushbuild.

Suggested fix:
.

Here's a result diff:

rpl.rpl_slave_status 'row'     [ fail ]

--- /home/sven/bk/debug-max/5.1-bugteam/mysql-test/suite/rpl/r/rpl_slave_status.result	2008-05-20 20:24:41.000000000 +0300
+++ /home/sven/bk/debug-max/5.1-bugteam/mysql-test/suite/rpl/r/rpl_slave_status.reject	2008-05-20 20:31:53.000000000 +0300
@@ -98,30 +98,31 @@
 flush privileges;
 stop slave;
 start slave;
-show slave status;
-Slave_IO_State	#
+"Failed while waiting for slave SQL to start"
+SHOW SLAVE STATUS;
+Slave_IO_State	Connecting to master
 Master_Host	127.0.0.1
 Master_User	rpl
-Master_Port	MASTER_MYPORT
+Master_Port	10080
 Connect_Retry	1
 Master_Log_File	master-bin.000001
-Read_Master_Log_Pos	#
-Relay_Log_File	#
-Relay_Log_Pos	#
+Read_Master_Log_Pos	1134
+Relay_Log_File	slave-relay-bin.000002
+Relay_Log_Pos	636
 Relay_Master_Log_File	master-bin.000001
 Slave_IO_Running	No
-Slave_SQL_Running	Yes
+Slave_SQL_Running	No
 Replicate_Do_DB	
 Replicate_Ignore_DB	
 Replicate_Do_Table	
 Replicate_Ignore_Table	
 Replicate_Wild_Do_Table	
 Replicate_Wild_Ignore_Table	
-Last_Errno	0
-Last_Error	
+Last_Errno	1032
+Last_Error	Could not execute Delete_rows event on table mysql.user; Can't find record in 'user', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 986
 Skip_Counter	0
-Exec_Master_Log_Pos	#
-Relay_Log_Space	#
+Exec_Master_Log_Pos	620
+Relay_Log_Space	1305
 Until_Condition	None
 Until_Log_File	
 Until_Log_Pos	0
@@ -133,10 +134,7 @@
 Master_SSL_Key	
 Seconds_Behind_Master	NULL
 Master_SSL_Verify_Server_Cert	No
-Last_IO_Errno	#
-Last_IO_Error	#
-Last_SQL_Errno	0
-Last_SQL_Error	
-drop table t1;
-delete from mysql.user where user='rpl';
-drop table t1;
+Last_IO_Errno	1045
+Last_IO_Error	error connecting to master 'rpl@127.0.0.1:10080' - retry-time: 1  retries: 10
+Last_SQL_Errno	1032
+Last_SQL_Error	Could not execute Delete_rows event on table mysql.user; Can't find record in 'user', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 986

mysqltest: In included file "./include/wait_for_slave_sql_to_start.inc": Result content mismatch

Aborting: rpl.rpl_slave_status failed in default mode. 
To continue, re-run with '--force'.
Stopping All Servers

Here's the end of the output from running mysqlbinlog on slave-bin.000001:

# at 476 (0x1dc)
# at 517 (0x205)
#080520 19:31:49 server id 1  end_log_pos 517 	Table_map: `test`.`t1` mapped to number 16
#080520 19:31:49 server id 1  end_log_pos 551 	Write_rows: table id 16 flags: STMT_END_F

BINLOG '
BQszSBMBAAAAKQAAAAUCAAAAABAAAAAAAAAABHRlc3QAAnQxAAEDAAE=
BQszSBcBAAAAIgAAACcCAAAQABAAAAAAAAEAAf/+AQAAAA==
'/*!*/;
# at 551 (0x227)
#080520 19:31:49 server id 1  end_log_pos 620 	Query	thread_id=3	exec_time=0	error_code=0
SET TIMESTAMP=1211304709/*!*/;
COMMIT
/*!*/;
# at 620 (0x26c)
#080520 19:31:50 server id 1  end_log_pos 688 	Query	thread_id=3	exec_time=0	error_code=0
SET TIMESTAMP=1211304710/*!*/;
BEGIN
/*!*/;
# at 688 (0x2b0)
# at 841 (0x349)
#080520 19:31:50 server id 1  end_log_pos 841 	Table_map: `mysql`.`user` mapped to number 2
#080520 19:31:50 server id 1  end_log_pos 986 	Delete_rows: table id 2 flags: STMT_END_F

BINLOG '
BgszSBMBAAAAmQAAAEkDAAAAAAIAAAAAAAAABW15c3FsAAR1c2VyACf+/v7+/v7+/v7+/v7+/v7+
/v7+/v7+/v7+/v7+/v7+/vz8/AMDAwND/rT+MP4p9wH3AfcB9wH3AfcB9wH3AfcB9wH3AfcB9wH3
AfcB9wH3AfcB9wH3AfcB9wH3AfcB9wH3AfcB9wH3AQICAgAAAAAA
BgszSBkBAAAAkQAAANoDAAAQAAIAAAAAAAEAJ///////AAAAAIAJMTI3LjAuMC4xA3JwbCkqNjI0
NDU5Qzg3RTUzNEExMjY0OThBREUxQjEyRTBDNjZFREEwMzVBMwEBAQEBAQEBAQEBAQEBAQEBAQEC
AQEBAQEBAQEBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==
'/*!*/;
# at 986 (0x3da)
#080520 19:31:50 server id 1  end_log_pos 1055 	Query	thread_id=3	exec_time=0	error_code=0
SET TIMESTAMP=1211304710/*!*/;
COMMIT
/*!*/;
# at 1055 (0x41f)
#080520 19:31:50 server id 1  end_log_pos 1134 	Query	thread_id=3	exec_time=0	error_code=0
SET TIMESTAMP=1211304710/*!*/;
flush privileges
/*!*/;
# at 1134 (0x46e)
#080520 19:31:55 server id 1  end_log_pos 1153 	Stop
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Note that the slave's binlog contains a Delete_rows_log_event that removes the record from mysql.user, but SHOW SLAVE STATUS indicates that a Delete_rows_log_event could not be executed since the row is not in the table. This hints that it could be the case that the slave tries to execute the event twice. Also note that SHOW SLAVE STATUS indicates that the slave IO thread had errors.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46998

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47040

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47056

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47057

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47095

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/47154

Is it possible that this bug is the cause of my problems with RBR?  Here is an excerpt of my post from http://forums.mysql.com/read.php?26,216227,216227#msg-216227

I can't stop and start the replication at all without breaking the 
replication. If I do "stop slave" and then "start slave" again it 
immediately has an error like this, seemingly for random tables that 
have been written to (not one in particular): 

Could not execute Write_rows event on table 
XXX; Duplicate entry 
'YYY' for key 'ZZZ', 
Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's 
master log binlog.000003, end_log_pos 704849 

I'm running the Linux (AMD64 / Intel EM64T) build of 5.1.25 on a xeon box...nothing that unusual 
about my environment or database. Only thing I could think of is that 
my app does some large insert delayed ... values ... statements.

Eric, this bug is probably unrelated to your problems. This bug is about a mistake in how our internal test was constructed. It is impossible to know what your problem is, because some information is missing. You'll have to debug your SQL yourself. The error you get indicates that a row was inserted on the master with the same key as a row on the slave. Try to figure out which row, and what could cause your error.

Hi Eric!

As Sven mentions, your problems have nothing to do with this bug. The error HA_ERR_FOUND_DUPP_KEY is generated by the storage engine whenever there is a duplicate key in the table.

Since you are not giving much information about the setup, it is hard to say exactly what is the reason for your failure, but if I were to make a guess, I suspect that for some reason, replication was cut of in the middle of a statement (or transaction). If transactional engines are used on both the master and the slave, this is not a problem, but if a non-transactional engine like MyISAM is used, applied rows on the slave will not be rolled back but replication will start from the beginning of the partially applied statement when you do a START SLAVE, causing it to stop again with a duplicate key error.

What you need to do is to figure out what rows are missing on the slave, add them manually, and then do:

mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
mysql> START SLAVE;

This will cause replication to skip one event or transaction, and start replication after that. This will skip the partially applied statement, so unless you have applied the missing rows manually, the databases will be out of sync.

Please note that you might have another situation, and also that you have to check that the databases are in sync after you have managed to get replication running again. In the future, please ask these kinds of questions to the Replication list (see http://lists.mysql.com/) since you are more likely to get a more constructive answer to your problem there.

Thank you guys for the response.  Obviously we shouldn't be perpetuating this discussion within this bug, but I actually did post this question to the replication list (almost a month ago now) and forum and haven't gotten a response.  It sounds to me that what I'm running into is a fundamental limitation of the current mysql RBR and not a bug.  

Mats, you are saying that with a non-transactional engine, it is impossible to guarantee that stopping and starting RBR won't break replication, right?  In my app, I'm doing lots of batched, large "insert values ..." statements that take the bulk of the time, so it's tough to be able to stop replication and not hit the middle of one of them.  Therefore, starting it back up I always have an inconsistency.  

If I am understanding this correctly, then this should really be documented better.  Or maybe more extreme, using RBR with non-transactional engines should generate at least some sort of warning?  I'm having a hard time thinking of scenario where someone would want to use it unless they were using transactional tables.

Thanks,
eric

Pushed into 5.1.27  (revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (version source revid:joerg@mysql.com-20080626161101-h5hsaluwtf05xdr2) (pib:3)

Discussed with Sven on IRC - test case fix only, to end-user changes to document.

Closed without further action.