MySQL Bugs: #67246: --relay-log-recovery required for crash-proof slave

Bug #67246	--relay-log-recovery required for crash-proof slave
Submitted:	16 Oct 2012 1:41	Modified:	7 Feb 2013 10:06
Reporter:	Gerry Narvaja	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S2 (Serious)
Version:	5.6.7-rc	OS:	Linux (ubuntu 12.04 server)
Assigned to:	Jon Stephens	CPU Architecture:	Any
Tags:	crash recovery, replication

Description:
Having set up the options to record master info and relay log info into tables and crashing the slave, upon restarting it, it tries to start replication from the wrong position with the corresponding errors or re-playing already applied statements.

If the statements being replayed are INSERTs into tables with autoincrement PKs, there will be no error and many duplicate rows, creating a serious consistency problem.

How to repeat:
1. Set up 2 VMs as master and slave.

2. Make sure these options are set in the my.cnf file:
   master-info-repository=TABLE
   relay-log-info-repository=TABLE

3. Start both servers and make sure that replication is running properly.

4. Compare SHOW MASTER STATUS in the master, with SHOW SLAVE STATUS and SELECT * FROM mysql.slave_master_info\G in the slave.

The 'Position' column in SHOW MASTER STATUS should show match 'Read_Master_Log_Pos' from SHOW SLAVE STATUS and 'Master_log_pos' from the SELECT statement. However, the Master_log_pos values doesn't match. If the server crashes, it'll try to restart from the position in the mysql.slave_master_info table replaying the wrong statements.

Example:

(make sure replication is caught up)
1. On the slave: STOP SLAVE;
2. On the master: RESET MASTER; 
3. Master: SHOW MASTER STATUS;
show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |      120 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
4. Slave: CHANGE MASTER TO master_log_file = 'mysql-bin.000001', master_log_pos = 120;
5. Slave: SHOW SLAVE STATUS\G
...
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 120
...
6. Slave: SELECT Master_log_name, Master_log_pos FROM mysql.slave_master_info;
+------------------+----------------+
| Master_log_name  | Master_log_pos |
+------------------+----------------+
| mysql-bin.000001 |            120 |
+------------------+----------------+
1 row in set (0.00 sec)
7. Master: Transactional statement (doesn't matter which) Example: insert into tbl (st, nbr) values ('advance the logs', 100);
8. Master: show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |      414 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
9. Slave: SHOW SLAVE STATUS\G
...
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 414
...
10. Slave: SELECT Master_log_name, Master_log_pos FROM mysql.slave_master_info;
+------------------+----------------+
| Master_log_name  | Master_log_pos |
+------------------+----------------+
| mysql-bin.000001 |            120 | <<== WRONG VALUE!!
+------------------+----------------+
1 row in set (0.00 sec)

Forgot to do START SLAVE after step 6. of the example (Oops!)

Try setting sync_master_info to 1.

Gerry,

Can you try to repeat the bug after setting sync master info to 1 ??? Just as Davi wrote ...

Please, let us know the results with server running with a slightly different configuration.

I'm currently traveling and I don't have access to my test setup. I asked the developer that originally discovered the bug to test it and he hasn't come back w/ the answer yet. 

Since I'm in Moscow, time difference is not working in my favor this time.

I will have the answer this week one way or another. Thank you for your patience.

Hi Gary,

Great that you are testing some of the new features in 5.6.

Can you, please, set relay_log_recovery and check if the problem still happens?

Setting the relay_log_recovery is required step to make a slave crash-safe.
After a restart, the slave will throw away the relay logs and the coordinates
in the master.info will be overwritten by those stored in the relay_log.info.

If you want to see the coordinates in the mysql.slave_master_info being incremented,you need to set sync_master_info to a value greater than 0. The value is number of events retrieved from the master after which the mysql.slave_master.info is updated. The default is zero which means that the table is not updated unless you execute one of the following commands: CHANGE MASTER TO, STOP SLAVE or RESET SLAVE.

Setting sync_master_info to 1, as suggested by Davi, will make the table be updated after each event. Notice however that this not really necessary to make a slave-crash safe.

Setting sync_master_info to a value != 0 worked in updating the slave_master_info table properly.

However, if I understood the documentation correctly, it _also_ implies that the master.info file would be synchronized to disk, which would have some performance impact ... or did this change in version 5.6 and the docs are slightly behind? See http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html#sysvar_sync_master_i...

Notice that the documentation doesn't say anything about the slave_master_info table, although it obviously has an effect on it.

I will test the relay_log_recovery option on Monday and let you know what I find.

Thank you for the additional information.

Setting up the relay_log_recovery to ON worked. 

The documentation should be clear and clarify that when using TABLE for master-info-repository and relay-log-info-repository this variable should be ON to recover properly after a crash. 

When using the default file, it should be OFF since there is no guarantees that the information on the file is accurate.

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

Fixed in mysqldoc rev 34212, Closed.