Bug #73039 relay_log_recovery KO when CHANGE MASTER without FILE and SQL_THREAD not started
Submitted: 18 Jun 2014 14:15 Modified: 29 Oct 2014 16:25
Reporter: Jean-François Gagné Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.6.19, 5.6.20, 5.6.22 OS:Any
Assigned to: CPU Architecture:Any
Tags: Crash-safe replication

[18 Jun 2014 14:15] Jean-François Gagné
Description:
If, on a empty database:
- CHANGE MASTER TO without a MASTER_FILE and MASTER_POS is used,
- the IO_THREAD is started WITHOUT starting the SQL_THREAD,
- MySQL crashed,
- MySQL is restarted with relay_log_recovery = 1.

Crash recovery will not work as expected:
- the IO_THREAD position will NOT be initialized to the SQL_THREAD position,
- SQL_THREAD position will NOT be initialized to the new relay log.

It looks like, when the SQL_THREAD does not have a Relay_Master_Log_File, relay_log_recovery does not work.

How to repeat:
Initialize 2 empty database and make one the slave of the other using CHANGE MASTER TO without a MASTER_FILE and MASTER_POS.  Configure the slave with relay_log_recovery = 1.

On the master, run:
cat << EOF | sudo mysql
CREATE DATABASE test_jfg;
CREATE TABLE test_jfg.t1 (
  userid   BIGINT PRIMARY KEY,
  password BIGINT DEFAULT NULL);
EOF
seq -f "INSERT into test_jfg.t1 VALUES (%.0f, NULL);" 1 1000000 | sudo mysql

While the INSERTs are running on the master, start the slave IO_THREAD but not the SQL_THREAD (START SLAVE IO_THREAD), and kill -9 mysqld.

When the slave will be restarted, DUPLICATE KEY error will happen.

Suggested fix:
Manage the case where Relay_Master_Log_File is empty when doing relay log recovery.
[19 Jun 2014 11:17] MySQL Verification Team
Hello Jean,

Thank you for the bug report and test case.
Verified as described.

Thanks,
Umesh
[19 Jun 2014 11:18] MySQL Verification Team
// How to steps for dev's

[root@cluster-repo mysql-advanced-5.6.19]# more 73039M.cnf 73039S.cnf
::::::::::::::
73039M.cnf
::::::::::::::
#Master
[mysqld]
user = root
pid-file = /tmp/master/mysqld.pid
socket = /tmp/master/mysql.sock
port = 3306
basedir = /data/ushastry/server/mysql-advanced-5.6.19
datadir = /tmp/master
tmpdir = /tmp
log-bin = master-bin
log-bin-index = master-bin.index
server-id = 1

::::::::::::::
73039S.cnf
::::::::::::::
#Slave
[mysqld]
user = root
pid-file = /tmp/slave/mysqld.pid
socket = /tmp/slave/mysql.sock
port = 3307
basedir = /data/ushastry/server/mysql-advanced-5.6.19
datadir = /tmp/slave
tmpdir = /tmp
relay-log-index = slave-relay-bin.index
relay-log = slave-relay-bin
server-id = 2
relay-log-recovery=1

// Initialize
scripts/mysql_install_db --defaults-file=./73039M.cnf
scripts/mysql_install_db --defaults-file=./73039S.cnf

// Start

[root@cluster-repo mysql-advanced-5.6.19]# bin/mysqld_safe --defaults-file=./73039M.cnf &
[root@cluster-repo mysql-advanced-5.6.19]# bin/mysqld_safe --defaults-file=./73039S.cnf &

[root@cluster-repo mysql-advanced-5.6.19]# ps aux|grep mysqld
root      1966  0.0  0.0 106228  1400 pts/1    S    07:11   0:00 /bin/sh bin/mysqld_safe --defaults-file=./73039M.cnf
root      2159  0.2  7.3 1167832 451080 pts/1  Sl   07:11   0:01 /data/ushastry/server/mysql-advanced-5.6.19/bin/mysqld --defaults-file=./73039M.cnf --basedir=/data/ushastry/server/mysql-advanced-5.6.19 --datadir=/tmp/master --plugin-dir=/data/ushastry/server/mysql-advanced-5.6.19/lib/plugin --user=root --log-error=/tmp/master/cluster-repo.err --pid-file=/tmp/master/mysqld.pid --socket=/tmp/master/mysql.sock --port=3306
root      2185  0.0  0.0 106228  1396 pts/1    S    07:13   0:00 /bin/sh bin/mysqld_safe --defaults-file=./73039S.cnf
root      2390  0.2  7.3 1108360 450444 pts/1  Sl   07:13   0:01 /data/ushastry/server/mysql-advanced-5.6.19/bin/mysqld --defaults-file=./73039S.cnf --basedir=/data/ushastry/server/mysql-advanced-5.6.19 --datadir=/tmp/slave --plugin-dir=/data/ushastry/server/mysql-advanced-5.6.19/lib/plugin --user=root --log-error=/tmp/slave/cluster-repo.err --pid-file=/tmp/slave/mysqld.pid --socket=/tmp/slave/mysql.sock --port=3307
root      2475  0.0  0.0 103256   824 pts/3    S+   07:20   0:00 grep mysqld

// Setup replication

* Create replication use on master
* Point slave to master

slave>CHANGE MASTER TO
     MASTER_HOST='127.0.0.1',
     MASTER_PORT=3306,
     MASTER_USER='repl',
     MASTER_PASSWORD='slavepass';

slave>START SLAVE IO_THREAD;
 
// Start provisioning data on master

cat << EOF | sudo bin/mysql -u root -S /tmp/master/mysql.sock
CREATE DATABASE test_jfg;
CREATE TABLE test_jfg.t1 (
  userid   BIGINT PRIMARY KEY,
  password BIGINT DEFAULT NULL);
EOF

seq -f "INSERT into test_jfg.t1 VALUES (%.0f, NULL);" 1 1000000 | sudo bin/mysql -u root -S /tmp/master/mysql.sock

// Kill slave ( Not required to wait for the master to complete provisioning)

[root@cluster-repo mysql-advanced-5.6.19]# kill -9 2185 2390

// Restart slave 

[root@cluster-repo mysql-advanced-5.6.19]# bin/mysqld_safe --defaults-file=./73039S.cnf &
[1] 2486
[root@cluster-repo mysql-advanced-5.6.19]# 140621 07:23:12 mysqld_safe Logging to '/tmp/slave/cluster-repo.err'.
140621 07:23:13 mysqld_safe Starting mysqld daemon with databases from /tmp/slave

## Check slave status

SHOW SLAVE STATUS\G
..
..
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '%-test-' for key 'PRIMARY'' on query. Default database: 'mysql'. Query: 'INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;'
[10 Sep 2014 15:38] MySQL Verification Team
Hello Jean-François,

Imho that duplicate key error in my earlier note was mostly due to "bin-log" generated during initial setup on master:

// Cross checked for tmp_db, and confirmed that it was generated at the time of initialization master post mysql_install_db call.

bin/mysqlbinlog /tmp/master/master-bin.000001 |grep tmp_db
CREATE TEMPORARY TABLE tmp_db LIKE db;
INSERT INTO tmp_db VALUES ('%','test','','Y','Y','Y','Y','Y','Y','N','Y','Y','Y','Y','Y','Y','Y','Y','N','N','Y','Y');
INSERT INTO tmp_db VALUES ('%','test\_%','','Y','Y','Y','Y','Y','Y','N','Y','Y','Y','Y','Y','Y','Y','Y','N','N','Y','Y');
INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;
DROP TEMPORARY TABLE `tmp_db` /* generated by server */

Then,instructed slave to connect without "MASTER_FILE and MASTER_POS", which means it is replicating from the initial position:

2014-09-12 10:52:53 31297 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='127.
0.0.1', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2014-09-12 10:53:00 31297 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER a
nd PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2014-09-12 10:53:00 31297 [Note] Slave I/O thread: connected to master 'repl@127.0.0.1:3306',replication started in log 'FIRST' at position 4

Any chance you can rerun your steps and please share the exact duplicate key error that you are seeing? Also, how did you "initialize" master instance, was the bin log enabled when mysql_install_db used? If you are seeing "Last_SQL_Error: Error 'Duplicate entry '%-test-' for key 'PRIMARY'' on query. Default database: 'mysql'. Query: 'INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;'" then it is expected behavior because then the duplicate key is valid because it is conflicting with existing records in slave's mysql.db.
if you are seeing DUP key issue in real table i.e in test_jfg.t1, then could you please share exact steps, conf file etc?

Btw, I'm not seeing the DUP key error with below steps both in real table and in mysql.db(i.e when log-bin disabled when initializing master with mysql_install_db):

[root@cluster-repo mysql-advanced-5.6.19]# scripts/mysql_install_db --basedir=/data/ushastry/server/mysql-advanced-5.6.19 --datadir=/tmp/master
[root@cluster-repo mysql-advanced-5.6.19]# scripts/mysql_install_db --basedir=/data/ushastry/server/mysql-advanced-5.6.19 --datadir=/tmp/slave

## relay-log-recovery is not enabled on slave initially

[root@cluster-repo mysql-advanced-5.6.19]#
[root@cluster-repo mysql-advanced-5.6.19]# bin/mysqld_safe --defaults-file=./73039M.cnf &
[1] 31747
[root@cluster-repo mysql-advanced-5.6.19]# 140912 11:10:36 mysqld_safe Logging to '/tmp/master/cluster-repo.err'.
140912 11:10:37 mysqld_safe Starting mysqld daemon with databases from /tmp/master

[root@cluster-repo mysql-advanced-5.6.19]# vi 73039S.cnf
[root@cluster-repo mysql-advanced-5.6.19]# bin/mysqld_safe --defaults-file=./73039S.cnf &
[2] 31967
[root@cluster-repo mysql-advanced-5.6.19]# 140912 11:11:11 mysqld_safe Logging to '/tmp/slave/cluster-repo.err'.
140912 11:11:11 mysqld_safe Starting mysqld daemon with databases from /tmp/slave

#### Setup 2 new instances, setup replication, CHANGE MASTER TO without a MASTER_FILE and MASTER_POS

Slave>

mysql> CHANGE MASTER TO
    ->      MASTER_HOST='127.0.0.1',
    ->      MASTER_PORT=3306,
    ->      MASTER_USER='repl',
    ->      MASTER_PASSWORD='slavepass';
Query OK, 0 rows affected (0.03 sec)

// Start provisioning on master
Master>

at << EOF | sudo bin/mysql -u root -S /tmp/master/mysql.sock
CREATE DATABASE test_jfg;
CREATE TABLE test_jfg.t1 (
  userid   BIGINT PRIMARY KEY,
  password BIGINT DEFAULT NULL);
EOF

seq -f "INSERT into test_jfg.t1 VALUES (%.0f, NULL);" 1 1000000 | sudo bin/mysql -u root -S /tmp/master/mysql.sock

// 
mysql> START SLAVE IO_THREAD;
Query OK, 0 rows affected (0.00 sec)
==========================================

// Kill -9 slave, start slave with relay-log-recovery=1

2014-09-12 11:18:15 32421 [Note] /data/ushastry/server/mysql-advanced-5.6.19/bin/mysqld: ready for connections.
Version: '5.6.19-enterprise-commercial-advanced'  socket: '/tmp/slave/mysql.sock'  port: 3307  MySQL Enterprise Server - Advanced Edition (Commercial)
2014-09-12 11:18:15 32421 [Warning] Slave SQL: If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2014-09-12 11:18:15 32421 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4

Repeated provisioning for number of times but no duplicate key error reported..

Thanks,
Umesh
[11 Sep 2014 13:44] Jean-François Gagné
Hi Umesh,

this is how I reproduce this bug:

1. Setup 2 MySQL, one to be a master (with binlogs), the other to be a slave;
2. On the slave: "STOP SLAVE; DROP DATABASE IF EXISTS test_jfg; RESET SLAVE ALL;";
3. On the master: "DROP DATABASE IF EXISTS test_jfg; RESET MASTER;";
4. On the master: "CREATE DATABASE test_jfg;";
5. On the slave: "CHANGE MASTER TO MASTER_HOST='<master_hostname>', MASTER_USER='<user>', MASTER_PASSWORD='<password>'; START SLAVE IO_THREAD;";
6. On the slave: "pkill -9 mysqld";
7. On the slave: start mysqld with relay_log_recovery = 1.

And you will have an error in a "SHOW SLAVE STATUS":

Last_SQL_Error: Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'CREATE DATABASE test_jfg'

Test done with MySQL 5.6.20.

JFG
[12 Sep 2014 8:04] MySQL Verification Team
Thank you for the feedback, steps, we are unable to repeat as your reported.
Please could provide conf details from both master/slave, all the relay logs for further investigations?

I'm sharing the steps performed at our end for this in 73039.txt.
[12 Sep 2014 8:04] MySQL Verification Team
test results

Attachment: 73039.txt (text/plain), 16.12 KiB.

[16 Sep 2014 19:16] Sveta Smirnova
Jean-François,

please send us output of SHOW SLAVE STATUS\G, taken before and after you `kill -9` slave server.
[18 Sep 2014 16:32] Jean-François Gagné
Hi Umesh

If I use exactly your commands, I do not reproduce the bug.

But if I start the slave adding those options, I do:

--master-info-repository=TABLE --relay-log-info-repository=TABLE

Thanks,

JFG
[19 Sep 2014 6:15] MySQL Verification Team
Thank you Jean for the feedback, I'm able to reproduce the issue now. 
Inclusion of --master-info-repository=TABLE --relay-log-info-repository=TABLE did the trick and able to see the DDL in 2nd 4th relay logs(which triggered Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'CREATE DATABASE test_jfg')

Thanks,
Umesh
[19 Sep 2014 6:16] MySQL Verification Team
test results

Attachment: 73039_modified.txt (text/plain), 17.51 KiB.

[27 Oct 2014 17:59] David Moss
Thanks for your report, the following was added to the 5.6.22 and 5.7.6 release notes with commit 4526:

When restarting MySQL with relay_log_recovery enabled to recover from a crash, if the SQL thread had never been started, the position from which to start recovery was not correctly initialized because Relay_Master_Log_File was missing. This fix ensures that in such a situation each of the relay logs, starting from the first relay log file, is searched for a rotate event from the master, which specifies where replication started from. This rotate event is then used to set the SQL thread's Relay_Master_Log_File and Relay_Log_Pos and recovery continues as normal.
[10 Dec 2014 14:43] Laurynas Biveinis
$ bzr log -r 6204
------------------------------------------------------------
revno: 6204
committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com>
branch nick: Bug19021091_mysql-5.6
timestamp: Thu 2014-10-16 11:22:41 +0530
message:
  Bug#19021091: RELAY_LOG_RECOVERY FAILS WHEN CHANGE MASTER
  WITHOUT FILE AND SQL_THD NOT STARTED
  
  Problem:
  ========
  If, on an empty database:
  - CHANGE MASTER TO without a MASTER_FILE and MASTER_POS is
    used
  - the IO_THREAD is started WITHOUT starting the SQL_THREAD
  - MySQL crashed
  - MySQL is restarted with relay_log_recovery = 1.
  
  Crash recovery will not work as expected:
  - The IO_THREAD position will NOT be initialized to the
    SQL_THREAD position,
  - SQL_THREAD position will NOT be initialized to the new
    relay log.
  
  It looks like, when the SQL_THREAD does not have a
  Relay_Master_Log_File, relay_log_recovery does not work.
  
  Analysis:
  ========
  When relay-log-recovery is set existing relay logs are
  ignored and IO_THREAD positions will be initialized to
  SQL_THREAD's positions and SQL_THREAD positions will be set
  to newly generated relay log. The old relay logs will be
  purged by the normal purge mechanism. There can be a special
  case where sql thread has never started all, in such case
  SQL_THREAD's positions will not be initialized. Existing
  recovery mechanism is tightly coupled with
  Relay_Master_Log_File's value if the file name is not set
  recovery will not happen.
  
  Fix:
  ===
  During recovery if Relay_Master_Log_File is not set open the
  relaylog index file. Read the relay logs from the first
  relay log parse each of these files and look for Rotate
  event from the master which will specify from where
  replication has started. Extract master_log_file and
  master_log_pos from master's rotate event and set it as
  SQL_THREAD's Relay_Master_Log_File and position.

$ bzr log -r 6205
------------------------------------------------------------
revno: 6205
committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com>
branch nick: Bug19021091_mysql-5.6
timestamp: Thu 2014-10-16 14:06:37 +0530
message:
  Bug#19021091: RELAY_LOG_RECOVERY FAILS WHEN CHANGE MASTER
  WITHOUT FILE AND SQL_THD NOT STARTED
  
  Fixing a post push compilation issue.