MySQL Bugs: #52859: Replication fails to start after an SQL node restart

Bug #52859	Replication fails to start after an SQL node restart
Submitted:	15 Apr 2010 14:42	Modified:	12 May 2010 10:20
Reporter:	Carlos Pinera	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Replication	Severity:	S2 (Serious)
Version:	mysql-5.1-telco-7.1	OS:	Linux (RedHad 5.4 and CentOS)
Assigned to:	Martin Skold	CPU Architecture:	Any
Tags:	7.1.3, replication, sql node restart, telco-7.1 bzr

Description:
We have multi-master replication between two machines.  Initially the slave status shows everything is fine.  After an SQL node is restarted (/etc/init.d/mysql restart), the slave on that SQL node get an error:

100413 10:15:36 [ERROR] Error reading master configuration
100413 10:15:36 [ERROR] Failed to initialize the master info structure

This was working fine with cluster version 7.0.9 and earlier.  Setting an explicit entry in my.cnf for master-info-file has no effect.  For simplicity I reduced the configuration to two machines, with each machine running a single storage node.  It fails the same way with the configuration of two macnines in each cluster and replication between the clusters (total of 4 machines).  After the restart the storage and sql nodes are up but the slave has not started.

The my.cnf file content for 'andorinha' is:
# MySQL Cluster configuation file. 
[mysqld]
ndbcluster   # Run the NDB storage engine
ndb-connectstring=andorinha.isxinc.com

server_id=3
auto_increment_increment=10
auto_increment_offset=3
binlog-format=ROW
log-bin=row
binlog-ignore-db=test
ndb-log-update-as-write=OFF
log-slave-updates
replicate-same-server-id=0
replicate-ignore-db=test
replicate-do-table=mysql.ndb_apply_status
replicate-wild-ignore-table=mysql.%
relay-log=/var/lib/mysql/andorinha.isxinc.com_rl
relay-log-index=/var/lib/mysql/andorinha.isxinc.com_rl1
relay-log-info-file=/var/lib/mysql/relay-log.info
slave-net-timeout=60

[mysql_cluster]
ndb-connectstring=andorinha.isxinc.com

The my.cnf file content for 'canario' is:
# MySQL Cluster configuation file.
[mysqld]
ndbcluster   # Run the NDB storage engine
ndb-connectstring=canario.isxinc.com

server_id=1
auto_increment_increment=10
auto_increment_offset=1
binlog-format=ROW
log-bin=row
binlog-ignore-db=test
ndb-log-update-as-write=OFF
log-slave-updates
replicate-same-server-id=0
replicate-ignore-db=test
replicate-do-table=mysql.ndb_apply_status
replicate-wild-ignore-table=mysql.%
relay-log=/var/lib/mysql/canario.isxinc.com_rl
relay-log-index=/var/lib/mysql/canario.isxinc.com_rl1
relay-log-info-file=/var/lib/mysql/relay-log.info
slave-net-timeout=60

[mysql_cluster]
ndb-connectstring=canario.isxinc.com
===============================================
The config.ini file (on 'andorinha') :
[tcp default]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[ndbd default]
NoOfReplicas=1
MaxNoOfConcurrentOperations=100000
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
RedoBuffer=32M
DataMemory=427M
IndexMemory=85M

[ndb_mgmd]
HostName=andorinha.isxinc.com
datadir=/var/lib/mysql-cluster

[ndbd]
HostName=andorinha.isxinc.com
datadir=/usr/local/mysql/data
[mysqld]

==============================================

The error messages from the slave error file after the restart:
...
100413 13:53:45 mysqld_safe mysqld from pid file /var/lib/mysql/andorinha.isxinc.com.pid ended
100413 13:53:45 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
100413 13:53:45 [Note] Plugin 'FEDERATED' is disabled.
100413 13:53:45  InnoDB: Started; log sequence number 0 44233
100413 13:53:45 [Note] NDB: NodeID is 3, management server 'andorinha.isxinc.com:1186'
100413 13:53:46 [Note] NDB[0]: NodeID: 3, no storage nodes connected (timed out)
100413 13:53:46 [Note] Starting Cluster Binlog Thread
100413 13:53:46 [ERROR] Error reading master configuration
100413 13:53:46 [ERROR] Failed to initialize the master info structure
100413 13:53:46 [Note] Event Scheduler: Loaded 0 events
100413 13:53:47 [Note] NDB Binlog: Ndb tables initially read only.
100413 13:53:47 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema
100413 13:53:47 [Note] NDB Binlog: logging ./mysql/ndb_schema (UPDATED,USE_UPDATE)
100413 13:53:47 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status
100413 13:53:47 [Note] NDB Binlog: logging ./mysql/ndb_apply_status (UPDATED,USE_UPDATE)
2010-04-13 13:53:47 [NdbApi] INFO     -- Flushing incomplete GCI:s < 199/0
2010-04-13 13:53:47 [NdbApi] INFO     -- Flushing incomplete GCI:s < 199/0
100413 13:53:47 [Note] NDB Binlog: starting log at epoch 199/0
100413 13:53:47 [Note] NDB Binlog: ndb tables writable
100413 13:53:47 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.1.44-ndb-7.1.3-cluster-gpl-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Cluster Server (GPL)

The output of SHOW SLAVE STATUS after the restart
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: canario.isxinc.com
                  Master_User: repl
                  Master_Port: 0
                Connect_Retry: 0
              Master_Log_File: row.000001
          Read_Master_Log_Pos: 0
               Relay_Log_File:
                Relay_Log_Pos: 0
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB: test
           Replicate_Do_Table: mysql.ndb_apply_status
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
                  Master_Bind:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0

How to repeat:
A single master/slave setup is sufficient.  Do the 'change master' on the slave, and start the slave.  Perform SHOW SLAVE STATUS \G; and verify that "Slave_IO_Running" and "Slave_SQL_Running" are both "YES".

Restart the sql node where the slave is running, for example "/etc/init.d/mysql restart".  Perform SHOW SLAVE STATUS \G; and see that "Slave_IO_Running" and "Slave_SQL_Running" are both "NO" and the error file in /var/lib/mysql shows the two error messages listed in the description section.

Retry this on version 7.0.9 and earlier and the error does not occur.

Thank you for the report.

Verified as described.

We are having the same bug with 7.1.3 (MySQL 5.1.44), and also was having the problem with 7.0.13 (MySQL 5.1.41) as described. It wasn't an issue in 7.0.9 (MySQL 5.1.39)

Please fix.

We have the same bug (mysql cluster 7.1.3).

Replication is running ok. We do a mysql stop and mysql start, and replication doesn't work anymore.

>start slave;
ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log

>show slave status;
+----------------+--------------+-------------+-------------+---------------+------------------+---------------------+----------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-------------+-----------------------------+------------------+
| Slave_IO_State | Master_Host  | Master_User | Master_Port | Connect_Retry | Master_Log_File  | Read_Master_Log_Pos | Relay_Log_File | Relay_Log_Pos | Relay_Master_Log_File | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error | Master_Bind | Replicate_Ignore_Server_Ids | Master_Server_Id |
+----------------+--------------+-------------+-------------+---------------+------------------+---------------------+----------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-------------+-----------------------------+------------------+
|                | 192.168.10.7 | repl        |           0 |             0 | mysql-bin.001320 |                   0 |                |             0 |                       | No               | No                | db   |                     |                    |                        | db.%         |                             |          0 |            |            0 |                   0 |               0 | None            |                |             0 | No                 |                    |                    |                 |                   |                |                  NULL | No                            |             0 |               |              0 |                |             |                             |                0 |
+----------------+--------------+-------------+-------------+---------------+------------------+---------------------+----------------+---------------+-----------------------+------------------+-------------------+-----------------+---------------------+--------------------+------------------------+-------------------------+-----------------------------+------------+------------+--------------+---------------------+-----------------+-----------------+----------------+---------------+--------------------+--------------------+--------------------+-----------------+-------------------+----------------+-----------------------+-------------------------------+---------------+---------------+----------------+----------------+-------------+-----------------------------+------------------+

Seems the master.info file on slave had two lines swapped, making patch and adding new test case.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107634

3192 Martin Skold	2010-05-06
      Bug#52859 Replication fails to start after an SQL node restart: Lines for heartbeat period and bind addr were swapped in master.info
      modified:
        sql/rpl_mi.cc

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107635

3193 Martin Skold	2010-05-06
      Bug#52859 Replication fails to start after an SQL node restart: Added new test for restarting slave
      added:
        mysql-test/suite/rpl_ndb/r/rpl_ndb_slave_restart.result
        mysql-test/suite/rpl_ndb/t/rpl_ndb_slave_restart.test

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107637

3525 Martin Skold	2010-05-06 [merge]
      Merge
      added:
        mysql-test/suite/rpl_ndb/r/rpl_ndb_slave_restart.result
        mysql-test/suite/rpl_ndb/t/rpl_ndb_slave_restart.test
      modified:
        sql/rpl_mi.cc

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107638

3594 Martin Skold	2010-05-06 [merge]
      Merge
      added:
        mysql-test/suite/rpl_ndb/r/rpl_ndb_slave_restart.result
        mysql-test/suite/rpl_ndb/t/rpl_ndb_slave_restart.test
      modified:
        sql/rpl_mi.cc

Pushed into 5.1.44-ndb-7.0.15 (revid:martin.skold@mysql.com-20100506123735-3zd8rkua1ycl569c) (version source revid:martin.skold@mysql.com-20100506122822-2hntg1bkhkxxsh1m) (merge vers: 5.1.44-ndb-7.0.15) (pib:16)

Pushed into 5.1.44-ndb-6.3.34 (revid:martin.skold@mysql.com-20100506124030-01g0x6mi14sbb92g) (version source revid:martin.skold@mysql.com-20100506121756-37ny8k4n3fmft5oq) (merge vers: 5.1.44-ndb-6.3.34) (pib:16)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107816

3529 Martin Skold	2010-05-10
      Bug#52859 Replication fails to start after an SQL node restart: Fixed test case to work in Windows
      modified:
        mysql-test/suite/rpl_ndb/r/rpl_ndb_slave_restart.result
        mysql-test/suite/rpl_ndb/t/rpl_ndb_slave_restart.test

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107817

3600 Martin Skold	2010-05-10 [merge]
      Merge
      modified:
        mysql-test/suite/rpl_ndb/r/rpl_ndb_slave_restart.result
        mysql-test/suite/rpl_ndb/t/rpl_ndb_slave_restart.test

Documented bugfix in the NDB-6.3.34, 7.0.15, and 7.1.4 changelogs, as follows:

      Replication failed after a restart of the slave SQL node,
      due to an error in writing the master.info file.

Closed.