Bug #38433 SQL slave thread dies, until 0 sec behind, then works fine!
Submitted: 29 Jul 2008 17:17 Modified: 1 Sep 2008 6:10
Reporter: Martin Houle Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.0.45 OS:Linux (CentOS 4.5)
Assigned to: Assigned Account CPU Architecture:Any

[29 Jul 2008 17:17] Martin Houle
Description:
Hi there,

I am experiencing a serious problem with replication, and unfortunately, there is absolutely no precise errors code or error descriptions in either the error logs, show engine innodb status or show slave status. I'm running out of ideas.

SETUP
My Setup consists of 2 masters (Mx) many slaves (Sx)
 - all version 5.0.45
 - all same hardware, except for 2 slaves

(S1, S2, S3) <- M1 <-> M2 -> (S4, S5, S6) 

PROBLEM DESCRIPTION
Replication works A1 across ALL machines normally. The problem show-up when I change my replication flow (point a slave to another master, or build a new slave)

-- On any replicated slave, I can:
mysql> STOP SLAVE;(wait a while)
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G 
          Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
          ...
          Seconds_Behind_Master: 0

-- On any existing slave, or new slave:
Any change of master params with CHANGE MASTER TO MASTER_.... will cause my slave to connect IO & SQL threads, and within a second, SQL thread will stop, and leave NO ERROR!! argh!

mysql> CHANGE MASTER TO MASTER_HOST = 'm1.mydomain.com', MASTER_.....;
mysql> START SLAVE;
mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Queueing master event to the relay log
                Master_Host: m1.mydomain.com
                Master_User: node_S2
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.017577
        Read_Master_Log_Pos: 16698333
             Relay_Log_File: node_S2-relay-bin.000002
              Relay_Log_Pos: 4670
      Relay_Master_Log_File: mysql-bin.017560
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 235
            Relay_Log_Space: 366760615
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL

mysql> STOP SLAVE;
mysql> show engine innodb status; (reports NO errors)
mysql> exit
bash> tail -n 20 mysql.err

080729  8:24:18 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.017560' at position 233, relay log './node_S2-relay-bin.000002' position: 3812
080729  8:24:18 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.017560' at position 296
080729  8:24:18 [Note] Slave I/O thread: connected to master 'node_S2@m1.mydomain.com:3306',  replication started in log 'mysql-bin.017565' at position 837919
080729  8:24:54 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.017574', position 15037820
080729  8:26:04 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.017560' at position 296, relay log './node_S2-relay-bin.000002' position: 4135
080729  8:26:04 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.017560' at position 246
080729  8:26:04 [Note] Slave I/O thread: connected to master 'node_S2@m1.mydomain.com:3306',  replication started in log 'mysql-bin.017574' at position 15037820
080729  8:26:10 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.017576', position 20200884
080729 12:51:30 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.017560' at position 246, relay log './node_S2-relay-bin.000002' position: 4408
080729 12:51:30 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.017560' at position 235
080729 12:51:30 [Note] Slave I/O thread: connected to master 'node_S2@m1.mydomain.com:3306',  replication started in log 'mysql-bin.017576' at position 20200884

SOLUTION (QUICK FIX)
Using these ignores:
          replicate-ignore-table          = db.nowplaying_property
          replicate-ignore-table          = db.nowplaying_event
I can then
          start slave (until: Seconds_Behind_Master: 0)
          remove the ignores from my.cnf
          restart mysql server
          start slave
and all is good, except that I have to re-insert all data that was ignored.

See my tables in "How to repeat"

If anyone can suggest some tests or fix, I would really appreciate

Thanks in advance

Martin

How to repeat:
CREATE TABLE `nowplaying_event` (
  `nowplaying_event_id` bigint(20) unsigned NOT NULL auto_increment,
  `nowplaying_eventtype_key` varchar(64) NOT NULL,
  `mount_id` int(11) NOT NULL,
  `src_node_id` int(11) unsigned NOT NULL,
  `src_node_type_key` varchar(20) NOT NULL,
  `nowplaying_event_date` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY  (`nowplaying_event_id`),
  KEY `nowplaying_event-is-a-nowplaying_eventtype` (`nowplaying_eventtype_key`),
  KEY `nowplaying_event-comefrom-node_type` (`src_node_type_key`),
  CONSTRAINT `nowplaying_event-comefrom-node_type` FOREIGN KEY (`src_node_type_key`) REFERENCES `node_type` (`node_type_key`),
  CONSTRAINT `nowplaying_event-is-a-nowplaying_eventtype` FOREIGN KEY (`nowplaying_eventtype_key`) REFERENCES `nowplaying_eventtype` (`nowplaying_eventtype_key`)
) ENGINE=InnoDB AUTO_INCREMENT=184638174 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

CREATE TABLE `nowplaying_eventtype` (
  `nowplaying_eventtype_key` varchar(64) NOT NULL,
  `nowplaying_eventtype_description` varchar(256) default NULL,
  PRIMARY KEY  (`nowplaying_eventtype_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

CREATE TABLE `nowplaying_property` (
  `nowplaying_property_key` varchar(64) NOT NULL,
  `nowplaying_event_id` bigint(20) unsigned NOT NULL,
  `nowplaying_property_value` varchar(512) default NULL,
  PRIMARY KEY  (`nowplaying_property_key`,`nowplaying_event_id`),
  KEY `nowplaying_property-belongto-nowplaying_event` (`nowplaying_event_id`),
  CONSTRAINT `nowplaying_property-belongto-nowplaying_event` FOREIGN KEY (`nowplaying_event_id`) REFERENCES `nowplaying_event` (`nowplaying_event_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

CREATE TABLE `node_type` (
  `node_type_key` varchar(32) NOT NULL,
  PRIMARY KEY  (`node_type_key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

Suggested fix:
I wish I had one...
[29 Jul 2008 18:36] Sveta Smirnova
Thank you for the report.

But version 5.0.45 is a bit old. Please try current version and if problem still exists provide output of next statements:

1. On running slave before CHANGE MASTER: SHOW SLAVE STATUS;
2. On running master you are going to change slave to: SHOW MASTER STATUS;
3. Full CHANGE MASTER command including all options (replace private data with something like ***)
4. SHOW SLAVE STATUS; after CHANGE MASTER and START SLAVE commands.
[31 Jul 2008 14:16] Martin Houle
Thank you for the quick response.

I wonder, which stable version should I upgrade my servers to, from v5.0.45?

Should I stay with 5.0.xx series, or is it safe to hop in 5.1.xx/6.0.xx?

I will gladly upgrade and test with your recommended version(s). This is production, I just want to ensure I'm using versions that will upgrade nicely.

Thank you
[1 Aug 2008 1:49] Martin Houle
Sorry guys, I read more about the versions of MySQL, I didn't know 5.1.x & 6.x were development releases

I'll head to 5.0.51a and I'll post the results.
[1 Aug 2008 6:09] Susanne Ebrecht
Hello Martin,

our "stable" release at the moment is MySQL 5.0.51b. Unfortunately, this is only available as source.
[1 Sep 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".