MySQL Bugs: #28421: Infinite loop on slave relay logs

Bug #28421	Infinite loop on slave relay logs
Submitted:	14 May 2007 16:56	Modified:	12 Mar 2010 17:55
Reporter:	Baron Schwartz (Basic Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.38, 5.1.40	OS:	Linux (Gentoo)
Assigned to:	Libing Song	CPU Architecture:	Any
Tags:	binlog, qc, relay log, replication, rotate

Description:
MySQL got stuck in a loop on all three slaves that were attached to a particular master.  Each slave began creating new relay logs very quickly, filled up the available space, and failed with duplicate key violations.  The duplicated data did exist on the slave already (it's not table corruption), and the slaves were trying to replay statements that were contained in earlier relay logs.

Here is a peek into the files on fresno, one of the slaves:

-rw-rw----  1 mysql mysql 1073746132 May 14 11:31 fresno-relay-bin.000214
-rw-rw----  1 mysql mysql     905125 May 14 11:31 fresno-relay-bin.000215
-rw-rw----  1 mysql mysql        189 May 14 11:31 fresno-relay-bin.000216
-rw-rw----  1 mysql mysql 1073746132 May 14 11:32 fresno-relay-bin.000217
-rw-rw----  1 mysql mysql     905125 May 14 11:32 fresno-relay-bin.000218
-rw-rw----  1 mysql mysql        189 May 14 11:32 fresno-relay-bin.000219
-rw-rw----  1 mysql mysql 1073746132 May 14 11:33 fresno-relay-bin.000220
-rw-rw----  1 mysql mysql     905125 May 14 11:34 fresno-relay-bin.000221
-rw-rw----  1 mysql mysql        189 May 14 11:34 fresno-relay-bin.000222
-rw-rw----  1 mysql mysql 1073746132 May 14 11:35 fresno-relay-bin.000223
-rw-rw----  1 mysql mysql     905125 May 14 11:35 fresno-relay-bin.000224
-rw-rw----  1 mysql mysql        189 May 14 11:35 fresno-relay-bin.000225
-rw-rw----  1 mysql mysql 1073746132 May 14 11:35 fresno-relay-bin.000226
-rw-rw----  1 mysql mysql     905125 May 14 11:36 fresno-relay-bin.000227
-rw-rw----  1 mysql mysql        189 May 14 11:36 fresno-relay-bin.000228
-rw-rw----  1 mysql mysql 1073746132 May 14 11:36 fresno-relay-bin.000229
-rw-rw----  1 mysql mysql     905125 May 14 11:37 fresno-relay-bin.000230
-rw-rw----  1 mysql mysql        189 May 14 11:37 fresno-relay-bin.000231
-rw-rw----  1 mysql mysql 1073746132 May 14 11:38 fresno-relay-bin.000232
-rw-rw----  1 mysql mysql     905125 May 14 11:38 fresno-relay-bin.000233
-rw-rw----  1 mysql mysql        189 May 14 11:38 fresno-relay-bin.000234
-rw-rw----  1 mysql mysql 1073746132 May 14 11:39 fresno-relay-bin.000235
-rw-rw----  1 mysql mysql     905125 May 14 11:39 fresno-relay-bin.000236
-rw-rw----  1 mysql mysql        189 May 14 11:39 fresno-relay-bin.000237
-rw-rw----  1 mysql mysql  218174660 May 14 12:04 fresno-relay-bin.000238
-rw-rw----  1 mysql mysql        650 May 14 11:39 fresno-relay-bin.index

Notice how quickly the new files were being rolled.  This is way faster than logs roll on the master; we do several a day on that master, but not one every minute.

Some of the files seem corrupt:

baron@fresno ~ $ mysqlbinlog fresno-relay-bin.000227 | head -n 50
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#070514 11:36:02 server id 40  end_log_pos 98   Start: binlog v 4, server v 5.0.38-log created 070514 11:36:02
# at 98
#070514  6:43:03 server id 11  end_log_pos 0    Start: binlog v 4, server v 5.0.38-log created 070514  6:43:03
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
Warning: ignoring Append_block as there is no Create_file event for file_id: 14154
[snip]

There is an infinite loop in the Rotate header section of these relay logs:

baron@fresno ~ $ for a in fresno-relay-bin.0*; do echo $a; mysqlbinlog $a | head -n 50 | grep Rotate; done > rotate
baron@fresno ~ $ cat rotate
fresno-relay-bin.000214
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000215
fresno-relay-bin.000216
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:31:25 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000217  pos: 4
fresno-relay-bin.000217
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000218
fresno-relay-bin.000219
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:32:53 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000220  pos: 4
fresno-relay-bin.000220
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000221
fresno-relay-bin.000222
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:34:01 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000223  pos: 4
fresno-relay-bin.000223
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000224
fresno-relay-bin.000225
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:35:02 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000226  pos: 4
fresno-relay-bin.000226
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000227
fresno-relay-bin.000228
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:36:02 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000229  pos: 4
fresno-relay-bin.000229
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000230
fresno-relay-bin.000231
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:37:01 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000232  pos: 4
fresno-relay-bin.000232
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000233
fresno-relay-bin.000234
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:38:02 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000235  pos: 4
fresno-relay-bin.000235
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4
fresno-relay-bin.000236
fresno-relay-bin.000237
#070514 11:30:03 server id 11  end_log_pos 1074650871   Rotate to usa-bin.000298  pos: 4
#070514 11:39:11 server id 40  end_log_pos 189  Rotate to fresno-relay-bin.000238  pos: 4
fresno-relay-bin.000238
#691231 19:00:00 server id 11  end_log_pos 139  Rotate to usa-bin.000297  pos: 4

How to repeat:
I don't want to try to repeat this :-)

Another possibly related detail: since all three of this master's slaves behaved the same way, it suggests it's a problem coming from the master.  The only unusual thing that happened today on the master is I edited usa-bin.index and removed files that had been rm-ed from the disk, then ran PURGE MASTER LOGS on the master (see bug #28238).  I can give more details on this if it seems relevant.

This has happened again, this time to fresno's slave (nepal used to be a slave of usa, but I chained it off fresno for various reasons):

-rw-rw----  1 mysql mysql 1073741938 Jun 16 04:10 nepal-relay-bin.000319
-rw-rw----  1 mysql mysql     475958 Jun 16 04:10 nepal-relay-bin.000320
-rw-rw----  1 mysql mysql        191 Jun 16 04:10 nepal-relay-bin.000321
-rw-rw----  1 mysql mysql 1073741938 Jun 16 04:12 nepal-relay-bin.000322
-rw-rw----  1 mysql mysql     475958 Jun 16 04:12 nepal-relay-bin.000323
-rw-rw----  1 mysql mysql        191 Jun 16 04:12 nepal-relay-bin.000324
... snip many lines, each about 2-3 minutes apart...
-rw-rw----  1 mysql mysql        191 Jun 16 06:07 nepal-relay-bin.000543
-rw-rw----  1 mysql mysql 1073741938 Jun 16 06:08 nepal-relay-bin.000544
-rw-rw----  1 mysql mysql     475958 Jun 16 06:08 nepal-relay-bin.000545
-rw-rw----  1 mysql mysql        191 Jun 16 06:08 nepal-relay-bin.000546

The output of SHOW SLAVE STATUS reports the SQL thread is stopped with a duplicate key error, because the relay log has already executed events.  The events are duplicated in the relay logs.  The relay logs just kept getting fetched and written, as you can see above, until the disk filled up.

I consider this a very serious bug.  Has anyone heard of this happening to anyone else?  Surely it's not just me.

More information on the most recent occurrence: on the master, SHOW MASTER STATUS returns nothing, and SHOW MASTER LOGS says binary logging is not enabled (though it clearly is).

In the server's error log,

070616  4:09:00 [ERROR] Could not use fresno-bin for logging (error 17).
Turning logging off for the whole duration of the MySQL server process. To
turn it on again: fix the cause, shutdown the MySQL server and restart it. 

Error 17 is "file exists" according to perror.  It almost looks like the master filled log X, got done and wrote a final "next, go to log X" into the binlog, then tried to go open log X itself and failed because it's the one it was just using.  Then the slave fetched log X, hit the end, fetched log X again...

In this case, "log X" is /var/lib/mysql/fresno-bin.000106.  The end of the log looks okay to me, though:

# at 1074217468
#070616  4:09:00 server id 40  end_log_pos 1074217512   Rotate to fresno-bin.000107  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Thank you for the report.

Please provide configuration files for master and slave.

Slave (nepal.cnf) and master (fresno.cnf) are attached.  If you need to see the binlog mentioned (.....0106) I have saved it too.

Thank you for the feedback.

Please attach the binary log too.

Binlog uploaded via FTP as bug-data-28421.tar.gz.

I'd just like to add that I have seen this 2 times myself.
We are running 5.0.41

Interestingly enough we were able to restore one of our slaves to a database snapshot that was minutes before it happened and restart the slave, but it did not happen the second time. 

To me it implies that this happens when the master rotates the binlogs and the slave is gathering statements for its relay-bin.log

Unfortunately it was not reproducible on the same machine with the same dataset :(

Many thanks for writing a bug report.

Unfortunately we can't repeat this. Please try newest version of MySQL (5.0.67).

I have repeated this somewhat (slave continuously creating relay log files). This test shows it with very small relay logs only but likely the same root cause.

Starting with master -> slave hosts replicating.

# get some binlogs in the .index file
master> flush logs; flush logs;

rm master-bin.000002
master> flush logs;

# edit master-bin.index file, remove 000002, and flush logs again, and slave relaylogs repeat over and over; (rapidly to fill disk or until stop slave etc)

vi master-bin.index (delete the line with 000002 and save)
master> flush logs;

3-500 per second created;

master-bin.000004 contains only;
[root@dixon var]# mysqlbinlog master-bin.000004 
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#091110  8:10:33 server id 1  end_log_pos 106 	Start: binlog v 4, server v 5.1.40-enterprise-gpl-advanced-log created 091110  8:10:33
# at 106
#091110  8:11:16 server id 1  end_log_pos 150 	Rotate to master-bin.000005  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

The last two relay logs contain this;

[root@dixon var1]# mysqlbinlog dixon-relay-bin.001775 
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#091110  8:12:28 server id 2  end_log_pos 106 	Start: binlog v 4, server v 5.1.40-enterprise-gpl-advanced-log created 091110  8:12:28
# at 106
#700101  9:30:00 server id 1  end_log_pos 0 	Rotate to master-bin.000005  pos: 106
# at 150
#091110  8:11:16 server id 1  end_log_pos 0 	Start: binlog v 4, server v 5.1.40-enterprise-gpl-advanced-log created 091110  8:11:16
# at 252
#091110  8:12:38 server id 2  end_log_pos 301 	Rotate to dixon-relay-bin.001776  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
[root@dixon var1]# mysqlbinlog dixon-relay-bin.001776 
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#091110  8:12:38 server id 2  end_log_pos 106 	Start: binlog v 4, server v 5.1.40-enterprise-gpl-advanced-log created 091110  8:12:38
# at 106
#700101  9:30:00 server id 1  end_log_pos 0 	Rotate to master-bin.000005  pos: 106
# at 150
#091110  8:11:16 server id 1  end_log_pos 0 	Start: binlog v 4, server v 5.1.40-enterprise-gpl-advanced-log created 091110  8:11:16
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

# cat master-bin.index 
./master-bin.000001
./master-bin.000003
./master-bin.000004
./master-bin.000005

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: localhost
                  Master_User: slave_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin.000005
          Read_Master_Log_Pos: 106
               Relay_Log_File: dixon-relay-bin.001776
                Relay_Log_Pos: 252
        Relay_Master_Log_File: master-bin.000005
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 106
              Relay_Log_Space: 553
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
1 row in set (0.00 sec)

When the situation is reproduced, if stop slave; is issued, simply running start slave; again will continue the looping of relay logs.

Reset Master is one way to 'clear' the situation...

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/93489

3215 Li-Bing.Song@sun.com	2009-12-10
      BUG #28421 Infinite loop on slave relay logs
      
      That, manually deleteing one or more entries from 'master-bin.index', will
      cause master infinitely loop to send one binlog file. 
      
      When starting a dump session, master opens index file and search the binlog file
      which is being requested by the slave. The position of the binlog file in the
      index file is recorded. it will be used to find the next binlog file when current
      binlog file has been dumped completely. As only the position is used, it will can
      not get the correct file if some entries has been removed manually from the index file.
      the master will reopen the current binlog file which has been dump completely
      and redump it if it can not get the next binlog file's name from index file.
      It obviously is a logical error.
      
      
      Manually changing index file is a illegal action, so after this patch, master
      sends a fatal error to slave and close the dump session if a new binlog file
      has been generated and master can not get it from the index file.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/95941

3303 Li-Bing.Song@sun.com	2010-01-05
      BUG #28421 Infinite loop on slave relay logs
      
      Manually deleteing one or more entries from 'master-bin.index', will
      cause master infinitely loop to send one binlog file. 
      
      When starting a dump session, master opens index file and search the binlog file
      which is being requested by the slave. The position of the binlog file in the
      index file is recorded. it will be used to find the next binlog file when current
      binlog file has dumped completely. As only the position is used, it may
      not get the correct file if some entries has been removed manually from the index file.
      the master will reopen the current binlog file which has been dump completely
      and redump it if it can not get the next binlog file's name from index file.
      It obviously is a logical error.
      
      
      Even though it is allowed to manually change index file,
      but it is not recommended. so after this patch, master
      sends a fatal error to slave and close the dump session if a new binlog file
      has been generated and master can not get it from the index file.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/96344

3308 Li-Bing.Song@sun.com	2010-01-08
      BUG #28421 Infinite loop on slave relay logs
      
      Manually deleteing one or more entries from 'master-bin.index', will
      cause master infinitely loop to send one binlog file. 
      
      When starting a dump session, master opens index file and search the binlog file
      which is being requested by the slave. The position of the binlog file in the
      index file is recorded. it will be used to find the next binlog file when current
      binlog file has dumped completely. As only the position is used, it may
      not get the correct file if some entries has been removed manually from the index file.
      the master will reopen the current binlog file which has been dump completely
      and redump it if it can not get the next binlog file's name from index file.
      It obviously is a logical error.
      
      
      Even though it is allowed to manually change index file,
      but it is not recommended. so after this patch, master
      sends a fatal error to slave and close the dump session if a new binlog file
      has been generated and master can not get it from the index file.

Pushed to mysql-5.1-bugteam and mysql-pe

Pushed into 5.1.43 (revid:joro@sun.com-20100115085139-qkh0i0fpohd9u9p5) (version source revid:li-bing.song@sun.com-20100108054223-p0cnli6sp1npui3k) (merge vers: 5.1.42) (pib:16)

Noted in 5.1.43 changelog.

Manually removing entries from the binary log index file on a
replication master could cause the server to repeatedly send the same
binary log file to slaves. 

Setting report to NDI pending push to 5.5.x+.

Pushed into mysql-next-mr (revid:alik@sun.com-20100204063540-9czpdmpixi3iw2yb) (version source revid:alik@sun.com-20100119163614-172adculixyu26j5) (pib:16)

Pushed into 6.0.14-alpha (revid:alik@sun.com-20100205113942-oqovjy0eoqbarn7i) (version source revid:alik@sun.com-20100204064210-ljwanqvrjs83s1gq) (merge vers: 6.0.14-alpha) (pib:16)

Pushed into 5.5.2-m2 (revid:alik@sun.com-20100203172258-1n5dsotny40yufxw) (version source revid:alexey.kopytov@sun.com-20100115112653-e3a24041ag1cv6v3) (merge vers: 5.5.1-m2) (pib:16)

Noted in 5.5.2, 6.0.14 changelogs.

Setting report to Need Merge pending push into Celosia.

Pushed into 5.1.44-ndb-7.0.14 (revid:jonas@mysql.com-20100312135944-t0z8s1da2orvl66x) (version source revid:jonas@mysql.com-20100312115609-woou0te4a6s4ae9y) (merge vers: 5.1.44-ndb-7.0.14) (pib:16)

Pushed into 5.1.44-ndb-6.2.19 (revid:jonas@mysql.com-20100312134846-tuqhd9w3tv4xgl3d) (version source revid:jonas@mysql.com-20100312060623-mx6407w2vx76h3by) (merge vers: 5.1.44-ndb-6.2.19) (pib:16)

Pushed into 5.1.44-ndb-6.3.33 (revid:jonas@mysql.com-20100312135724-xcw8vw2lu3mijrhn) (version source revid:jonas@mysql.com-20100312103652-snkltsd197l7q2yg) (merge vers: 5.1.44-ndb-6.3.33) (pib:16)

Fixed in earlier 5.1.x, 5.5.x.