MySQL Bugs: #28238: expire_logs_days and PURGE MASTER LOGS fail when index not up to date

Bug #28238	expire_logs_days and PURGE MASTER LOGS fail when index not up to date
Submitted:	4 May 2007 13:20	Modified:	5 Aug 2010 14:05
Reporter:	Baron Schwartz (Basic Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:	5.0.38	OS:	Linux (Gentoo)
Assigned to:	Mats Kindahl	CPU Architecture:	Any
Tags:	binlog, expire_logs_days, master, purge, qc, replication

Description:
expire_logs_days and PURGE MASTER LOGS do not work for me in production, even though they worked on a laptop where I tried them.

I believe this is because the server in question is a master, and there are slaves attached to it.  It seems to work fine when there are no slaves attached.

How to repeat:
USA is the master server here, where the purging and expiring do not work.  First, I check on the command-line:

baron@usa /var/lib/mysql $ ls -l usa-bin*
-rw-rw---- 1 mysql mysql 1073793134 Apr 20 02:20 usa-bin.000167
-rw-rw---- 1 mysql mysql 1077593626 Apr 20 04:50 usa-bin.000168
[snip]
-rw-rw---- 1 mysql mysql  705424290 May  4 08:45 usa-bin.000239
-rw-rw---- 1 mysql mysql   45606607 May  4 08:55 usa-bin.000240
-rw-rw---- 1 mysql mysql       4080 May  4 08:49 usa-bin.index

We run PURGE MASTER LOGS BEFORE DATE_SUB(CURRENT_DATE, INTERVAL 9 DAY) every night on this server, so the earliest log should be Apr 25.  Now I connect to the mysql client:

baron@keywest ~ $ mysql -husa -uroot -p
Enter password: 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 27969
Server version: 5.0.38-log Gentoo Linux mysql-5.0.38

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show grants\G
*************************** 1. row ***************************
Grants for root@%: GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY PASSWORD .... WITH GRANT OPTION
1 row in set (0.00 sec)

mysql> show master logs;
+----------------+------------+
| Log_name       | File_size  |
+----------------+------------+
| usa-bin.000001 |          0 | 
| usa-bin.000002 |          0 | 
| usa-bin.000003 |          0 | 
[snip] these are the un-pruned entries caused by using rm on the disk
| usa-bin.000165 |          0 | 
| usa-bin.000166 |          0 | 
| usa-bin.000167 | 1073793134 | 
| usa-bin.000168 | 1077593626 | 
| usa-bin.000169 | 1074431032 | 
[snip]
| usa-bin.000238 | 1074221318 | 
| usa-bin.000239 |  705424290 | 
| usa-bin.000240 |   48554652 | 
+----------------+------------+
240 rows in set (0.00 sec)

mysql> PURGE MASTER LOGS BEFORE DATE_SUB(CURRENT_DATE, INTERVAL 10 DAY);
Query OK, 0 rows affected (0.23 sec)

mysql> show master logs;
+----------------+------------+
[ snip ]
+----------------+------------+
240 rows in set (0.00 sec)

-- No logs were purged -- there are still 240 of them

mysql> select date_sub(current_date, interval 10 day), version();
+-----------------------------------------+------------+
| date_sub(current_date, interval 10 day) | version()  |
+-----------------------------------------+------------+
| 2007-04-24                              | 5.0.38-log | 
+-----------------------------------------+------------+

mysql> flush logs;
Query OK, 0 rows affected (0.14 sec)

mysql> show master logs;
+----------------+------------+
| Log_name       | File_size  |
+----------------+------------+
| usa-bin.000001 |          0 | 
[snip]
| usa-bin.000239 |  705424290 | 
| usa-bin.000240 |   62867054 | 
| usa-bin.000241 |        539 | 
+----------------+------------+
241 rows in set (0.00 sec)

Now I head back to the disk:

baron@usa /var/lib/mysql $ ls -l usa-bin*
-rw-rw---- 1 mysql mysql 1073793134 Apr 20 02:20 usa-bin.000167
-rw-rw---- 1 mysql mysql 1077593626 Apr 20 04:50 usa-bin.000168
[snip]
-rw-rw---- 1 mysql mysql 1074238278 May  3 21:41 usa-bin.000236
-rw-rw---- 1 mysql mysql 1074113820 May  4 02:30 usa-bin.000237
-rw-rw---- 1 mysql mysql 1074221318 May  4 05:34 usa-bin.000238
-rw-rw---- 1 mysql mysql  705424290 May  4 08:45 usa-bin.000239
-rw-rw---- 1 mysql mysql   62867054 May  4 09:00 usa-bin.000240
-rw-rw---- 1 mysql mysql   15432265 May  4 09:01 usa-bin.000241
-rw-rw---- 1 mysql mysql       4097 May  4 09:00 usa-bin.index
baron@usa /var/lib/mysql $ 

Clearly no logs were purged.  Now I connect to another server named fresno, which is not a master and doesn't have binary logging.  I re-configure it to enable binary logging, and restart it.  Here are the logs:

fresno mysql # ls -l fresno-bin*
-rw-rw---- 1 mysql mysql 98 May  4 09:07 fresno-bin.000001
-rw-rw---- 1 mysql mysql 20 May  4 09:07 fresno-bin.index

Now I connect to the command-line mysql client and execute the same SQL:

baron@keywest ~ $ mysql -hfresno -uroot -p
Enter password: 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 5.0.38-log Gentoo Linux mysql-5.0.38

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| fresno-bin.000001 |        98 | 
+-------------------+-----------+
1 row in set (0.02 sec)

mysql> flush logs;
Query OK, 0 rows affected (0.13 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| fresno-bin.000001 |       142 | 
| fresno-bin.000002 |        98 | 
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> PURGE MASTER LOGS BEFORE current_date + interval 1 day;          
Query OK, 0 rows affected (0.01 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| fresno-bin.000002 |        98 | 
+-------------------+-----------+
1 row in set (0.00 sec)

mysql> Bye

Now I look at the disk again:

fresno mysql # ls -l fresno-bin*
-rw-rw---- 1 mysql mysql 98 May  4 09:08 fresno-bin.000002
-rw-rw---- 1 mysql mysql 20 May  4 09:09 fresno-bin.index
fresno mysql # 

It seems to have worked fine on the server that has no slaves connected to it.

Suggested fix:
I think the problem here might be related to how MySQL determines if a slave is reading from a binary log.  On USA, the master:

mysql> show slave hosts;
+-----------+----------+------+-------------------+-----------+
| Server_id | Host     | Port | Rpl_recovery_rank | Master_id |
+-----------+----------+------+-------------------+-----------+
|         9 | nepal    | 3306 |                 0 |        11 | 
|        40 | fresno   | 3306 |                 0 |        11 | 
|        42 | portland | 3306 |                 0 |        11 | 
+-----------+----------+------+-------------------+-----------+

However, we just in the last few days completely rebuilt all three of these slave machines, and they have just been restarted, so there is no way they are still holding binary logs open from April 20th.  Here's what innotop shows:

_________________________ Slave SQL Status _________________________
CXN       Master  On?  TimeLag  Catchup  Temp  Relay Pos  Last Error
fresno    usa     Yes    00:00     0.00     0   23118767            
nepal     usa     Yes    00:00     0.00     0   56242217            
portland  usa     Yes    00:00     0.00     0   56242217            

____________________ Slave I/O Status _____________________
CXN       Master  On?  File            Relay Size  Pos     
fresno    usa     Yes  usa-bin.000241      22.05M  56242082
nepal     usa     Yes  usa-bin.000241      53.64M  56242082
portland  usa     Yes  usa-bin.000241      53.64M  56242082

_________________ Master Status _________________
CXN     File               Position  Binlog Cache
fresno  fresno-bin.000002        98         0.00%
usa     usa-bin.000241     56242082         0.53%

Whoops, I messed up the report a little bit.  After running FLUSH LOGS on the master (usa), I ran PURGE MASTER LOGS before going back and looking at the disk again.  Just in case that's not clear.

In fact, I only copied a small fraction of the different kinds of commands I really tried running, but none of them purged any logs.  I just didn't want the bug report to fill up with too much text.

Hi Baron,

Initially, I can't repeat this, even with a slave connected (on 5.0.40):

/* Master data directory */
medusa:/usr/local/mysql root# ls -l data
total 58296
-rw-rw----    1 mysql  wheel   5242880 May  4 14:38 ib_logfile0
-rw-rw----    1 mysql  wheel   5242880 May  2 22:27 ib_logfile1
-rw-rw----    1 mysql  wheel  18874368 May  4 14:29 ibdata1
-rw-rw----    1 mysql  wheel    102541 May  4 14:44 medusa-bin.000001
-rw-rw----    1 mysql  wheel    102517 May  4 14:44 medusa-bin.000002
-rw-rw----    1 mysql  wheel    102517 May  4 14:44 medusa-bin.000003
-rw-rw----    1 mysql  wheel    102517 May  4 14:44 medusa-bin.000004
-rw-rw----    1 mysql  wheel     51338 May  4 14:44 medusa-bin.000005
-rw-rw----    1 mysql  wheel       200 May  4 14:44 medusa-bin.index
-rw-rw----    1 mysql  wheel         6 May  4 14:38 medusa.pid
drwxr-x---   53 mysql  wheel      1802 May  2 22:26 mysql
drwxr-x---   12 mysql  wheel       408 May  3 16:42 test

/* Slave data directory */
medusa:/usr/local/mysql root# ls -l data2
total 57480
-rw-r-----    1 mysql  wheel   5242880 May  4 14:39 ib_logfile0
-rw-r-----    1 mysql  wheel   5242880 May  4 14:31 ib_logfile1
-rw-r-----    1 mysql  wheel  18874368 May  4 14:38 ibdata1
-rw-rw----    1 mysql  wheel        68 May  4 14:45 master.info
-rw-rw----    1 mysql  wheel     51476 May  4 14:45 medusa-relay-bin.000010
-rw-rw----    1 mysql  wheel        47 May  4 14:44 medusa-relay-bin.index
-rw-rw----    1 mysql  wheel         6 May  4 14:39 medusa.pid
drwxr-x---   53 mysql  wheel      1802 May  4 14:31 mysql
-rw-rw----    1 mysql  wheel        79 May  4 14:45 relay-log.info
drwxr-x---   12 mysql  wheel       408 May  4 14:31 test

medusa:/usr/local/mysql root# more master-my.cnf 
[mysqld]

basedir = /usr/local/mysql/
datadir = /usr/local/mysql/data

server-id = 1

log-bin
max_binlog_size = 100K
expire_logs_days = 2

medusa:/usr/local/mysql root# more slave-my.cnf 
[mysqld]

port = 3307
socket = /tmp/mysql-slave.sock

basedir = /usr/local/mysql
datadir = /usr/local/mysql/data2

server-id = 2

/* Switch dates */
medusa:/usr/local/mysql root# date
Fri May  4 14:45:55 BST 2007
medusa:/usr/local/mysql root# date
Tue May  8 14:46:17 BST 2007

/* Generate more binlogs */
medusa:/usr/local/mysql root# while [ 1 ] ; do    mysql -u root test -e 'insert into binlog_test (i,j) values (1,1)' ; done
^C

/* Check data directories */
medusa:/usr/local/mysql root# ls -l data
total 57888
-rw-rw----    1 mysql  wheel   5242880 May  4 14:38 ib_logfile0
-rw-rw----    1 mysql  wheel   5242880 May  2 22:27 ib_logfile1
-rw-rw----    1 mysql  wheel  18874368 May  4 14:29 ibdata1
-rw-rw----    1 mysql  wheel    102517 May  8 14:46 medusa-bin.000005
-rw-rw----    1 mysql  wheel    102517 May  8 14:46 medusa-bin.000006
-rw-rw----    1 mysql  wheel     55538 May  8 14:47 medusa-bin.000007
-rw-rw----    1 mysql  wheel       120 May  8 14:46 medusa-bin.index
-rw-rw----    1 mysql  wheel         6 May  4 14:38 medusa.pid
drwxr-x---   53 mysql  wheel      1802 May  2 22:26 mysql
drwxr-x---   12 mysql  wheel       408 May  3 16:42 test
medusa:/usr/local/mysql root# ls -l data2
total 57488
-rw-r-----    1 mysql  wheel   5242880 May  4 14:39 ib_logfile0
-rw-r-----    1 mysql  wheel   5242880 May  4 14:31 ib_logfile1
-rw-r-----    1 mysql  wheel  18874368 May  4 14:38 ibdata1
-rw-rw----    1 mysql  wheel        68 May  8 14:47 master.info
-rw-rw----    1 mysql  wheel     55676 May  8 14:47 medusa-relay-bin.000014
-rw-rw----    1 mysql  wheel        47 May  8 14:46 medusa-relay-bin.index
-rw-rw----    1 mysql  wheel         6 May  4 14:39 medusa.pid
drwxr-x---   53 mysql  wheel      1802 May  4 14:31 mysql
-rw-rw----    1 mysql  wheel        79 May  8 14:47 relay-log.info
drwxr-x---   12 mysql  wheel       408 May  4 14:31 test

So it seems that the master's binary logs are *still* getting purged, even with a slave connected.

Could you provide the my.cnf files for the servers in question please? 

Cheers,

Mark

I just added fresno's and usa's my.cnf files to the bug (I also added nepal's and called it fresno's, sorry... I'm not doing so well this morning).

I can't see significant differences between them.

We don't have expire_logs_days set anymore in these cnf files.  The testing I did with that a week or so ago didn't work, so I disabled it.  Today's testing is only on the effect of PURGE MASTER LOGS.

Thanks for being patient with me as I proceed to make a completely unintelligible bug report even worse :-/

I think I have found the bug. Whether this is a bug is now open to question.

I hooked up another server to fresno and was able to confirm that PURGE MASTER LOGS worked even when there is a slave server reading from it. So, my earlier theory is wrong, and Paul DuBois and Mark Leith's inability to repro makes sense. I also went through all the config differences between usa and fresno; there is only really one (query cache size). Changing fresno's query cache size to match usa's didn't make PURGE MASTER LOGS stop working on fresno.

I was maintaining the log files on 'usa', the master, with a typical find|rm cron job. The usa-bin.index file was not being updated as a result, and was out of sync with the filesystem; its first entry is ./usa-bin.000001. This is also shown by SHOW MASTER LOGS showing files usa-bin.000001 through current, with zero size for the ones the cron job removed.

When I run PURGE MASTER LOGS on usa, I get this output:

mysql> purge master logs before date_sub(current_date, interval 7 day);
Query OK, 0 rows affected (0.04 sec)

In the log file, though:

070511 9:26:16 [ERROR] Failed to open log (file './usa-bin.000001', errno 2)

Apparently it is looking through the contents of usa-bin.index, trying to remove a non-existent file, and stopping.

I would propose that it should ignore files that don't exist, and update usa-bin.index to match what's on the filesystem. But I don't know all the ramifications of this. At least it should return an error to the client.

I would also suggest that the manual for PURGE LOGS and the note on expire_logs_days be updated to document this behavior in versions where it's not fixed. It might be helpful to include a pointer to SHOW MASTER LOGS. And on SHOW MASTER LOGS, it would be good to note that if any files appear to have zero size, it is likely the log-bin.index file is out of date with the filesystem and purging will fail.

Actually, it looks like the error in the log file:

070511  9:26:16 [ERROR] Failed to open log (file './usa-bin.000001', errno 2)

Is probably from another time.  My mistake.  But I hand-corrected usa-bin.index and was able to confirm that PURGE MASTER LOGS *does* work, as I said, so I still think I've found the bug.

(There should probably be an error in the mysqld.err file when a file in usa-bin.index can't be found for purging, regardless).

Hi Baron,

Indeed, there are issues with these statements when having binary logs manually removed, and not updating the binlog index file, which is why we always recommend to use PURGE MASTER LOGS or expire_logs_days exclusively.

For this reason, I am actually going to verify this - but verify it as a feature request. 

I think the feature request should be:

Update the binlog index file when PURGE MASTER LOGS or expire_logs_days encounters binary log references for non-existant binary log files.

Would you agree with this?

Certainly.  I would even say if it's not a high priority to change the behavior, just documenting it would be a big help.

This is a reasonable feature/documentation request.

This has lasted over a year now and we have run into it. Having just spent all morning tracking it down, could someone provide a progress update please?

Is there any way to recover this state once you get there?

I've also run into the problem. In my case I had no option but to manually remove some old binary logs (the slow query log grew to such a size that it saturated the partition - so much so that you could not even connect to the mysql server to issue a "flush logs" command obviously removing the slow query log on disk didn't free the space as the running mysql still had the file handle open... only by removing an old log could I free up space to allow me to connect and thus flush the logs and thus free up the slow query file handle/disk space! phew!)

I need to find a way to reset mysql's internal log file state (show master logs) with what is available on disk and preferably have it update mysql-bin.index again (althouhg why it appears to keep an internal state and a separate file is currently beyond me :)

I'm using a 5.0.27 build just now due to bugs in the later ones relating to UDF handling - it's fixed in current 5.0.x releases but I've not had a chance to update this production environment yet.

I suggest re-triaging and re-classifying this as bug, as:

1. It can happen without manual intervention (race condition / fs issue / crash )
2. There is no reporting, neither via errors, warnings or logs
3. Binlog accumulation because of this can lead to system failures (full file system and all the hell afterwards)

This seems to have been fixed in both current 5.0 and 5.1 releases,
instead of failing to purge absent log files a warning is generated
for every missing file now:

| Warning | 1477 | Being purged log ./hp-laptop-bin.000001 was not found |

and the purge continues.

No idea which version this was fixed in though (a quick search for duplicate bugs and for related changelog entries did not come up with any results)

Looks like it was fixed with this commit:

http://lists.mysql.com/commits/44122

ChangeSet@1.2598, 2008-03-17 18:03:47+02:00, aelkin@mysql1000.(none) +5 -0
  Bug #18199  PURGE BINARY LOGS fails silently with missing logs;
  Bug #18453  Warning/error message if there is a mismatch between binary logs and index file