MySQL Bugs: #57801: Mysql slave has a chance to stop when master is rotating binary logs

Bug #57801	Mysql slave has a chance to stop when master is rotating binary logs
Submitted:	28 Oct 2010 11:07	Modified:	3 Jun 2012 11:02
Reporter:	Kim Carlsen	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1.51	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	binlog, flush logs, mysqldump, slave stop

Description:
It seems that when flushing logs (and rotating binary log) the slave has a chance to stop, with an error complaining that the binary log is not valid (see below). But Im not sure how and when this happens.

Error on the slave:
101028  2:19:59 [ERROR] Error reading packet from server: Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL ( server_errno=1236)
101028  2:19:59 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL', Error_code: 1236
101028  2:19:59 [Note] Slave I/O thread exiting, read up to log 'cactus-binlog.001309', position 4

This happened right between when our backup of db1 finished and backup of db2 started.

Our backup script, the extra file just contains the password:
/usr/local/bin/mysqldump --defaults-extra-file=/home/backups/my.cnf --routines -F --skip-lock-tables --single-transaction -ubackup db1 | gzip > /home/backups/mysql/db1.$TIMESTAMP.sql.gz

/usr/local/bin/mysqldump --defaults-extra-file=/home/backups/my.cnf --routines -F --skip-lock-tables --single-transaction -ubackup db2 | gzip > /home/backups/mysql/db2.$TIMESTAMP.sql.gz

Timestamp of our backup files
10010157919 Oct 28 02:19 db1.10-28-10-0100.sql.gz
315762511   Oct 28 02:22 db2.10-28-10-0100.sql.gz

How to repeat:
I cant repeat this, but every now and then, the slave stops with this error.

The slave still works, just need a 'start slave' command and it will continue from where it stopped.

Please see bug: http://bugs.mysql.com/bug.php?id=37926 and verify it is your case. Thanks in advance.

To me that seems like something completely different.

There are no crashes involved in this or any other failures. Just that MySQL master report to the slave for a brief time that the binary log is invalid. When you try to open it the second time, there are no problems. I suspect that when you tell mysql to flush the logs and create a new binary log file, there is a short timeframe where it has not written the binary log header and if the slave ask for updates in that timeframe it will see the empty binary log file and fails.

This is purely speculation, as I dont know how the internals of MySQL work.

This weekend the slave failed again. This time the following was added to the error log on the master.

mysql-log.err:
101025  4:30:10 [Note] /usr/local/libexec/mysqld: ready for connections.
Version: '5.1.51-log'  socket: '/tmp/mysql.sock'  port: 3306  FreeBSD port: mysql-server-5.1.51_1
101101  1:00:01 [ERROR] Error writing file '/home/mysqldb/logs/cactus-binlog.index' (errno: 9) 

and as a hexdump.
00000340  35 31 5f 31 0a 07 00 00  01 00 00 00 03 00 00 00  |51_1............|
00000350  07 00 00 01 00 00 00 03  00 00 00 31 30 31 31 30  |...........10110|
00000360  31 20 20 31 3a 30 30 3a  30 31 20 5b 45 52 52 4f  |1  1:00:01 [ERRO|
00000370  52 5d 20 45 72 72 6f 72  20 77 72 69 74 69 6e 67  |R] Error writing|
00000380  20 66 69 6c 65 20 27 2f  68 6f 6d 65 2f 6d 79 73  | file '/home/mys|
00000390  71 6c 64 62 2f 6c 6f 67  73 2f 63 61 63 74 75 73  |qldb/logs/cactus|
000003a0  2d 62 69 6e 6c 6f 67 2e  69 6e 64 65 78 27 20 28  |-binlog.index' (|
000003b0  65 72 72 6e 6f 3a 20 39  29 0a                    |errno: 9).|

Checking the index file, everything looks fine:
cat cactus-binlog.index 
/home/mysqldb/logs/cactus-binlog.001293
/home/mysqldb/logs/cactus-binlog.001294
... snip ...
/home/mysqldb/logs/cactus-binlog.001321
/home/mysqldb/logs/cactus-binlog.001322

System is running FreeBSD 7.3-RELEASE-p2 on amd64 using UFS as file system. The slave works fine after running a 'start slave'. The failure was again triggered by mysqldump running.

This weekend the slave failed again. This time the following was added to the error log on the master.

mysql-log.err:
101025  4:30:10 [Note] /usr/local/libexec/mysqld: ready for connections.
Version: '5.1.51-log'  socket: '/tmp/mysql.sock'  port: 3306  FreeBSD port: mysql-server-5.1.51_1
101101  1:00:01 [ERROR] Error writing file '/home/mysqldb/logs/cactus-binlog.index' (errno: 9) 

and as a hexdump.
00000340  35 31 5f 31 0a 07 00 00  01 00 00 00 03 00 00 00  |51_1............|
00000350  07 00 00 01 00 00 00 03  00 00 00 31 30 31 31 30  |...........10110|
00000360  31 20 20 31 3a 30 30 3a  30 31 20 5b 45 52 52 4f  |1  1:00:01 [ERRO|
00000370  52 5d 20 45 72 72 6f 72  20 77 72 69 74 69 6e 67  |R] Error writing|
00000380  20 66 69 6c 65 20 27 2f  68 6f 6d 65 2f 6d 79 73  | file '/home/mys|
00000390  71 6c 64 62 2f 6c 6f 67  73 2f 63 61 63 74 75 73  |qldb/logs/cactus|
000003a0  2d 62 69 6e 6c 6f 67 2e  69 6e 64 65 78 27 20 28  |-binlog.index' (|
000003b0  65 72 72 6e 6f 3a 20 39  29 0a                    |errno: 9).|

Checking the index file, everything looks fine:
cat cactus-binlog.index 
/home/mysqldb/logs/cactus-binlog.001293
/home/mysqldb/logs/cactus-binlog.001294
... snip ...
/home/mysqldb/logs/cactus-binlog.001321
/home/mysqldb/logs/cactus-binlog.001322

System is running FreeBSD 7.3-RELEASE-p2 on amd64 using UFS as file system. The slave works fine after running a 'start slave'. The failure was again triggered by mysqldump running.

I can add that mysqldump also fails with the following error.

mysqldump: Got error: 2013: Lost connection to MySQL server during query when doing refresh

The command used is:
/usr/local/bin/mysqldump --defaults-extra-file=/home/backups/my.cnf --routines -F --skip-lock-tables --single-transaction -ubackup db1 gzip > /home/backups/mysql/db1.$TIMESTAMP.sql.gz

Thank you for the feedback.

Latest errors look strange. Please send us full error logs from master and slave.

It seems table data is leaked/dumped into the error log, I have noticed that sometimes when it fails, it is impossible to run the mysqldump command until the mysqld process has been restarted.

Mysqldump will continue to report 'lost connection to server', until the server is restarted. Im not sure how to get any closer to the problem. Could this just be an indication of some sort of filesystem corruption?

I will try for a period to stop the slave while we make the backup, to see if this has any influence.

Thank you for the feedback.

Data in the error log file is information of InnoDB transaction which was rolled back on server shutdown. Nothing wrong with it.

regarding to original issue I see following in slave error log:

443 101113 11:07:19 [Note] Slave I/O thread: connected to master 'repl@XXX:3306',replication started in log 'XXX.001380' at position 106
444 101117  2:21:27 [ERROR] Error reading packet from server: Binlog has bad magic number;  It's not a binary log file that can be used by this version of MySQL ( server_errno=1236)
445 101117  2:21:27 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Binlog has bad magic number;  It's not a binary log file that can be used by this version of MyS    QL', Error_code: 1236
446 101117  2:21:27 [Note] Slave I/O thread exiting, read up to log 'XXX.001396', position 4
447 101117  9:15:52 [Note] Slave I/O thread: connected to master 'repl@cactus.fullrate.dk:3306',replication started in log 'XXX.001396' at position 4

So there is a jusmp from #001380 to #001396

Please send us output of `ls -l DATADIR_ON_MASTER` and binlog.index file

The jump from #001380 to #001396 is just because the slave have been running for 4 days, 
and that we produce 4 binary logs each day. I see this as a healthy sign.

9:15:52 [Note] Slave I/O thread: connected to master
'repl@cactus.fullrate.dk:3306',replication started in log 'XXX.001396' at position 4

is from our start slave command after the slave stopped during the nightly backup routine.

By datadir, do you mean where we store our binary logs on the master? or the innodb table data?
krc@cactus /home/mysqldb/logs $ ls -l
total 8306262
-rw-rw----  1 mysql  mysql  1018751791 Nov 12 01:00 cactus-binlog.001369
-rw-rw----  1 mysql  mysql     8199825 Nov 12 02:18 cactus-binlog.001370
-rw-rw----  1 mysql  mysql         153 Nov 12 02:18 cactus-binlog.001371
-rw-rw----  1 mysql  mysql   483472034 Nov 12 09:30 cactus-binlog.001372
-rw-rw----  1 mysql  mysql      517123 Nov 12 09:31 cactus-binlog.001373
-rw-rw----  1 mysql  mysql      192968 Nov 12 09:32 cactus-binlog.001374
-rw-rw----  1 mysql  mysql      515828 Nov 12 09:32 cactus-binlog.001375
-rw-rw----  1 mysql  mysql     1791908 Nov 12 09:34 cactus-binlog.001376
-rw-rw----  1 mysql  mysql     1932424 Nov 12 09:35 cactus-binlog.001377
-rw-rw----  1 mysql  mysql        5951 Nov 12 09:37 cactus-binlog.001378
-rw-rw----  1 mysql  mysql     2582459 Nov 12 09:47 cactus-binlog.001379
-rw-rw----  1 mysql  mysql   619285983 Nov 13 01:00 cactus-binlog.001380
-rw-rw----  1 mysql  mysql     3547191 Nov 13 02:17 cactus-binlog.001381
-rw-rw----  1 mysql  mysql         153 Nov 13 02:17 cactus-binlog.001382
-rw-rw----  1 mysql  mysql   624777108 Nov 14 01:00 cactus-binlog.001383
-rw-rw----  1 mysql  mysql     2250883 Nov 14 02:15 cactus-binlog.001384
-rw-rw----  1 mysql  mysql         153 Nov 14 02:15 cactus-binlog.001385
-rw-rw----  1 mysql  mysql   804871802 Nov 15 01:00 cactus-binlog.001386
-rw-rw----  1 mysql  mysql     4514741 Nov 15 02:15 cactus-binlog.001387
-rw-rw----  1 mysql  mysql         153 Nov 15 02:15 cactus-binlog.001388
-rw-rw----  1 mysql  mysql  1079004380 Nov 15 22:21 cactus-binlog.001389
-rw-rw----  1 mysql  mysql    29046247 Nov 16 01:00 cactus-binlog.001390
-rw-rw----  1 mysql  mysql     7906648 Nov 16 02:19 cactus-binlog.001391
-rw-rw----  1 mysql  mysql         153 Nov 16 02:19 cactus-binlog.001392
-rw-rw----  1 mysql  mysql  1099492878 Nov 16 22:21 cactus-binlog.001393
-rw-rw----  1 mysql  mysql    68668317 Nov 17 01:00 cactus-binlog.001394
-rw-rw----  1 mysql  mysql   103011617 Nov 17 02:21 cactus-binlog.001395
-rw-rw----  1 mysql  mysql         153 Nov 17 02:21 cactus-binlog.001396
-rw-rw----  1 mysql  mysql   585409862 Nov 17 09:52 cactus-binlog.001397
-rw-rw----  1 mysql  mysql   525774822 Nov 18 01:00 cactus-binlog.001398
-rw-rw----  1 mysql  mysql     2802115 Nov 18 02:14 cactus-binlog.001399
-rw-rw----  1 mysql  mysql         153 Nov 18 02:14 cactus-binlog.001400
-rw-rw----  1 mysql  mysql   956961097 Nov 19 01:00 cactus-binlog.001401
-rw-rw----  1 mysql  mysql         153 Nov 19 01:00 cactus-binlog.001402
-rw-rw----  1 mysql  mysql   465464112 Nov 19 09:37 cactus-binlog.001403
-rw-rw----  1 mysql  mysql        1400 Nov 19 01:00 cactus-binlog.index

krc@cactus /home/mysqldb/logs $ sudo cat cactus-binlog.index
/home/mysqldb/logs/cactus-binlog.001369
/home/mysqldb/logs/cactus-binlog.001370
/home/mysqldb/logs/cactus-binlog.001371
/home/mysqldb/logs/cactus-binlog.001372
/home/mysqldb/logs/cactus-binlog.001373
/home/mysqldb/logs/cactus-binlog.001374
/home/mysqldb/logs/cactus-binlog.001375
/home/mysqldb/logs/cactus-binlog.001376
/home/mysqldb/logs/cactus-binlog.001377
/home/mysqldb/logs/cactus-binlog.001378
/home/mysqldb/logs/cactus-binlog.001379
/home/mysqldb/logs/cactus-binlog.001380
/home/mysqldb/logs/cactus-binlog.001381
/home/mysqldb/logs/cactus-binlog.001382
/home/mysqldb/logs/cactus-binlog.001383
/home/mysqldb/logs/cactus-binlog.001384
/home/mysqldb/logs/cactus-binlog.001385
/home/mysqldb/logs/cactus-binlog.001386
/home/mysqldb/logs/cactus-binlog.001387
/home/mysqldb/logs/cactus-binlog.001388
/home/mysqldb/logs/cactus-binlog.001389
/home/mysqldb/logs/cactus-binlog.001390
/home/mysqldb/logs/cactus-binlog.001391
/home/mysqldb/logs/cactus-binlog.001392
/home/mysqldb/logs/cactus-binlog.001393
/home/mysqldb/logs/cactus-binlog.001394
/home/mysqldb/logs/cactus-binlog.001395
/home/mysqldb/logs/cactus-binlog.001396
/home/mysqldb/logs/cactus-binlog.001397
/home/mysqldb/logs/cactus-binlog.001398
/home/mysqldb/logs/cactus-binlog.001399
/home/mysqldb/logs/cactus-binlog.001400
/home/mysqldb/logs/cactus-binlog.001401
/home/mysqldb/logs/cactus-binlog.001402
/home/mysqldb/logs/cactus-binlog.001403

Thank you for the feedback.

Yes, I meant directory where binary logs are located.

Do you still have rpl-1-relay-bin.001181 or another relay log for which slave reports this error? Could you successfully run mysqlbinlog on it? What it outputs?

Looks like problem is with relay log really, because master creates new file first, then adds it to index file thus would not send not complete log to slave.

Unfortunately not, it only saves the relay log for a short time (until last flush-logs?).

krc@rpl /home/mysqldb/logs $ ls -l
total 1732
-rw-rw---- 1 mysql mysql 202 Nov 24 10:30 rpl-1-relay-bin.000106
-rw-rw---- 1 mysql mysql 1742378 Nov 24 10:36 rpl-1-relay-bin.000107
-rw-rw---- 1 mysql mysql 84 Nov 24 10:30 rpl-1-relay-bin.index
krc@rpl /home/mysqldb/logs $ sudo cat rpl-1-relay-bin.index
Password:
/home/mysqldb/logs/rpl-1-relay-bin.000106
/home/mysqldb/logs/rpl-1-relay-bin.000107
krc@rpl /home/mysqldb/logs $

Im not sure that it is the relay log that is broken. If it were, then a start slave would still read the same relay log and stop again with the same error?, what puzzles me is that when we run 'start slave' it just runs without any problem.

I have tried to stop the slave while we do the backup, (stop slave half hour before the backup and start it again half hour after the backup), but we still experience 'mysqldump: Got error: 2013: Lost connection to MySQL server during query when doing refresh'. I believe that if the slave was running, it would fail with a 'bad magic number' error. So Maybe the slave is not involved in this at all.

The table/transaction data in the error log still worries me.
Is it all transaction rollback it should log?, or only information about the last transaction before an unclean shutdown? The data logged does not belong to one transaction and the tables it logs from usually only have 1-2 row affected per transaction, so Im quite sure the data logged does not belong to any particular transaction.

What worries me more is that I have not seen anything else than table/transaction data in the error log since I last posted it. I would expect at least a header with a date and some text before the table/transaction data.
mysqld has not been restarted since 12. nov.

As a last thing, this morning the backup failed again and we started a new one which worked fine. I wanted to rotate the error log, so I moved the log file to a new file and ran 'flush logs' from the mysql prompt and mysql responded with the following error

'
mysql> flush logs;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> flush logs;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 3024196
Current database: database

Query OK, 0 rows affected (0.02 sec)
'

My connection were the only connection that got reset, and the new error_log only contains table/transaction data. Again I would expect a log header with a date and some text before table/transaction data.

I noticed that the behaviour of flush logs has been changed in mysql 5.1.51, to not move the error log to a new file. Maybe the change has introduced some errors?

Im sorry that the information is so sparse and random, but its really hard to boil down when I cant reproduce the problem on command.

Thank you for the feedback.

----<q>----
mysql> flush logs;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> flush logs;
ERROR 2006 (HY000): MySQL server has gone away
----</q>----

Could you please send error log with record about this crash(?)?

The error log before contains similar binary data, dont know if you want to see that one also...

I dont think mysql server crashed as it were only my session that were reset. Every other sessions continued to work.

After changing our my.cnf file we longer experience this problem.

All the lines which now has a # in front have been disabled.

======= my.cnf ========
max_connections = 250
default-storage-engine = innodb

# sort_buffer_size = 2M          # default 2097144
# read_buffer_size = 2M          # default 131072
# read_rnd_buffer_size = 8M      # default 262144

# table_open_cache = 512         # default 64
# thread_cache_size = 32         # default 0
query_cache_size = 512M          # default 0

# max_heap_table_size = 1024M    # default 16777216
# max_allowed_packet=1G          # default 1048576
# tmp_table_size = 512M          # default 16777216

# INNODB SPECIFIC SETTINGS
# innodb_thread_concurrency = 10            # default 8
innodb_buffer_pool_size   = 25G
innodb_lock_wait_timeout  = 120             # default 50

innodb_log_file_size = 250M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1

innodb_file_per_table
innodb_data_home_dir = /home/mysqldb/
innodb_log_group_home_dir = /home/mysqldb/
# innodb_additional_mem_pool_size = 20M     # default 1048576
# innodb_open_files = 512                   # default 300

# BINARY LOG
binlog_format = ROW
log-bin = /home/mysqldb/logs/dev-1-binlog
server-id = 1
sync_binlog = 1
expire_logs_days = 1

Thank you for the feedback.

> After changing our my.cnf file we longer experience this problem.

Did you mean you NO longer experience the problem?

It means that after changing my.cnf to something closer to default, the issue disappeared. We have not changed the config much since, and have not seen this issue either.

My theory is master creates huge temporary table, then flush logs on slave fails when it tries to replicate those events from master binary log file and write events into its own binary log though. I can not confirm this with 5.5

Kim,

thank you for the feedback.

I tested my guess posted in earlier comment with newer version and could not repeat issue. There also was bug #55478 solved in version 5.1.53, which can affect this case.

If you have possibility, please, test new version (5.1.63 or 5.5.23) with options 

max_heap_table_size = 1024M    # default 16777216
max_allowed_packet=1G          # default 1048576
tmp_table_size = 512M          # default 16777216
binlog_format = ROW

and inform us if problem still repeatable.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".