Bug #57801 | MySQL slave has a chance to stop when master is rotating binary logs | |
---|---|---|---
Submitted: | 28 Oct 2010 11:07 | Modified: | 3 Jun 2012 11:02 |
Reporter: | Kim Carlsen | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.1.51 | OS: | Any |
Assigned to: | Assigned Account | CPU Architecture: | Any |
Tags: | binlog, flush logs, mysqldump, slave stop |
[28 Oct 2010 11:07]
Kim Carlsen
[28 Oct 2010 11:09]
Kim Carlsen
The slave still works; it just needs a 'start slave' command and it will continue from where it stopped.
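A minimal sketch of that recovery step from a shell on the slave (the host name and credentials here are illustrative):

mysql -h slave-host -u root -p -e "SHOW SLAVE STATUS\G"   # check why the I/O thread stopped (Last_IO_Error)
mysql -h slave-host -u root -p -e "START SLAVE;"          # resume from the recorded master position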
[28 Oct 2010 15:20]
MySQL Verification Team
Please see bug http://bugs.mysql.com/bug.php?id=37926 and verify whether it matches your case. Thanks in advance.
[28 Oct 2010 15:35]
Kim Carlsen
To me that seems like something completely different. There are no crashes or other failures involved; the MySQL master just reports to the slave, for a brief time, that the binary log is invalid. When you try to open it a second time, there are no problems. I suspect that when you tell MySQL to flush the logs and create a new binary log file, there is a short timeframe where it has not yet written the binary log header, and if the slave asks for updates in that timeframe it will see the empty binary log file and fail. This is purely speculation, as I don't know how the internals of MySQL work.
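A purely speculative sketch of how that suspected race might be probed, in line with the guess above (host and credentials are illustrative): rotate the master's binary log in a loop while the slave replicates, and watch the slave for error 1236.

while true; do
  mysql -h master-host -u admin -p'secret' -e "FLUSH LOGS;"   # force a binary log rotation on the master
  sleep 1
done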
[1 Nov 2010 8:59]
Kim Carlsen
This weekend the slave failed again. This time the following was added to the error log on the master.

mysql-log.err:
101025 4:30:10 [Note] /usr/local/libexec/mysqld: ready for connections.
Version: '5.1.51-log' socket: '/tmp/mysql.sock' port: 3306 FreeBSD port: mysql-server-5.1.51_1
101101 1:00:01 [ERROR] Error writing file '/home/mysqldb/logs/cactus-binlog.index' (errno: 9)

and as a hexdump:

00000340 35 31 5f 31 0a 07 00 00 01 00 00 00 03 00 00 00 |51_1............|
00000350 07 00 00 01 00 00 00 03 00 00 00 31 30 31 31 30 |...........10110|
00000360 31 20 20 31 3a 30 30 3a 30 31 20 5b 45 52 52 4f |1  1:00:01 [ERRO|
00000370 52 5d 20 45 72 72 6f 72 20 77 72 69 74 69 6e 67 |R] Error writing|
00000380 20 66 69 6c 65 20 27 2f 68 6f 6d 65 2f 6d 79 73 | file '/home/mys|
00000390 71 6c 64 62 2f 6c 6f 67 73 2f 63 61 63 74 75 73 |qldb/logs/cactus|
000003a0 2d 62 69 6e 6c 6f 67 2e 69 6e 64 65 78 27 20 28 |-binlog.index' (|
000003b0 65 72 72 6e 6f 3a 20 39 29 0a                   |errno: 9).|

Checking the index file, everything looks fine:

cat cactus-binlog.index
/home/mysqldb/logs/cactus-binlog.001293
/home/mysqldb/logs/cactus-binlog.001294
... snip ...
/home/mysqldb/logs/cactus-binlog.001321
/home/mysqldb/logs/cactus-binlog.001322

The system is running FreeBSD 7.3-RELEASE-p2 on amd64 using UFS as the file system. The slave works fine after running a 'start slave'. The failure was again triggered by mysqldump running.
[1 Nov 2010 9:28]
Kim Carlsen
I can add that mysqldump also fails with the following error:

mysqldump: Got error: 2013: Lost connection to MySQL server during query when doing refresh

The command used is:

/usr/local/bin/mysqldump --defaults-extra-file=/home/backups/my.cnf --routines -F --skip-lock-tables --single-transaction -ubackup db1 | gzip > /home/backups/mysql/db1.$TIMESTAMP.sql.gz
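For context, -F (--flush-logs) makes mysqldump flush the server's logs before the dump starts, so every backup run rotates the master's binary logs, which matches the failures coinciding with the backup. A rough sketch of the equivalent sequence, reusing the paths and backup user from the command above (the exact ordering mysqldump uses internally may differ):

mysql --defaults-extra-file=/home/backups/my.cnf -ubackup -e "FLUSH LOGS;"   # what -F asks the server to do
/usr/local/bin/mysqldump --defaults-extra-file=/home/backups/my.cnf --routines \
    --skip-lock-tables --single-transaction -ubackup db1 \
    | gzip > /home/backups/mysql/db1.$TIMESTAMP.sql.gz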
[16 Nov 2010 22:20]
Sveta Smirnova
Thank you for the feedback. The latest errors look strange. Please send us the full error logs from the master and the slave.
[17 Nov 2010 9:49]
Kim Carlsen
It seems table data is leaked/dumped into the error log. I have noticed that sometimes when it fails, it is impossible to run the mysqldump command until the mysqld process has been restarted; mysqldump will keep reporting 'lost connection to server' until then. I'm not sure how to get any closer to the problem. Could this just be an indication of some sort of filesystem corruption? I will try stopping the slave while we make the backup for a period, to see whether this has any influence.
[18 Nov 2010 17:49]
Sveta Smirnova
Thank you for the feedback.

The data in the error log file is information about an InnoDB transaction that was rolled back on server shutdown. There is nothing wrong with it.

Regarding the original issue, I see the following in the slave error log:

443 101113 11:07:19 [Note] Slave I/O thread: connected to master 'repl@XXX:3306',replication started in log 'XXX.001380' at position 106
444 101117 2:21:27 [ERROR] Error reading packet from server: Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL ( server_errno=1236)
445 101117 2:21:27 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Binlog has bad magic number; It's not a binary log file that can be used by this version of MySQL', Error_code: 1236
446 101117 2:21:27 [Note] Slave I/O thread exiting, read up to log 'XXX.001396', position 4
447 101117 9:15:52 [Note] Slave I/O thread: connected to master 'repl@cactus.fullrate.dk:3306',replication started in log 'XXX.001396' at position 4

So there is a jump from #001380 to #001396.

Please send us the output of `ls -l DATADIR_ON_MASTER` and the binlog.index file.
[19 Nov 2010 8:59]
Kim Carlsen
The jump from #001380 to #001396 is just because the slave has been running for 4 days and we produce 4 binary logs each day. I see this as a healthy sign.

The line

9:15:52 [Note] Slave I/O thread: connected to master 'repl@cactus.fullrate.dk:3306',replication started in log 'XXX.001396' at position 4

is from our 'start slave' command after the slave stopped during the nightly backup routine.

By datadir, do you mean where we store our binary logs on the master, or the InnoDB table data?

krc@cactus /home/mysqldb/logs $ ls -l
total 8306262
-rw-rw---- 1 mysql mysql 1018751791 Nov 12 01:00 cactus-binlog.001369
-rw-rw---- 1 mysql mysql 8199825 Nov 12 02:18 cactus-binlog.001370
-rw-rw---- 1 mysql mysql 153 Nov 12 02:18 cactus-binlog.001371
-rw-rw---- 1 mysql mysql 483472034 Nov 12 09:30 cactus-binlog.001372
-rw-rw---- 1 mysql mysql 517123 Nov 12 09:31 cactus-binlog.001373
-rw-rw---- 1 mysql mysql 192968 Nov 12 09:32 cactus-binlog.001374
-rw-rw---- 1 mysql mysql 515828 Nov 12 09:32 cactus-binlog.001375
-rw-rw---- 1 mysql mysql 1791908 Nov 12 09:34 cactus-binlog.001376
-rw-rw---- 1 mysql mysql 1932424 Nov 12 09:35 cactus-binlog.001377
-rw-rw---- 1 mysql mysql 5951 Nov 12 09:37 cactus-binlog.001378
-rw-rw---- 1 mysql mysql 2582459 Nov 12 09:47 cactus-binlog.001379
-rw-rw---- 1 mysql mysql 619285983 Nov 13 01:00 cactus-binlog.001380
-rw-rw---- 1 mysql mysql 3547191 Nov 13 02:17 cactus-binlog.001381
-rw-rw---- 1 mysql mysql 153 Nov 13 02:17 cactus-binlog.001382
-rw-rw---- 1 mysql mysql 624777108 Nov 14 01:00 cactus-binlog.001383
-rw-rw---- 1 mysql mysql 2250883 Nov 14 02:15 cactus-binlog.001384
-rw-rw---- 1 mysql mysql 153 Nov 14 02:15 cactus-binlog.001385
-rw-rw---- 1 mysql mysql 804871802 Nov 15 01:00 cactus-binlog.001386
-rw-rw---- 1 mysql mysql 4514741 Nov 15 02:15 cactus-binlog.001387
-rw-rw---- 1 mysql mysql 153 Nov 15 02:15 cactus-binlog.001388
-rw-rw---- 1 mysql mysql 1079004380 Nov 15 22:21 cactus-binlog.001389
-rw-rw---- 1 mysql mysql 29046247 Nov 16 01:00 cactus-binlog.001390
-rw-rw---- 1 mysql mysql 7906648 Nov 16 02:19 cactus-binlog.001391
-rw-rw---- 1 mysql mysql 153 Nov 16 02:19 cactus-binlog.001392
-rw-rw---- 1 mysql mysql 1099492878 Nov 16 22:21 cactus-binlog.001393
-rw-rw---- 1 mysql mysql 68668317 Nov 17 01:00 cactus-binlog.001394
-rw-rw---- 1 mysql mysql 103011617 Nov 17 02:21 cactus-binlog.001395
-rw-rw---- 1 mysql mysql 153 Nov 17 02:21 cactus-binlog.001396
-rw-rw---- 1 mysql mysql 585409862 Nov 17 09:52 cactus-binlog.001397
-rw-rw---- 1 mysql mysql 525774822 Nov 18 01:00 cactus-binlog.001398
-rw-rw---- 1 mysql mysql 2802115 Nov 18 02:14 cactus-binlog.001399
-rw-rw---- 1 mysql mysql 153 Nov 18 02:14 cactus-binlog.001400
-rw-rw---- 1 mysql mysql 956961097 Nov 19 01:00 cactus-binlog.001401
-rw-rw---- 1 mysql mysql 153 Nov 19 01:00 cactus-binlog.001402
-rw-rw---- 1 mysql mysql 465464112 Nov 19 09:37 cactus-binlog.001403
-rw-rw---- 1 mysql mysql 1400 Nov 19 01:00 cactus-binlog.index

krc@cactus /home/mysqldb/logs $ sudo cat cactus-binlog.index
/home/mysqldb/logs/cactus-binlog.001369
/home/mysqldb/logs/cactus-binlog.001370
/home/mysqldb/logs/cactus-binlog.001371
/home/mysqldb/logs/cactus-binlog.001372
/home/mysqldb/logs/cactus-binlog.001373
/home/mysqldb/logs/cactus-binlog.001374
/home/mysqldb/logs/cactus-binlog.001375
/home/mysqldb/logs/cactus-binlog.001376
/home/mysqldb/logs/cactus-binlog.001377
/home/mysqldb/logs/cactus-binlog.001378
/home/mysqldb/logs/cactus-binlog.001379
/home/mysqldb/logs/cactus-binlog.001380
/home/mysqldb/logs/cactus-binlog.001381
/home/mysqldb/logs/cactus-binlog.001382
/home/mysqldb/logs/cactus-binlog.001383
/home/mysqldb/logs/cactus-binlog.001384
/home/mysqldb/logs/cactus-binlog.001385
/home/mysqldb/logs/cactus-binlog.001386
/home/mysqldb/logs/cactus-binlog.001387
/home/mysqldb/logs/cactus-binlog.001388
/home/mysqldb/logs/cactus-binlog.001389
/home/mysqldb/logs/cactus-binlog.001390
/home/mysqldb/logs/cactus-binlog.001391
/home/mysqldb/logs/cactus-binlog.001392
/home/mysqldb/logs/cactus-binlog.001393
/home/mysqldb/logs/cactus-binlog.001394
/home/mysqldb/logs/cactus-binlog.001395
/home/mysqldb/logs/cactus-binlog.001396
/home/mysqldb/logs/cactus-binlog.001397
/home/mysqldb/logs/cactus-binlog.001398
/home/mysqldb/logs/cactus-binlog.001399
/home/mysqldb/logs/cactus-binlog.001400
/home/mysqldb/logs/cactus-binlog.001401
/home/mysqldb/logs/cactus-binlog.001402
/home/mysqldb/logs/cactus-binlog.001403
[19 Nov 2010 17:35]
Sveta Smirnova
Thank you for the feedback. Yes, I meant the directory where the binary logs are located. Do you still have rpl-1-relay-bin.001181 or another relay log for which the slave reports this error? Can you run mysqlbinlog on it successfully, and what does it output? It looks like the problem is really with the relay log, because the master creates the new file first and only then adds it to the index file, so it would not send an incomplete log to the slave.
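A sketch of the requested check, using the relay log name mentioned above (the path is an assumption; adjust it to wherever the relay logs live on the slave):

mysqlbinlog /home/mysqldb/logs/rpl-1-relay-bin.001181 > /dev/null && echo "relay log parses cleanly"
# a usable binary/relay log starts with the 4-byte magic number 0xfe 'b' 'i' 'n':
head -c 4 /home/mysqldb/logs/rpl-1-relay-bin.001181 | hexdump -C    # expect: fe 62 69 6e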
[24 Nov 2010 10:17]
Kim Carlsen
Unfortunately not; it only keeps the relay logs for a short time (until the last flush-logs?).

krc@rpl /home/mysqldb/logs $ ls -l
total 1732
-rw-rw---- 1 mysql mysql 202 Nov 24 10:30 rpl-1-relay-bin.000106
-rw-rw---- 1 mysql mysql 1742378 Nov 24 10:36 rpl-1-relay-bin.000107
-rw-rw---- 1 mysql mysql 84 Nov 24 10:30 rpl-1-relay-bin.index
krc@rpl /home/mysqldb/logs $ sudo cat rpl-1-relay-bin.index
Password:
/home/mysqldb/logs/rpl-1-relay-bin.000106
/home/mysqldb/logs/rpl-1-relay-bin.000107
krc@rpl /home/mysqldb/logs $

I'm not sure that it is the relay log that is broken. If it were, then a 'start slave' would still read the same relay log and stop again with the same error. What puzzles me is that when we run 'start slave' it just runs without any problem.

I have tried to stop the slave while we do the backup (stop the slave half an hour before the backup and start it again half an hour after the backup), but we still experience 'mysqldump: Got error: 2013: Lost connection to MySQL server during query when doing refresh'. I believe that if the slave had been running, it would have failed with a 'bad magic number' error. So maybe the slave is not involved in this at all.

The table/transaction data in the error log still worries me. Should it log all transaction rollbacks, or only information about the last transaction before an unclean shutdown? The data logged does not belong to one transaction, and the tables it logs from usually have only 1-2 rows affected per transaction, so I'm quite sure the data logged does not belong to any particular transaction. What worries me more is that I have not seen anything other than table/transaction data in the error log since I last posted it. I would expect at least a header with a date and some text before the table/transaction data. mysqld has not been restarted since 12 Nov.

As a last thing, this morning the backup failed again and we started a new one, which worked fine. I wanted to rotate the error log, so I moved the log file to a new file and ran 'flush logs' from the mysql prompt, and MySQL responded with the following error:

mysql> flush logs;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> flush logs;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 3024196
Current database: database
Query OK, 0 rows affected (0.02 sec)

My connection was the only connection that got reset, and the new error log only contains table/transaction data. Again, I would expect a log header with a date and some text before the table/transaction data. I noticed that the behaviour of 'flush logs' was changed in MySQL 5.1.51 so that it no longer moves the error log to a new file. Maybe that change introduced some errors?

I'm sorry that the information is so sparse and random, but it's really hard to boil it down when I can't reproduce the problem on demand.
[24 Nov 2010 11:51]
Sveta Smirnova
Thank you for the feedback.

----<q>----
mysql> flush logs;
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> flush logs;
ERROR 2006 (HY000): MySQL server has gone away
----</q>----

Could you please send the error log containing the record of this (possible) crash?
[24 Nov 2010 12:33]
Kim Carlsen
The previous error log contains similar binary data; I don't know if you want to see that one as well... I don't think the MySQL server crashed, as it was only my session that was reset. Every other session continued to work.
[16 Mar 2011 13:13]
Kim Carlsen
After changing our my.cnf file we longer experience this problem. All the lines which now have a # in front have been disabled.

======= my.cnf ========
max_connections = 250
default-storage-engine = innodb
# sort_buffer_size = 2M # default 2097144
# read_buffer_size = 2M # default 131072
# read_rnd_buffer_size = 8M # default 262144
# table_open_cache = 512 # default 64
# thread_cache_size = 32 # default 0
query_cache_size = 512M # default 0
# max_heap_table_size = 1024M # default 16777216
# max_allowed_packet=1G # default 1048576
# tmp_table_size = 512M # default 16777216

# INNODB SPECIFIC SETTINGS
# innodb_thread_concurrency = 10 # default 8
innodb_buffer_pool_size = 25G
innodb_lock_wait_timeout = 120 # default 50
innodb_log_file_size = 250M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table
innodb_data_home_dir = /home/mysqldb/
innodb_log_group_home_dir = /home/mysqldb/
# innodb_additional_mem_pool_size = 20M # default 1048576
# innodb_open_files = 512 # default 300

# BINARY LOG
binlog_format = ROW
log-bin = /home/mysqldb/logs/dev-1-binlog
server-id = 1
sync_binlog = 1
expire_logs_days = 1
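A quick sketch of how to confirm which of these settings the running server actually uses (run the mysql client against the server in question):

mysql -u root -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('query_cache_size','max_heap_table_size','max_allowed_packet',
   'tmp_table_size','sync_binlog','binlog_format','expire_logs_days');"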
[26 Apr 2012 19:57]
Sveta Smirnova
Thank you for the feedback.

> After changing our my.cnf file we longer experience this problem.

Did you mean you NO longer experience the problem?
[27 Apr 2012 6:03]
Kim Carlsen
It means that after changing my.cnf to something closer to the defaults, the issue disappeared. We have not changed the config much since then, and we have not seen this issue either.
[1 May 2012 12:34]
Sveta Smirnova
My theory is that the master creates a huge temporary table, and then 'flush logs' on the slave fails when the slave tries to replicate those events from the master's binary log file and write them into its own binary log. However, I cannot confirm this with 5.5.
[3 May 2012 11:02]
Sveta Smirnova
Kim, thank you for the feedback.

I tested the guess posted in my earlier comment with a newer version and could not repeat the issue. There was also bug #55478, fixed in version 5.1.53, which can affect this case.

If possible, please test a newer version (5.1.63 or 5.5.23) with the options

max_heap_table_size = 1024M # default 16777216
max_allowed_packet=1G # default 1048576
tmp_table_size = 512M # default 16777216
binlog_format = ROW

and let us know whether the problem is still repeatable.
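A sketch of how the server version and the requested settings could be confirmed on the test instance before re-running the backup (the host name is illustrative):

mysql -h test-host -u root -p -e "SELECT VERSION();
  SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('max_heap_table_size','max_allowed_packet','tmp_table_size','binlog_format');"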
[4 Jun 2012 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".