Bug #57589 | SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum | ||
---|---|---|---|
Submitted: | 20 Oct 2010 5:26 | Modified: | 30 Nov 2010 22:25 |
Reporter: | Nidhi Shrotriya | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | mysql-5.1-rep+2-wl2540 | OS: | Any |
Assigned to: | Andrei Elkin | CPU Architecture: | Any |
Tags: | replication checksum |
[20 Oct 2010 5:26]
Nidhi Shrotriya
[20 Oct 2010 8:16]
Andrei Elkin
Summary of the issue: the OS (old slave) can't connect to the checksumming-ON NM (new master), and an error goes out to the error log (which is good), but the OS never stops trying to reconnect (which is not).

Explanatory notes: The checksum-ON NM plays its role as intended: it refuses the OS connection with a new, checksum-specific error. The OS gets the error back but does not regard it as critical, so it does not stop at once. The `connect-retry' logic on the slave does not apply here because it is too late. That logic works while the connection is being established, when the master dump thread does not yet have any clue which position the OS is going to request the dump from. By the time the slave proceeds to request the replication restart position, it is too late for `connect-retry'. Since the new error is not in the OS's list of 3 critical errors, the OS loops endlessly through reconnect / request dump / fail.

Possible solutions:

A. Send a critical error back to the checksum-unaware OS instead. ER_MASTER_FATAL_ERROR_READING_BINLOG sounds like a good compromise to me.

B. Send back the new ER_SLAVE_IS_NOT_CHECKSUM_CAPABLE error first, then follow with ER_MASTER_FATAL_ERROR_READING_BINLOG at reconnect. On receiving ER_MASTER_FATAL_ERROR_READING_BINLOG the IO thread terminates.
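The loop described above can be illustrated with a small simulation. This is only a sketch, not the server code: `simulate_io_thread`, `CRITICAL_ERRORS`, and `max_attempts` are hypothetical names, the critical-error set contains only the one error named in this report, and treating 1665 as the number of the new checksum error is an assumption based on the bug title.

```python
# Error numbers taken from this report: 1236 appears in the quoted error log,
# 1665 in the bug title. Mapping 1665 to the new checksum error is an assumption.
ER_MASTER_FATAL_ERROR_READING_BINLOG = 1236
ER_SLAVE_IS_NOT_CHECKSUM_CAPABLE = 1665

# Hypothetical stand-in for the slave's list of critical errors; the real list
# has three entries, only one of which is named in this report.
CRITICAL_ERRORS = {ER_MASTER_FATAL_ERROR_READING_BINLOG}


def simulate_io_thread(master_errno, max_attempts=5):
    """Simulate dump requests that always fail with master_errno.

    Returns (attempts_made, stopped). A real slave has no max_attempts cap;
    it exists here only so the simulation terminates.
    """
    for attempt in range(1, max_attempts + 1):
        # The connection itself succeeds; the dump request then fails.
        if master_errno in CRITICAL_ERRORS:
            return attempt, True  # critical error: the IO thread terminates
        # Non-critical error: fall through and reconnect.
    return max_attempts, False


if __name__ == "__main__":
    print(simulate_io_thread(ER_SLAVE_IS_NOT_CHECKSUM_CAPABLE))
    print(simulate_io_thread(ER_MASTER_FATAL_ERROR_READING_BINLOG))
```

With the non-critical error the simulated thread uses up every attempt without stopping, while the critical error stops it on the first attempt, which is the behaviour both proposed solutions aim for.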
[26 Oct 2010 17:28]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/121949

3185 Andrei Elkin 2010-10-26
Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum

OS can't connect to the checksumming-ON NM and an error goes out to the error log (which is good), but OS can't stop trying to reconnect constantly (which is not). While in practice this scenario must be pretty rare, it's possible to fix the issue in a rather nice way.

Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG critical error accompanied by a verbose clarification mentioning the checksum situation, as in the following snippet of the error log from the patch testing:

101026 20:15:45 [ERROR] Error reading packet from server: Slave can not handle replication events with the checksum that master is configured to log ( server_errno=1236)
101026 20:15:45 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log', Error_code: 1236

Master also logs a warning:

101026 20:15:33 [Warning] Configured to log replication events with checksum Master rejects sending them to Slave that can not handle it.

@ sql/rpl_master.cc
In case of checksumming-ON NM -> OS replication, Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG critical error accompanied by a verbose clarification. Master also logs a warning.
@ sql/rpl_slave.cc
A piece of code that is no longer needed is removed.
@ sql/share/errmsg-utf8.txt
An error that is no longer needed is removed.
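The master-side behaviour this commit message describes can be sketched as follows. It is a hedged illustration only, not the code in sql/rpl_master.cc: `DumpReply`, `handle_dump_request`, and the warning text are hypothetical, while the error number 1236 and the error message are taken from the log snippet quoted in this report.

```python
from dataclasses import dataclass

ER_MASTER_FATAL_ERROR_READING_BINLOG = 1236  # number shown in the quoted error log
CHECKSUM_MESSAGE = ("Slave can not handle replication events with the "
                    "checksum that master is configured to log")


@dataclass
class DumpReply:
    """Hypothetical container for what the master answers to a dump request."""
    ok: bool
    errno: int = 0
    message: str = ""


def handle_dump_request(binlog_checksum_on, slave_checksum_aware, master_log):
    """Decide how the master answers a slave's dump request.

    master_log is a plain list standing in for the master's error log.
    """
    if binlog_checksum_on and not slave_checksum_aware:
        # Hypothetical wording; the real warning text is in the patch.
        master_log.append("[Warning] rejecting checksum-unaware slave")
        # Reply with the critical error so the slave IO thread stops.
        return DumpReply(False, ER_MASTER_FATAL_ERROR_READING_BINLOG,
                         CHECKSUM_MESSAGE)
    return DumpReply(True)


if __name__ == "__main__":
    log = []
    print(handle_dump_request(True, False, log))
    print(log)
```

Reusing an error the old slave already treats as fatal means nothing has to change on the slave side, which is presumably why the separate checksum error could be removed from errmsg-utf8.txt.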
[27 Oct 2010 10:24]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/122066

3185 Andrei Elkin 2010-10-27
Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum

OS can't connect to the checksumming-ON NM and an error goes out to the error log (which is good), but OS can't stop trying to reconnect constantly (which is not). While in practice this scenario must be pretty rare, it's possible to fix the issue in a rather nice way.

Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG critical error accompanied by a verbose clarification mentioning the checksum situation, as in the following snippet of the error log from the patch testing:

101026 20:15:45 [ERROR] Error reading packet from server: Slave can not handle replication events with the checksum that master is configured to log ( server_errno=1236)
101026 20:15:45 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log', Error_code: 1236

Master also logs a warning:

101026 20:15:33 [Warning] Configured to log replication events with checksum Master rejects sending them to Slave that can not handle it.

Additional fixes for wl#2540 targeting the sysvar suite. Each system var must have a test file in there. The tests are added.

@ mysql-test/suite/sys_vars/r/all_vars.result
Results are changed.
@ sql/rpl_master.cc
In case of checksumming-ON NM -> OS replication, Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG critical error accompanied by a verbose clarification. Master also logs a warning.
@ sql/rpl_slave.cc
A piece of code that is no longer needed is removed.
@ sql/share/errmsg-utf8.txt
An error that is no longer needed is removed.
@ sql/sys_vars.cc
If the binlog is not open, binlog_checksum still changes when a new value is set.
[27 Oct 2010 14:46]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/122110

3175 Andrei Elkin 2010-10-27
Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave cant handle checksum

fixing the first patch copy-not-pasted goto err and a text for a new warning on the master
[27 Oct 2010 14:53]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/122112

3186 Andrei Elkin 2010-10-27
Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave cant handle checksum

fixing the first patch copy-not-pasted`s goto err and the text for a new warning on the master
[28 Oct 2010 10:48]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/122193

3188 Andrei Elkin 2010-10-28
Bug #57589

error numbers shifted, rpl_checksum simulates OS and its error text is updated
[28 Oct 2010 17:10]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from:
http://lists.mysql.com/commits/122233

3189 Andrei Elkin 2010-10-28
Bug #57589

sysvar suite tests are added.
[1 Nov 2010 11:07]
Andrei Elkin
Pushed to next-mr-wl2540. Nothing to document: the fixes cover wl2540 testing without adding any new features.
[29 Nov 2010 11:11]
Bugs System
Pushed into mysql-trunk 5.6.1-m5 (revid:alexander.nozdrin@oracle.com-20101129111021-874if2qsp0i8d5ze) (version source revid:alexander.nozdrin@oracle.com-20101129111021-874if2qsp0i8d5ze) (merge vers: 5.6.1-m5) (pib:23)
[30 Nov 2010 22:25]
Jon Stephens
No changelog entry needed per comment of [1 Nov 12:07] / Andrei Elkin. Closed.