MySQL Bugs: #57589: SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum

Bug #57589	SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum
Submitted:	20 Oct 2010 5:26	Modified:	30 Nov 2010 22:25
Reporter:	Nidhi Shrotriya	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	mysql-5.1-rep+2-wl2540	OS:	Any
Assigned to:	Andrei Elkin	CPU Architecture:	Any
Tags:	replication checksum

Description:
This issue is found while testing for WL#2540 with mysql-5.1-rep+2-wl2540 tree.
 
As per the worklog requirements specification in case of NM -> OS we expect
'error 1665:  Slave can not handle replication events with the checksum that master is configured to log.'
and replication won't start.

Though I am able to see the error in slave's server log as below:

101008 11:34:40 [ERROR] Error reading packet from server: Slave can not handle replication events with the checksum that master is configured to log. ( server_errno=1665)
101008 11:34:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000002' at postion 112

but in SHOW SLAVE STATUS it doesn't appear as error and shows as below:

mysql> show slave status;

| Slave_IO_State | Master_Host | Master_User | Master_Port | Connect_Retry | Master_Log_File | Read_Master_Log_Pos | Relay_Log_File | Relay_Log_Pos | Relay_Master_Log_File | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error |

| Queueing master event to the relay log | localhost | root | 3310 | 60 | mysql-bin.000002 | 112 | siv35-relay-bin.002105 | 149 | mysql-bin.000002 | Yes | Yes | | | | | | | 0 | | 0 | 112 | 347 | None | | 0 | No | | | | | | 0 | No | 0 | | 0 | |

1 row in set (0.09 sec)

How to repeat:
Start new master(with WL#2540) and old slave(without WL#2540).
Set --binlog-checksum=CRC32 on master and try to replicate.
Replication shouldn't start and should throw error 1665.

The summary of the issue:

OS can't connect to the checksumming-ON NM and an error goes out to the
error log (which is good), but OS can't stop trying to reconnect
constantly (which is not).

Explanatory notes:

The checksum-ON NM play its role as it was supposed to: rejects to
accept OS connection with a specific to checksum new error.
OS gets back the error which it does not regard as a critical where it
would stop at once.
`connect-retry' logics on the slave does not apply here because it's too late.
It works at connection establishing when the master dump thread does not
have any clue where from the OS slave is going to request dump from
later. So it's too late for `connect-retry' when the slave proceeds with 
requesting the replication restart position.
Since the new error is not in the list of 3 critical of OS, it loops over
to reconnect/request dump/fail endlessly.

Possible solutions:

A. Send back to checksum-unaware OS a critical error instead.
   ER_MASTER_FATAL_ERROR_READING_BINLOG sounds a good compromise to me.

B. Send back the new ER_SLAVE_IS_NOT_CHECKSUM_CAPABLE error 
  first to follow with ER_MASTER_FATAL_ERROR_READING_BINLOG at
  reconnecting.

With receiving ER_MASTER_FATAL_ERROR_READING_BINLOG the IO thread terminates.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/121949

3185 Andrei Elkin	2010-10-26
      Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum
      
      OS can't connect to the checksumming-ON NM and an error goes out to the
      error log (which is good), but OS can't stop trying to reconnect
      constantly (which is not). 
      
      While practically this scenario must be pretty rare it's possible
      to fix the issue a rather nice way.
      Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG 
      critical error accompanied with a verbose clarification mentioning
      the checksum situation like in the following snippet of the error log from the patch testing:
      
      101026 20:15:45 [ERROR] Error reading packet from server: Slave can not handle replication events with the checksum that master is configured to log ( server_errno=1236)
      101026 20:15:45 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log', Error_code: 1236
      
      Master also logs a warning
      
      101026 20:15:33 [Warning] Configured to log replication events with checksum Master rejects sending them to Slave that can not handle it.
     @ sql/rpl_master.cc
        In case of checksumming-ON NM -> OS replication
        Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG 
        critical error accompanied with a verbose clarification.
        Master also logs a warning.
     @ sql/rpl_slave.cc
        Unneeded anymore piece of codes is removed.
     @ sql/share/errmsg-utf8.txt
        Unneeded error is removed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/122066

3185 Andrei Elkin 2010-10-27
Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave can't handle checksum

OS can't connect to the checksumming-ON NM and an error goes out to the
error log (which is good), but OS can't stop trying to reconnect
constantly (which is not).

While practically this scenario must be pretty rare it's possible
to fix the issue a rather nice way.
Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG
critical error accompanied with a verbose clarification mentioning
the checksum situation like in the following snippet of the error log from the patch testing:

101026 20:15:45 [ERROR] Error reading packet from server: Slave can not handle replication events with the checksum that master is configured to log ( server_errno=1236)
101026 20:15:45 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log', Error_code: 1236

Master also logs a warning

101026 20:15:33 [Warning] Configured to log replication events with checksum Master rejects sending them to Slave that can not handle it.

Additional fixes for wl#2540 targeting sysvar suite.
Each system var must have a test file in there. The tests are adeed.
@ mysql-test/suite/sys_vars/r/all_vars.result
results are changed.
@ sql/rpl_master.cc
In case of checksumming-ON NM -> OS replication
Master sends back to checksum-unaware OS the ER_MASTER_FATAL_ERROR_READING_BINLOG
critical error accompanied with a verbose clarification.
Master also logs a warning.
@ sql/rpl_slave.cc
Unneeded anymore piece of codes is removed.
@ sql/share/errmsg-utf8.txt
Unneeded error is removed.
@ sql/sys_vars.cc
In case of binlog is not open, binlog_checksum changes anyway when a new value is set.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122110

3175 Andrei Elkin	2010-10-27
      Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave cant handle checksum
      
      fixing the first patch copy-not-pasted goto err and a text for a new warning on the master

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122112

3186 Andrei Elkin	2010-10-27
      Bug #57589 SHOW SLAVE STATUS doesnt show err:1665 on NM-OS when slave cant handle checksum
      
      fixing the first patch copy-not-pasted`s goto err and the text for a new warning on the master

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122193

3188 Andrei Elkin	2010-10-28
      Bug #57589
      
      error numbers shifted, rpl_checksum simulates OS and its error text is updated

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122233

3189 Andrei Elkin	2010-10-28
      Bug #57589
      
      sysvar suite tests are added.

Pushed to next-mr-wl2540.
Nothing to document, fixes cover wl2540 testing w/o adding any new features.

Pushed into mysql-trunk 5.6.1-m5 (revid:alexander.nozdrin@oracle.com-20101129111021-874if2qsp0i8d5ze) (version source revid:alexander.nozdrin@oracle.com-20101129111021-874if2qsp0i8d5ze) (merge vers: 5.6.1-m5) (pib:23)

No changelog entry needed per comment of [1 Nov 12:07] / Andrei Elkin.

Closed.