MySQL Bugs: #55231: COM_BINLOG_DUMP needs to accept 64-bit positions else slaves can break

Bug #55231	COM_BINLOG_DUMP needs to accept 64-bit positions else slaves can break
Submitted:	13 Jul 2010 20:40	Modified:	24 Nov 2010 18:32
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	In progress	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1.48, 5.7.22	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
since binlogs can be >4G in size, COM_BINLOG_DUMP should accept larger than 4G otherwise it is possible slaves will break like this:

100713 22:36:36 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication started in log 'xp64-bin.000128' at position 4327348974
100713 22:36:38 [ERROR] Error reading packet from server: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master ( server_errno=1236)
100713 22:36:38 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master', Error_code: 1236
100713 22:36:38 [Note] Slave I/O thread exiting, read up to log 'xp64-bin.000128', position 32381678

How to repeat:
have some large transactions and try starting a slave in between.
stop and start the slave.

the problem is obvious by looking at com_binlog_dump and request_dump code.

Suggested fix:
use 64 bits for position in com_binlog_dump.

workaround is to reduce size of transactions so they never cause binlog to get larger than 4G.

workaround is to set a lower max_binlog_cache_size, around this size:
(4*1024*1024*1024) - max_binlog_size
maybe some overhead bytes can be deducted too.

Also, please, let's ensure that the max_allowed_packet error message is not used for unrelated problems.. it's just plain confusing.

Also, oh.. the irony:

      /* TODO: The following has to be changed to an 8 byte integer */
      pos = uint4korr(packet);

Also note that maximum allowed value for max_binlog_size is 1G, but that server allows creating larger binlogs to prevent spreading transactions across binlogs.

Note that besides a stop/start, this issue can potentially be created by a network glitch;

100910 08:09:10 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)

It's possible to fix this bug and keep backward and forward compatibility if we use a new server command code instead of COM_BINLOG_DUMP. We need to do this:

 - In the master, add COM_BINLOG_DUMP_64BIT. It is exactly like COM_BINLOG_DUMP except it uses 8 bytes for the position instead of 4.

 - In the master, add the variable "@@GLOBAL.HAVE_64_BIT_BINLOG_POSITIONS" with the constant value 1.

 - In the slave startup code (get_master_version_and_clock() or similar function), send "SHOW VARIABLES LIKE 'HAVE_64_BIT_BINLOG_POSITIONS'" to the master to determine if COM_BINLOG_DUMP_64BIT is supported. If it is supported, send 64 bit position. Otherwise, send 32 bit position.

 - I checked interfaces that we expose. CHANGE MASTER, START SLAVE UNTIL, SHOW SLAVE STATUS, SHOW MASTER STATUS, SHOW BINARY LOGS, and the plugin interface all use 64 bits. master.info and relay-log.info use a string, so we just need to make it write and parse 64 bit numbers. We'll have to ensure that WL#2775 and WL#3656 use 64 bits. So I think all external interfaces are ok. Then, we have to fix the internal interfaces. I found the following in next-mr by grepping 'pos' in rpl_slave.cc, rpl_mi.cc and rpl_rli.cc:

in handle_slave_sql:
      sql_print_information("'SQL_SLAVE_SKIP_COUNTER=%ld' executed at "
        "relay_log_file='%s', relay_log_pos='%ld', master_log_name='%s', "
        "master_log_pos='%ld' and new position at "
        "relay_log_file='%s', relay_log_pos='%ld', master_log_name='%s', "
        "master_log_pos='%ld' ",
        (ulong) saved_skip, saved_log_name, (ulong) saved_log_pos,
        saved_master_log_name, (ulong) saved_master_log_pos,
        rli->get_group_relay_log_name(), (ulong) rli->get_group_relay_log_pos(),
        rli->get_group_master_log_name(), (ulong) rli->get_group_master_log_pos());

in process_io_rotate:
  DBUG_PRINT("info", ("master_log_pos: '%s' %lu",
                      mi->get_master_log_name(), (ulong) mi->get_master_log_pos()));

in queue_binlog_ver_1_event and queue_binlog_ver_3_event:
  DBUG_PRINT("info", ("master_log_pos: %lu", (ulong) mi->get_master_log_pos()));

in queue_event:
    DBUG_PRINT("info", ("master_log_pos: %lu, event originating from %u server, ignored",
                        (ulong) mi->get_master_log_pos(), uint4korr(buf + SERVER_ID_OFFSET)));

      DBUG_PRINT("info", ("master_log_pos: %lu", (ulong) mi->get_master_log_pos()));

in next_event:
      DBUG_PRINT("info", ("assertion skip %lu file pos %lu event relay log pos %lu file %s\n",
        (ulong) rli->slave_skip_counter, (ulong) my_b_tell(cur_log),
        (ulong) rli->get_event_relay_log_pos(),
        rli->get_event_relay_log_name()));

      DBUG_PRINT("info", ("next_event group master %s %lu group relay %s %lu event %s %lu\n",
        rli->get_group_master_log_name(),
        (ulong) rli->get_group_master_log_pos(),
        rli->get_group_relay_log_name(),
        (ulong) rli->get_group_relay_log_pos(),
        rli->get_event_relay_log_name(),
        (ulong) rli->get_event_relay_log_pos()));

      In two places ^^^

in change_master:
  DBUG_PRINT("info", ("master_log_pos: %lu", (ulong) mi->get_master_log_pos()));

       in two places ^^^

in relay_log_info::wait_for_pos:
    DBUG_PRINT("info",("group_master_log_name: '%s'  pos: %lu",
                       group_master_log_name, (ulong) group_master_log_pos));

in Relay_log_info::inc_group_relay_log_pos:
  DBUG_PRINT("info", ("log_pos: %lu  group_master_log_pos: %lu",
                      (long) log_pos, (long) group_master_log_pos));

in master_info::flush_info:
  DBUG_PRINT("enter",("master_pos: %lu", (ulong) master_log_pos));

Consider Bug#55247 as related.

This is more complicated than indicated above. We also have to:

 - Extend the binlog format so that the end_log_pos of the common-header can be 64 bits. This is probably doable, we just change the common-header size to 23 bytes instead of 19. The end_log_pos will be split up in two pieces: first the low 4 bytes, then other header fields, and lastly the high 4 bytes. One possible problem is that the formats of Format_description_log_event and Rotate_log_event are supposed to be fixed. This is probably OK for Format_description_log_event because it can only appear in the beginning of the binlog, so it does not need the extra bits for the end_log_pos. It is unclear how this limitation affects Rotate_log_event - that remains to be investigated.

 - Extend mysqlbinlog --read-from-remote-server so that it first checks the capabilities of the server, then executes either COM_BINLOG_DUMP or COM_BINLOG_DUMP_64BIT accordingly.

For older production versions I assume it's easy to change the error message, so please do that even if a full fix is too risky for a particular version.

Bug #92250 marked as duplicate of this one

Bug#92346 is set as duplicate of this one

Bug #95074 marked as duplicate of this one

https://bugs.mysql.com/bug.php?id=101795 is marked as a duplicate of this bug

Bug #95418 is marked as duplicate of this one

Bug #106536 is marked as duplicate of this one

Sadly, the 13-year bug is still actual. Got it on mysql 8.0.26

2023-12-17T14:20:47.369937Z 133875 [ERROR] [MY-013118] [Repl] Slave I/O for channel '': Unexpected master's heartbeat data: heartbeat is not compatible with local info; the event's data: log_file_name mysql80-bin.172113 log_pos 772469444, Error_code: MY-013118
2023-12-17T14:20:47.370663Z 133875 [ERROR] [MY-013122] [Repl] Slave I/O for channel '': Relay log write failure: could not queue event from master, Error_code: MY-013122

The binlog file size is 11GB (usually it is 1GB). After that, the replica was stuck and couldn't replicate further from the master. And trying to read from binlog from master causes mysql master to crash.

mysql> show binlog events in 'mysql80-bin.172113' from 3663667491;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Event too big
mysql> show binlog events in 'mysql80-bin.172113' from 3663667490;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Event too big
mysql> show binlog events in 'mysql80-bin.172113';
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> \q

Is there any workaround for how I can fix my replica? Maybe I can split big binlog file somehow and apply it to replica? I read this and the duplicate issues mentioned here and haven't found anything to help.

I've tried two things:

1. setting GTID_ONLY = 1 on the replica (I had to upgrade it to 8.0.27) doesn't help, I got the same error "Unexpected master's heartbeat data: heartbeat is not compatible with local info"

2. setting MASTER_HEARTBEAT_PERIOD = 400000 (just dome big value) helps, I've managed to launch my replica and it started to chase source.