MySQL Bugs: #65783: crash on all masters if next

Bug #65783	crash on all masters if next_file is NULL internally
Submitted:	2 Jul 2012 11:25	Modified:	14 Aug 2019 19:20
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Replication	Severity:	S2 (Serious)
Version:	cluster 7.2.6	OS:	Linux
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
The crash happens on these lines in sql/ha_ndbcluster_binlog.cc:

3784         ndb_binlog_index->field[NBICOL_NEXT_FILE]
3785           ->store(first->next_master_log_file,
3786                   (uint)strlen(first->next_master_log_file),
3787                   &my_charset_bin);

next_master_log_file is NULL at that point. (The gdb print command below uses "row" instead of "first" because gdb reports "first" as "optimized out".)

(gdb) frame 8
#8  0x0861e586 in ndb_binlog_index_table__write_rows (thd=0xb9e98b8, row=0x796b0034)
    at /export/home/pb2/build/sb_0-5685222-1336648662.25/mysql-cluster-gpl-7.2.6/sql/ha_ndbcluster_binlog.cc:3787
3787	in /export/home/pb2/build/sb_0-5685222-1336648662.25/mysql-cluster-gpl-7.2.6/sql/ha_ndbcluster_binlog.cc
(gdb) print *row
$1 = {epoch = 1380368129196038, start_master_log_file = 0xba6e190 "./webrad1-master-bin.000005", 
  start_master_log_pos = 18166817, n_inserts = 0, n_updates = 0, n_deletes = 0, n_schemaops = 0, 
  orig_server_id = 0, orig_epoch = 0, gci = 321392, next_master_log_file = 0x0, next_master_log_pos = 0, 
  next = 0x0}

How to repeat:
No real idea; it "just happens", with no obviously strange queries near the end of the binlog or general query log.

Suggested fix:
Workaround: drop the next_file and next_position columns from ndb_binlog_index. The code section containing the crashing line is skipped if these fields do not exist (they will probably be restored if mysql_upgrade is run, though).

Code-level workaround: check that next_master_log_file is non-NULL before populating the new columns.
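
A minimal sketch of what such a check could look like, written as a standalone program rather than as a patch against ha_ndbcluster_binlog.cc. BinlogIndexRow, store_field() and store_next_file() are hypothetical stand-ins invented for illustration; only the field names next_master_log_file and next_master_log_pos come from the gdb output above. The real code stores into a Field object, which is replaced here by a std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical, trimmed-down stand-in for the row structure seen in the
// gdb output; the real structure has more members.
struct BinlogIndexRow {
    const char *start_master_log_file;
    const char *next_master_log_file;   // may be NULL, as in the crash
    unsigned long long next_master_log_pos;
};

// Hypothetical stand-in for the server's field store call: it only copies
// the bytes into a std::string. The caller must pass a non-NULL pointer.
static void store_field(std::string &field, const char *data, std::size_t len) {
    assert(data != nullptr);
    field.assign(data, len);
}

// Stores next_master_log_file into the field only if it is non-NULL.
// Returns true if the value was stored, false if the column was skipped.
static bool store_next_file(std::string &field, const BinlogIndexRow &row) {
    if (row.next_master_log_file == nullptr) {
        // Skip the column: calling strlen() on a NULL pointer is undefined
        // behavior and is what crashes in the report.
        return false;
    }
    store_field(field, row.next_master_log_file,
                std::strlen(row.next_master_log_file));
    return true;
}
```

Whether the right reaction to a NULL value is to skip the column, store an empty string, or store SQL NULL is a design decision this sketch does not settle; it only shows that the check has to happen before strlen() is called.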

Real fix: find out how it can be NULL in the first place?

Severity is "S2 (Serious)" because this brings down all replication masters at the same time, so cluster change events that occur while they are down are not binlogged at all.

The actual crash is of course in the strlen() call on the line above, as strlen(NULL) results in a segfault.

Any chance you still have the other stack frames?
Maybe the scenario is logging an empty epoch...

--ndb-log-empty-epochs switched on?

Unfortunately that information is not available anymore :(

Hi Hartmut,

Since the information about the original issue is lost and we can't reproduce this, I'm switching the status to "Can't repeat".

all best
Bogdan