Bug #65783 crash on all masters if next_file is NULL internally
Submitted: 2 Jul 2012 11:25
Modified: 14 Aug 19:20
Reporter: Hartmut Holzgraefe
Status: Can't repeat
Category: MySQL Cluster: Replication
Severity: S2 (Serious)
Version: cluster 7.2.6
OS: Linux
Assigned to: Bogdan Kecman
CPU Architecture: Any

[2 Jul 2012 11:25] Hartmut Holzgraefe
Description:
Crash happens on this line in sql/ha_ndbcluster_binlog.cc:

3784         ndb_binlog_index->field[NBICOL_NEXT_FILE]
3785           ->store(first->next_master_log_file,
3786                   (uint)strlen(first->next_master_log_file),
3787                   &my_charset_bin);

because next_master_log_file is NULL at that point. (The gdb print command below uses "row" instead of "first" because gdb reports "first" as "optimized out".)

(gdb) frame 8
#8  0x0861e586 in ndb_binlog_index_table__write_rows (thd=0xb9e98b8, row=0x796b0034)
    at /export/home/pb2/build/sb_0-5685222-1336648662.25/mysql-cluster-gpl-7.2.6/sql/ha_ndbcluster_binlog.cc:3787
3787	in /export/home/pb2/build/sb_0-5685222-1336648662.25/mysql-cluster-gpl-7.2.6/sql/ha_ndbcluster_binlog.cc
(gdb) print *row
$1 = {epoch = 1380368129196038, start_master_log_file = 0xba6e190 "./webrad1-master-bin.000005", 
  start_master_log_pos = 18166817, n_inserts = 0, n_updates = 0, n_deletes = 0, n_schemaops = 0, 
  orig_server_id = 0, orig_epoch = 0, gci = 321392, next_master_log_file = 0x0, next_master_log_pos = 0, 
  next = 0x0}

How to repeat:
No real idea; it "just happens" without any obvious or strange queries near the end of the binlog or the general query log.

Suggested fix:
Workaround: drop the next_file and next_position columns from ndb_binlog_index. The code section containing the crashing line is skipped if these columns do not exist (they will probably be restored if mysql_upgrade is run, though).

Code level workaround: check for non-null next_master_log_file content before populating the new columns

Real fix: find out how it can be NULL in the first place?
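
A minimal sketch of the code-level workaround described above, assuming a simplified stand-in for the row structure. BinlogIndexRow, has_next_file and safe_strlen are hypothetical names for illustration only and are not part of the MySQL Cluster source. It only shows the idea of testing the pointer before calling strlen():

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the row structure printed in the gdb session;
// only the two fields relevant to the crash are modelled.
struct BinlogIndexRow {
    const char *next_master_log_file;
    unsigned long long next_master_log_pos;
};

// Hypothetical helper: true when the next_file / next_position columns
// can be populated from this row.
inline bool has_next_file(const BinlogIndexRow &row) {
    return row.next_master_log_file != nullptr;
}

// Hypothetical helper: returns 0 for a null pointer instead of
// passing it to strlen().
inline std::size_t safe_strlen(const char *s) {
    return s != nullptr ? std::strlen(s) : 0;
}
```

In the real code the guard would presumably wrap the store() call on the NBICOL_NEXT_FILE field. Whether a NULL value should be skipped, stored as an empty string, or written as SQL NULL is a decision this sketch does not make.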
[2 Jul 2012 11:26] Hartmut Holzgraefe
Severity is "S2 (Serious)" because the crash brings down all replication masters at the same time, so cluster change events that happen while they are down will not be binlogged at all.
[2 Jul 2012 14:52] Hartmut Holzgraefe
The actual crash is of course in the strlen() call on the line above as strlen(NULL) => segfault ...
[13 Aug 2014 10:08] Frazer Clement
Any chance you still have the other stack frames?
Maybe the scenario is logging an empty epoch...

--ndb-log-empty-epochs switched on?
[30 Oct 2014 12:17] Hartmut Holzgraefe
Unfortunately that information is not available anymore :(
[14 Aug 19:20] Bogdan Kecman
Hi Hartmut,

Since the info about the original issue is lost and we can't reproduce this, I'm switching it to "Can't repeat".

all best
Bogdan