MySQL Bugs: #65979: All SQL Nodes SIGSEGV when writing dupe data to tables where PK spans all cols

Bug #65979	All SQL Nodes SIGSEGV when writing dupe data to tables where PK spans all cols
Submitted:	22 Jul 2012 12:12	Modified:	5 Sep 2014 9:55
Reporter:	Jay Ward	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	5.5.25a-ndb-7.2.7-gpl-log	OS:	Linux (CentOS release 6.3 (Final) 2.6.32-279.1.1.el6.x86_64)
Assigned to:		CPU Architecture:	Any
Tags:	binary log, ndb_binlog_index, next_filename, next_position, SIGSEGV

Description:
When duplicate data is inserted into a table whose primary key spans all columns, be it through REPLACE, INSERT, INSERT IGNORE, INSERT ... ON DUPLICATE KEY, etc, the data is written to the NDB nodes who send the binary log information back to the SQL nodes. They in turn try to write an entry in the mysql.ndb_binlog_index table, however the next_file is NULL and the next_position is 0. When those invalid data bits hit the static int ndb_binlog_index_table__write_rows(THD *thd,                                    ndb_binlog_index_row *row) function, they eventually get to:

      if (ndb_binlog_index->s->fields > NBICOL_NEXT_POS)
      {
        /* Table has next log pos fields, fill them in */
        ndb_binlog_index->field[NBICOL_NEXT_POS]
          ->store(first->next_master_log_pos, true);
        ndb_binlog_index->field[NBICOL_NEXT_FILE]
          ->store(first->next_master_log_file,
                  (uint)strlen(first->next_master_log_file),
                  &my_charset_bin);
      }

Where they segment fault, since you can't get a strlen of NULL. This causes ALL SQL nodes in the cluster to crash, since they are all trying to log the same invalid data. 

I could not duplicate this on tables that did NOT have all columns as part of the primary key nor did I try it with the columns being only marked as unique, though I imagine that would error out too.

How to repeat:
1. Create a table who's primary keys span all columns: 

CREATE TABLE `GroupsToItems` (
  `GroupID` int(11) NOT NULL,
  `ItemID` int(11) NOT NULL,
  PRIMARY KEY (`GroupID`,`ItemID`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1

2. Insert some data into it:

INSERT INTO test.GroupsToItems VALUES (5, 3);

3. Try to insert it again:

REPLACE INTO test.GroupsToItems VALUES (5, 3);

GDB shows it like so:
[root@mysql mysql]# gdb /usr/local/mysql/bin/mysqld-debug
Reading symbols from /usr/local/mysql-cluster-gpl-7.2.7-linux2.6-x86_64/bin/mysqld-debug...done.
(gdb) set args --defaults-file=/etc/mysql/my.cnf --skip-stack-trace
(gdb) run
Starting program: /usr/local/mysql-cluster-gpl-7.2.7-linux2.6-x86_64/bin/mysqld-debug --defaults-file=/etc/mysql/my.cnf --skip-stack-trace
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7fef700 (LWP 20190)]
[Thread 0x7ffff7fef700 (LWP 20190) exited]
[New Thread 0x7ffff6626700 (LWP 20191)]
...
[New Thread 0x7ffff62b0700 (LWP 20227)]
[Switching to Thread 0x7ffff63f5700 (LWP 20218)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000008a7db9 in ndb_binlog_index_table__write_rows (thd=0x7fff04000a20, row=0x7ffff63f4a20)
    at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/sql/ha_ndbcluster_binlog.cc:3834
3834    /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/sql/ha_ndbcluster_binlog.cc: No such file or directory.
        in /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/sql/ha_ndbcluster_binlog.cc
(gdb) info local
epoch = 246282014687236
orig_epoch = 0
orig_server_id = <value optimized out>
tmp_disable_binlog__save_options = 2147748608
error = 112
ndb_binlog_index = 0x7fff04027c00
__PRETTY_FUNCTION__ = "int ndb_binlog_index_table__write_rows(THD*, ndb_binlog_index_row*)"
(gdb) up
#1  0x00000000008afb07 in ndb_binlog_thread_func (arg=<value optimized out>)
    at /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/sql/ha_ndbcluster_binlog.cc:7333
7333    in /pb2/build/sb_0-6353317-1342187808.69/mysql-cluster-gpl-7.2.7/sql/ha_ndbcluster_binlog.cc
(gdb) info local
start = {m_file_name = 0x7fff04021060 "./binlog.000046", m_file_pos = 112}
next = {m_file_name = 0x0, m_file_pos = 0}
pOp = 0x0
trans_row_count = 0
trans_slave_row_count = 0
_row = {epoch = 246282014687236, start_master_log_file = 0x7fff04021060 "./binlog.000046", start_master_log_pos = 112, 
  n_inserts = 0, n_updates = 0, n_deletes = 0, n_schemaops = 0, orig_server_id = 0, orig_epoch = 0, gci = 57342, 
  next_master_log_file = 0x0, next_master_log_pos = 0, next = 0x0}
rows = 0x7ffff63f4a20
trans = {m_state = injector::transaction::START_STATE, m_start_pos = {m_file_name = 0x7fff04021060 "./binlog.000046", 
    m_file_pos = 112}, m_next_pos = {m_file_name = 0x0, m_file_pos = 0}, m_thd = 0x7fff04000a20}
gci = 246282014687236
tot_poll_wait = <value optimized out>
schema_res = <value optimized out>
old_root = 0x7fff04003510
mem_root = {free = 0x7fff0402a9a0, used = 0x0, pre_alloc = 0x0, min_malloc = 32, block_size = 4064, block_num = 5, 
  first_block_usage = 0, error_handler = 0x691099 <sql_alloc_error_handler()>}
schema_gci = 246282014687236
res = 1
root_ptr = 0x7fff04000a48
schema_event_handler = {m_thd = 0x7fff04000a20, m_mem_root = 0x7ffff63f4db0, m_own_nodeid = 15, m_post_epoch = false, 
  m_post_epoch_handle_list = {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x12a6380, last = 0x7ffff63f4cf8, 
      elements = 0}, <No data fields>}, m_post_epoch_ack_list = {<base_list> = {<Sql_alloc> = {<No data fields>}, 
      first = 0x12a6380, last = 0x7ffff63f4d10, elements = 0}, <No data fields>}}
thd = 0x7fff04000a20
i_ndb = 0x7fff04020500
s_ndb = 0x7fff0400fcf0
thd_ndb = 0x7fff0400b0a0
inj = 0x7fff040008c0
incident_id = 0
binlog_thread_state = BCCC_running
do_incident = false
_db_stack_frame_ = {func = 0xd6efd2 "?func", file = 0xd6efd8 "?file", level = 2147483649, prev = 0x0}
__PRETTY_FUNCTION__ = "void* ndb_binlog_thread_func(void*)"
db = ""
buf = '\000' <repeats 63 times>
(gdb) c
Continuing.
[Thread 0x7ffff61ed700 (LWP 20513) exited]
...
[Thread 0x7ffff6332700 (LWP 20221) exited]

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

Suggested fix:
Check the value of first->next_master_log_file and don't let it be NULL. I didn't weed back enough in the code to find where that value is actually coming from, so I can't suggest anything of substance as far as getting a REAL value, but at the very least it shouldn't run strlen on it.

ndb_error_reporter data uploaded to ftp.oracle.com/support/incoming as bug-data-65979.bz2

As a work around, you can drop the next_file and next_position fields from the mysql.ndb_binlog_index table.

Thank you for the report.

I can not repeat described behavior with current version 7.2.7. Please try this version and inform us if problem still exists.

Behavior persists after moving to 5.5.25a-ndb-7.2.7-gpl-log:

key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=201
max_threads=200
thread_count=3
connection_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 172277 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0xeac3af0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 41c3e0f8 thread_stack 0x40000
/usr/local/mysql/bin//mysqld(my_print_stacktrace+0x35)[0x81b5b5]
/usr/local/mysql/bin//mysqld(handle_fatal_signal+0x403)[0x6e8413]
/lib64/libpthread.so.0[0x32f5a0ebe0]
/lib64/libc.so.6(strlen+0x30)[0x32f5278630]
/usr/local/mysql/bin//mysqld[0x84c897]
/usr/local/mysql/bin//mysqld(ndb_binlog_thread_func+0x1af4)[0x854564]
/lib64/libpthread.so.0[0x32f5a0677d]
/lib64/libc.so.6(clone+0x6d)[0x32f52d33ed]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

Here are the binlog specific params in the my.cnf file:

ndb_report_thresh_binlog_mem_usage=20
sync-binlog=1
binlog-stmt-cache-size=2G
binlog-cache-size=2G
binlog-row-event-max-size=5242880
default-storage-engine=ndbcluster
server-id=1093
log-bin=binlog
binlog_format=MIXED

Can you point me in the direction of where first->next_master_log_file is assigned so I can find out what's causing it to be null at that point?

I created a test sql node using RPMs for RedHat EL6 rather than using the binary tarball, and was not able to reproduce. So I will reinstall my nodes using RPMs.

Setting this to "Not a Bug"

If get the same problem at Centos 5.* with version 5.5.27-ndb-7.2.8-gpl-log (64 bit). I tried to solve it by installing rpms, but that did not solve the problem.

I tested this again with 7.2.8 on Centos 6.3 with generic binaries and RPMs.
Both versions crash when doing "replace into" a table with only pk columns.

For more tests I got the source code and patched it, to not crash at:
->store(first->next_master_log_file,
                  (uint)strlen(first->next_master_log_file),
                  &my_charset_bin);
by not filling the columns next_file and next_position in ndb_binlog_index if first->next_master_log_file == NULL. 
This works. During these tests I saw, that in the crash case nothing gets written to the binary log. That makes sense, because nothing has changed eventually. 
But what makes no sense to me is, that a row gets written to ndb_binlog_index when nothing gets written to the binary log. That might explain, why input data is missing.

I re-open the bug. Did you ever experience this in 7.1 (or older)?

No, I have not seen this kind of crash in 7.1.

Thanks for good bug report.

In one of my comments I mentioned a patch, to avoid the crashes. We use a server of version 7.2.8 with this patch in production now, and here it is:

--- ha_ndbcluster_binlog.cc.old     2012-08-22 09:04:01.000000000 -0700
+++ ha_ndbcluster_binlog.cc         2012-11-25 08:52:54.000000000 -0800
@@ -3826,12 +3826,19 @@
       if (ndb_binlog_index->s->fields > NBICOL_NEXT_POS)
       {
         /* Table has next log pos fields, fill them in */
+        if(first->next_master_log_file == NULL)
+        {
+           sql_print_error("NDB Binlog: NULL in next_master_log_file (%llu, %u/%u), position %llu, file %s, ins/upd/del %u %u %u, schemaops %u",epoch,uint(epoch >> 32),uint(epoch),first->start_master_log_pos,first->start_master_log_file,row->n_inserts,row->n_updates,row->n_deletes,row->n_schemaops);
+        }
+        else
+        {
         ndb_binlog_index->field[NBICOL_NEXT_POS]
           ->store(first->next_master_log_pos, true);
         ndb_binlog_index->field[NBICOL_NEXT_FILE]
           ->store(first->next_master_log_file,
                   (uint)strlen(first->next_master_log_file),
                   &my_charset_bin);
+        }
       }
       row= row->next;
     }

Hi,

We believe this bug has been fixed (indirectly) by fixing another bug

If you are using 7.0, please upgrade to 7.0.38 or higher
If you are using 7.1, please upgrade to 7.1.27 or higher
If you are using 7.2, please upgrade to 7.2.11 or higher

Then please test again. If you still see this problem after the upgrade, please reopen this bug. Setting status to "Can't Repeat" for now

Many Thanks,