MySQL Bugs: #28975: SQL node (mysqld) crashes in ndb_binlog_thread

Bug #28975	SQL node (mysqld) crashes in ndb_binlog_thread_func.
Submitted:	8 Jun 2007 13:03	Modified:	18 Sep 2007 15:38
Reporter:	Anatoly Pidruchny (Candidate Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.19	OS:	HP/UX (11.11)
Assigned to:		CPU Architecture:	Any

Description:
SQL node crashes after several seconds to several minutes after start. What I did is:
1. Installed a fresh copy of MySQL 5.1.19. We use a source code distribution and build it ourselves.
2. Started NDB management and data nodes with empty data directories. I did not use --initial, but it is in effect an initial start, because the data directories were empty.
3. Restored our database using ndb_restore. The backup was done on release 5.1.17.
4. Started a SQL node. So far so good. I verified and SQL node stays up if a database with NDB tables is not created.
5. Created the database with NDB tables (just executed the "create database realtime;" statement). After that SQL node worked for a minute or two and started to crash on its own. There is no need to execute any queries.

There is not much information printed in the err file. Please see the attachments. Next thing that I did is I built 5.1.19 software with debug info, then i re-produced the crash with the debug version of mysqld and I produced the core file.

Please see the attached files for more details. Please let me know if any more information is required.

How to repeat:
In my case, just start a SQL node. Please see the description

Information gathered by mysqlbug

Attachment: bugreport.txt (text/plain), 2.86 KiB.

Config file for the cluster.

Attachment: config.ini (application/octet-stream, text), 736 bytes.

Error file from the SQL node.

Attachment: sm62gp01.err (application/octet-stream, text), 5.39 KiB.

GDB output with some information from the core file.

Attachment: gdb-out.txt (text/plain), 3.56 KiB.

Hi Anatoly,

can you provide us with the gdb output of "bt full" in addition to the "bt" output you already provided? Or even better: provide us with the actual mysqld binary and the core file you got so that we can try to debug it ourselves?

As far as i can tell signal 10 is SIGBUS on your platform, so there seems to be an unaligned memory access. The failing line is

  4007:  table->field[2]->store("", 0, &my_charset_bin);

so it looks as if the my_charset_bin pointer was mangled somehow, but without seeing the actual method parameters (which are part of the "bt full" output) 
it is hard to tell whether this is really the case ...

Hi Hartmut,

I have uploaded the mysqld binary and the core file into ftp.mysql.com/pub/mysql/upload. The file name is bug28975-core.tar.gz. The output of the "bt full" is:

(gdb) bt full
#0  0xc00555f8 in _lwp_kill+0x10 () from /usr/lib/libpthread.1
No symbol table info available.
#1  0xc004c5c4 in __vp_kill+0x3c () from /usr/lib/libpthread.1
No symbol table info available.
#2  0xc004fdd0 in pthread_kill+0xb8 () from /usr/lib/libpthread.1
No symbol table info available.
#3  0x48b7f4 in write_core (sig=Cannot access memory at address 0x69662063
) at stacktrace.c:229
No locals.
#4  0x2ce3fc in handle_segfault (sig=-1072571096) at mysqld.cc:2237
        curr_time = 1181305526
        tm = {tm_sec = 26, tm_min = 25, tm_hour = 8, tm_mday = 8, tm_mon = 5, tm_year = 107, tm_wday = 5, tm_yday = 158, tm_isdst = 1}
        thd = (class THD *) 0x0
#5  <signal handler called>
No symbol table info available.
#6  0x1ecb18 in $$dyncall_external_20+0 ()
No symbol table info available.
#7  0x41420c in ndb_binlog_thread_func (arg=0x0) at ha_ndbcluster_binlog.cc:4007
        table = (struct st_table *) 0x0
        name = (const struct LEX_STRING &) @0x0: Error accessing memory address 0x0: Invalid argument.

/Anatoly

The core dump didn't prove to be that helpfull as it seems to have the path to the binary hard coded into it on HP/UX, so gdb gives me 

  "/usr/local/mysql/bin/mysqld": can't read symbols: File format not recognized.

As i don't have permissions to /usr/local on that machine i can't move the mysqld to that path so i'm stuck there.

I'm now waiting for our HP/UX 11 machine to become available for testing so that i can try to reproduce this locally.

What looks strange btw. is that the crash only happens in line 4007 due to table being a NULL pointer, although it already has been dereferenced in 4005 and 4006, too.

Can you verify that your source shows the same line numbers?

  4000              /* 
  4001                 Intialize table->record[0] 
  4002              */
  4003              empty_record(table);
  4004
  4005              table->field[0]->store((longlong)::server_id);
  4006              table->field[1]->store((longlong)gci);
  4007              table->field[2]->store("", 0, &my_charset_bin);
  4008              table->field[3]->store((longlong)0);
  4009              table->field[4]->store((longlong)0);
  4010              trans.write_row(::server_id,
  4011                              injector::transaction::table(table, TRUE),
  4012                              &table->s->all_set, table->s->fields,
  4013                              table->record[0]);

I'm also interested in your exact configure settings, can you attach the mysqlbug script from your installation to this bug report? (mysqlbug lists the full configure string and environment variable settings that were active for a build)

Hi Hartmut,
the fact that it crashes only on line 4007 is also strange to me. I think the table pointer was OK when lines 4005 and 4006 were executed and some memory corruption happens when either line 4006 or line 4007 was executed. My source of ha_ndbcluster_binlog.cc has the same line numbers as yours. You can find the full configure line string in the already attached bugreport.txt file. But I am also going to attach my mysqlbug file.

Did you try to run the GDB on a HP-UX 11.11 machine? You do not need to put the mysqld binary that I sent into the /usr/local/mysql/bin directory. It works for me if I have the mysqld binary and the core file in any directory and I run:

gdb ./mysqld core

/Anatoly.

mysqlbug file from my installation

Attachment: mysqlbug (application/octet-stream, text), 10.71 KiB.

Hi,

that this part of the code is hit at all if you are doing _nothing_ seems strange.  This piece of the code should _only_ run if something happens in one of the tables 

insert/update/delete to table, or drop/truncate/alter table

BR,

Tomas

Hi Tomas,

I only restored the database from the backup using ndb_restore (prior to starting mysqld), then started mysqld and created the "realtime" database. I was not doing any other manipulations with data or tables.

Today I tried to run mysqld process under gdb. Somehow, it did not crash. After the startup mysqld process had been taking a lot of CPU for about 20-40 seconds, then it stopped to consume CPU and everything looked normal. After that, it does not crash any more when I restart it. I can try to re-produce the problem starting from the very beginning (fresh install of MySQL 5.1.19, then restore the database, then start mysqld). But as of now, the problem seems to go away.

/Anatoly.

Hi again,

I am trying to test 5.1.20 now and have the same (or similar) problem with mysqld crashes. Well, now it crashes only when I do some manipulations with the data. I thought that it crashed even without any data changes before, but I am not so sure about that now.

I did some research and what I found is this. ndb_desc shows that the table ndb_apply_status has only 2 attributes (server_id and epoch):

$ ndb_desc --database=mysql ndb_apply_status
-- ndb_apply_status --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 266
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
server_id Unsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY
epoch Bigunsigned NOT NULL AT=FIXED ST=MEMORY

-- Indexes -- 
PRIMARY KEY(server_id) - UniqueHashIndex

NDBT_ProgramExit: 0 - OK

I think the data structures in memory of mysqld process also are created only for 2 attributes for this table:

(gdb) p ndb_apply_status_share->table->s->table_name
$5 = {str = 0x42482acf "ndb_apply_status", length = 16}
(gdb) p ndb_apply_status_share->table->s->fieldnames.count
$27 = 2

The source file ha_ndbcluster_binlog.cc changed in release 5.1.20, but please take a look at lines 4035-4039:
            table->field[0]->store((longlong)::server_id);
            table->field[1]->store((longlong)gci);
            table->field[2]->store("", 0, &my_charset_bin);
            table->field[3]->store((longlong)0);
            table->field[4]->store((longlong)0);

As you can see, the code clearly assumes that the table has (at least) 5 attributes, but really there are only 2. So, when the line 4037 is executed, the variable table->field[2] is NULL and I think this is what causes the crash:

(gdb) p ndb_apply_status_share->table->field[2]
$19 = (class Field *) 0x0

Is it possible that my ndb_apply_status table has only two attributes because the database was restored from the backup taken from release 5.1.17? Is restore from 5.1.17 to 5.1.19 or 5.1.20 supported? Should ndb_restore take special care of the ndb_apply_status table to make sure data is restored, but the table schema has the correct number of attributes?

/Anatoly.

I just (re)-read the chapter 15.5.2. Cluster Upgrade and Downgrade Compatibility in documentation: 
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-upgrade-downgrade-compatibility.html

Also I read chapter 15.10.4. Cluster Replication Schema and Tables.

Now, as I understand, ndb_restore dropped my existing initial and empty ndb_apply_status table with 5 attributes and restored the old version of the table with only 2 attributes. I do not think this is correct. The restore from the backup done with release 5.1.17 or earlier to 5.1.18 or later is supported, so, ndb_restore should take special care about the table ndb_apply_status.

For now, I added the three new fields into the ndb_apply_status table manually:

alter table ndb_apply_status add `log_name`    VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL;
alter table ndb_apply_status add `start_pos`   BIGINT(20) UNSIGNED NOT NULL;
alter table ndb_apply_status add `end_pos`     BIGINT(20) UNSIGNED NOT NULL;

So far so good, I do not have any crashes of the mysqld process any more.

/Anatoly.

Hi,

I just looked over this bug report and decided just to close it. We do not suffer from this bug any more and it involves a restore from a backup taken by old 5.1.x beta releases that I do not think anybody uses now and/or is going to use in future. So, there is no value in providing a fix for this particular bug.

/Anatoly.