Bug #28975 | SQL node (mysqld) crashes in ndb_binlog_thread_func. | ||
---|---|---|---|
Submitted: | 8 Jun 2007 13:03 | Modified: | 18 Sep 2007 15:38 |
Reporter: | Anatoly Pidruchny (Candidate Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.1.19 | OS: | HP/UX (11.11) |
Assigned to: | CPU Architecture: | Any |
[8 Jun 2007 13:03]
Anatoly Pidruchny
[8 Jun 2007 13:04]
Anatoly Pidruchny
Information gathered by mysqlbug
Attachment: bugreport.txt (text/plain), 2.86 KiB.
[8 Jun 2007 13:04]
Anatoly Pidruchny
Config file for the cluster.
Attachment: config.ini (application/octet-stream, text), 736 bytes.
[8 Jun 2007 13:05]
Anatoly Pidruchny
Error file from the SQL node.
Attachment: sm62gp01.err (application/octet-stream, text), 5.39 KiB.
[8 Jun 2007 13:06]
Anatoly Pidruchny
GDB output with some information from the core file.
Attachment: gdb-out.txt (text/plain), 3.56 KiB.
[20 Jun 2007 20:58]
Hartmut Holzgraefe
Hi Anatoly, can you provide us with the gdb output of "bt full" in addition to the "bt" output you already provided? Or even better: provide us with the actual mysqld binary and the core file you got so that we can try to debug it ourselves? As far as i can tell signal 10 is SIGBUS on your platform, so there seems to be an unaligned memory access. The failing line is 4007: table->field[2]->store("", 0, &my_charset_bin); so it looks as if the my_charset_bin pointer was mangled somehow, but without seeing the actual method parameters (which are part of the "bt full" output) it is hard to tell whether this is really the case ...
[20 Jun 2007 21:23]
Anatoly Pidruchny
Hi Hartmut, I have uploaded the mysqld binary and the core file into ftp.mysql.com/pub/mysql/upload. The file name is bug28975-core.tar.gz. The output of the "bt full" is: (gdb) bt full #0 0xc00555f8 in _lwp_kill+0x10 () from /usr/lib/libpthread.1 No symbol table info available. #1 0xc004c5c4 in __vp_kill+0x3c () from /usr/lib/libpthread.1 No symbol table info available. #2 0xc004fdd0 in pthread_kill+0xb8 () from /usr/lib/libpthread.1 No symbol table info available. #3 0x48b7f4 in write_core (sig=Cannot access memory at address 0x69662063 ) at stacktrace.c:229 No locals. #4 0x2ce3fc in handle_segfault (sig=-1072571096) at mysqld.cc:2237 curr_time = 1181305526 tm = {tm_sec = 26, tm_min = 25, tm_hour = 8, tm_mday = 8, tm_mon = 5, tm_year = 107, tm_wday = 5, tm_yday = 158, tm_isdst = 1} thd = (class THD *) 0x0 #5 <signal handler called> No symbol table info available. #6 0x1ecb18 in $$dyncall_external_20+0 () No symbol table info available. #7 0x41420c in ndb_binlog_thread_func (arg=0x0) at ha_ndbcluster_binlog.cc:4007 table = (struct st_table *) 0x0 name = (const struct LEX_STRING &) @0x0: Error accessing memory address 0x0: Invalid argument. /Anatoly
[25 Jun 2007 12:52]
Hartmut Holzgraefe
The core dump didn't prove to be that helpfull as it seems to have the path to the binary hard coded into it on HP/UX, so gdb gives me "/usr/local/mysql/bin/mysqld": can't read symbols: File format not recognized. As i don't have permissions to /usr/local on that machine i can't move the mysqld to that path so i'm stuck there. I'm now waiting for our HP/UX 11 machine to become available for testing so that i can try to reproduce this locally.
[25 Jun 2007 13:49]
Hartmut Holzgraefe
What looks strange btw. is that the crash only happens in line 4007 due to table being a NULL pointer, although it already has been dereferenced in 4005 and 4006, too. Can you verify that your source shows the same line numbers? 4000 /* 4001 Intialize table->record[0] 4002 */ 4003 empty_record(table); 4004 4005 table->field[0]->store((longlong)::server_id); 4006 table->field[1]->store((longlong)gci); 4007 table->field[2]->store("", 0, &my_charset_bin); 4008 table->field[3]->store((longlong)0); 4009 table->field[4]->store((longlong)0); 4010 trans.write_row(::server_id, 4011 injector::transaction::table(table, TRUE), 4012 &table->s->all_set, table->s->fields, 4013 table->record[0]); I'm also interested in your exact configure settings, can you attach the mysqlbug script from your installation to this bug report? (mysqlbug lists the full configure string and environment variable settings that were active for a build)
[25 Jun 2007 16:01]
Anatoly Pidruchny
Hi Hartmut, the fact that it crashes only on line 4007 is also strange to me. I think the table pointer was OK when lines 4005 and 4006 were executed and some memory corruption happens when either line 4006 or line 4007 was executed. My source of ha_ndbcluster_binlog.cc has the same line numbers as yours. You can find the full configure line string in the already attached bugreport.txt file. But I am also going to attach my mysqlbug file. Did you try to run the GDB on a HP-UX 11.11 machine? You do not need to put the mysqld binary that I sent into the /usr/local/mysql/bin directory. It works for me if I have the mysqld binary and the core file in any directory and I run: gdb ./mysqld core /Anatoly.
[25 Jun 2007 16:02]
Anatoly Pidruchny
mysqlbug file from my installation
Attachment: mysqlbug (application/octet-stream, text), 10.71 KiB.
[2 Jul 2007 15:25]
Tomas Ulin
Hi, that this part of the code is hit at all if you are doing _nothing_ seems strange. This piece of the code should _only_ run if something happens in one of the tables insert/update/delete to table, or drop/truncate/alter table BR, Tomas
[2 Jul 2007 15:52]
Anatoly Pidruchny
Hi Tomas, I only restored the database from the backup using ndb_restore (prior to starting mysqld), then started mysqld and created the "realtime" database. I was not doing any other manipulations with data or tables. Today I tried to run mysqld process under gdb. Somehow, it did not crash. After the startup mysqld process had been taking a lot of CPU for about 20-40 seconds, then it stopped to consume CPU and everything looked normal. After that, it does not crash any more when I restart it. I can try to re-produce the problem starting from the very beginning (fresh install of MySQL 5.1.19, then restore the database, then start mysqld). But as of now, the problem seems to go away. /Anatoly.
[9 Jul 2007 19:19]
Anatoly Pidruchny
Hi again, I am trying to test 5.1.20 now and have the same (or similar) problem with mysqld crashes. Well, now it crashes only when I do some manipulations with the data. I thought that it crashed even without any data changes before, but I am not so sure about that now. I did some research and what I found is this. ndb_desc shows that the table ndb_apply_status has only 2 attributes (server_id and epoch): $ ndb_desc --database=mysql ndb_apply_status -- ndb_apply_status -- Version: 1 Fragment type: 5 K Value: 6 Min load factor: 78 Max load factor: 80 Temporary table: no Number of attributes: 2 Number of primary keys: 1 Length of frm data: 266 Row Checksum: 1 Row GCI: 1 SingleUserMode: 0 ForceVarPart: 1 TableStatus: Retrieved -- Attributes -- server_id Unsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY epoch Bigunsigned NOT NULL AT=FIXED ST=MEMORY -- Indexes -- PRIMARY KEY(server_id) - UniqueHashIndex NDBT_ProgramExit: 0 - OK I think the data structures in memory of mysqld process also are created only for 2 attributes for this table: (gdb) p ndb_apply_status_share->table->s->table_name $5 = {str = 0x42482acf "ndb_apply_status", length = 16} (gdb) p ndb_apply_status_share->table->s->fieldnames.count $27 = 2 The source file ha_ndbcluster_binlog.cc changed in release 5.1.20, but please take a look at lines 4035-4039: table->field[0]->store((longlong)::server_id); table->field[1]->store((longlong)gci); table->field[2]->store("", 0, &my_charset_bin); table->field[3]->store((longlong)0); table->field[4]->store((longlong)0); As you can see, the code clearly assumes that the table has (at least) 5 attributes, but really there are only 2. So, when the line 4037 is executed, the variable table->field[2] is NULL and I think this is what causes the crash: (gdb) p ndb_apply_status_share->table->field[2] $19 = (class Field *) 0x0 Is it possible that my ndb_apply_status table has only two attributes because the database was restored from the backup taken from release 5.1.17? Is restore from 5.1.17 to 5.1.19 or 5.1.20 supported? Should ndb_restore take special care of the ndb_apply_status table to make sure data is restored, but the table schema has the correct number of attributes? /Anatoly.
[9 Jul 2007 20:30]
Anatoly Pidruchny
I just (re)-read the chapter 15.5.2. Cluster Upgrade and Downgrade Compatibility in documentation: http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-upgrade-downgrade-compatibility.html Also I read chapter 15.10.4. Cluster Replication Schema and Tables. Now, as I understand, ndb_restore dropped my existing initial and empty ndb_apply_status table with 5 attributes and restored the old version of the table with only 2 attributes. I do not think this is correct. The restore from the backup done with release 5.1.17 or earlier to 5.1.18 or later is supported, so, ndb_restore should take special care about the table ndb_apply_status. For now, I added the three new fields into the ndb_apply_status table manually: alter table ndb_apply_status add `log_name` VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL; alter table ndb_apply_status add `start_pos` BIGINT(20) UNSIGNED NOT NULL; alter table ndb_apply_status add `end_pos` BIGINT(20) UNSIGNED NOT NULL; So far so good, I do not have any crashes of the mysqld process any more. /Anatoly.
[18 Sep 2007 15:38]
Anatoly Pidruchny
Hi, I just looked over this bug report and decided just to close it. We do not suffer from this bug any more and it involves a restore from a backup taken by old 5.1.x beta releases that I do not think anybody uses now and/or is going to use in future. So, there is no value in providing a fix for this particular bug. /Anatoly.