Bug #2922 Crash (signal11) after "load data from master"
Submitted: 23 Feb 2004 4:32 Modified: 17 Mar 2004 1:41
Reporter: Alexander Stagun Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:4.0.18 OS:Linux (Linux x86)
Assigned to: Guilhem Bichot CPU Architecture:Any

[23 Feb 2004 4:32] Alexander Stagun
Description:
Whenever I issue a "load data from master" after a "reset slave" command, mysql will crash (sometimes restart, sometimes hard):

mysql> reset slave;
Query OK, 0 rows affected (0.00 sec)

mysql> load data from master;
ERROR 2013: Lost connection to MySQL server during query

mysqld got signal 11;
[...]
key_buffer_size=268435456
read_buffer_size=1044480
max_used_connections=0
max_connections=450
threads_connected=1
It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 1181940 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0x86ba938
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xbe7feef8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows [resolved]:

0x80ffbac handle_segfault + 396
0x4007a12b _end + 936543267
0x82bbab4 free_root + 196
0x8194801 mc_mysql_free_result(st_mysql_res*) + 49
0x819719b load_master_data(THD*) + 795
0x8110a0c mysql_execute_command() + 14652
0x8111b41 mysql_parse(THD*, char*, unsigned) + 337
0x810c3db dispatch_command(enum_server_command, THD*, char*, unsigned) + 1131
0x810bf25 do_command(THD*) + 101
0x810b829 handle_one_connection + 905
0x400771b0 _end + 936531112
0x40230e4a _end + 938340674

New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://www.mysql.com/doc/en/Using_stack_trace.html and follow instruct
ions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do 
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x86edbe8 = CREATE TABLE `shortid` (
  `id` int(32) unsigned NOT NULL auto_increment,
  `shortid_number` varchar(20) NOT NULL default '',
  `comment` text,
  `dtinserted` datetime NOT NULL default '0000-00-00 00:00:00',
  `dtupdated` timestamp(14) NOT NULL,
  `dtdeleted` datetime default NULL,
  `premium` int(8) NOT NULL default '0',
  `shared` int(8) NOT NULL default '0',
  `country` varchar(8) NOT NULL default 'DE',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `shortid_number` (`shortid_number`)
) TYPE=MyISAM
thd->thread_id=1
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 0
040223 13:18:33  mysqld restarted

Please let me know, if you need more information.

Thanks!

Alex

How to repeat:
This error is reproducible on our site. It does not matter, if I load this specific table from the master or another one. It also does not matter if I drop the db/table I want to load from the master beforehand.
A "load table $table from master" works fine.
[26 Feb 2004 16:23] Guilhem Bichot
Hi!
Inspecting code revealed nothing, but the stack trace is clear: it's a bug. Do you have access to a debugger, like gdb (I see you are using Linux) ? If yes, here's what would be very helpful to me: pick our debug binary (from our website) or compile your own, with --with-debug as a 'configure' option), install it on your slave server; start mysqld of this debug binary. Then attach gdb to the slave's mysqld process, following these steps:
- find the process id of mysqld ("show variables like 'pid_file'" will tell you were the PID file is, then in this file you will find the PID),
- go into the directory where mysqld is,
- type
gdb ./mysqld THE_PID_YOU_READ
- you may have to type 'c' until gdb tells you 'Continuing'
- then open a mysql client and type the commands which produce the crash;
normally gdb should display information to say at which program line the crash occurs. This would already be very helpful information for me.

If this is impossible to do (I would understand it, of course), let me know.

Thanks in advance!
Guilhem
[28 Feb 2004 5:00] Alexander Stagun
This is a gdb session from a 4.0.18 debug build:
It is always the same source line (my_alloc.c:239). As always this happens, when issuing a "load data from master" after a "reset slave" command. It does not matter if I try to load only a single table or a database from the master.

(gdb) c
Continuing.
[New Thread 9226 (LWP 8906)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 9226 (LWP 8906)]
0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239
239         old=next; next= next->next;
Current language:  auto; currently c
(gdb) 
Continuing.

Program exited with code 01.
[28 Feb 2004 6:15] Guilhem Bichot
Excellent!
Can you please copy-paste the output of 'bt' in gdb after the crash, so that I know where this free_root() comes from ?
Thanks a lot!
[28 Feb 2004 6:26] Alexander Stagun
-> gdb backtrace:

[New Thread 9226 (LWP 9721)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 9226 (LWP 9721)]
0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239
239         old=next; next= next->next;
Current language:  auto; currently c
(gdb) bt
#0  0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239
#1  0x081b7088 in mc_mysql_free_result(st_mysql_res*) (result=0x88ad240)
    at mini_client.cc:988
#2  0x081bac20 in load_master_data(THD*) (thd=0x88b2178)
    at repl_failsafe.cc:899
#3  0x081171c5 in mysql_execute_command() () at sql_parse.cc:1552
#4  0x0811aafb in mysql_parse(THD*, char*, unsigned) (thd=0x88b2178, 
    inBuf=0x88c8838 "load data from master", length=21) at sql_parse.cc:2987
#5  0x08115e12 in dispatch_command(enum_server_command, THD*, char*, unsigned)
    (command=COM_QUERY, thd=0x88b2178, 
    packet=0x88c07d9 "load data from master", packet_length=21)
    at sql_parse.cc:1072
#6  0x0811587e in do_command(THD*) (thd=0x88b2178) at sql_parse.cc:943
#7  0x08114e71 in handle_one_connection (arg=0x88b2178) at sql_parse.cc:727
#8  0x400771b0 in pthread_start_thread () from /lib/libpthread.so.0
#9  0x4007722f in pthread_start_thread_event () from /lib/libpthread.so.0
#10 0x40230e4a in thread_start () from /lib/libc.so.6
[28 Feb 2004 7:04] Guilhem Bichot
I'm confused: what do you mean by "It does not matter if I try to load only a single table or a database from the master" ? I thought you were doing LOAD DATA FROM MASTER. What is the command you use to load a single table, and a database?
Thank you.
Btw, in the gdb session, I guess the segfault comes from the fact that 'next' is 0?
[28 Feb 2004 8:54] Alexander Stagun
Sorry for the confusion. What I tried to say was, that it does not matter, if I restrict the tables to be downloaded via "load data from master" with the "replicate-*" options in my.cnf. So I get the same behaviour when using either "replicate-do-table" or "replicate-do-db".
[28 Feb 2004 14:42] Guilhem Bichot
Did the problem rise after an upgrade?
At segfault time, is 'next' equal to 0 ?
[29 Feb 2004 8:51] Guilhem Bichot
Does the problem happen with a MySQL binary built by yourself or with the official ones (in the latter case, which binary exactly?)? In the first case, it would be great to test the bug again with the official binary.
Thank you!
[29 Feb 2004 8:52] Guilhem Bichot
Another question: does the crash happen when you only do LOAD DATA FROM MASTER and not RESET SLAVE before ?
[7 Mar 2004 7:56] Alexander Stagun
Hi Guilhem,

sorry for the delay, but here is some more information:

1. The problem did not rise after an upgrade. It happened with 4.0.12 and after the upgrade to 4.0.18, too.

2. The 'next' pointer is not 0 at crash time:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 9226 (LWP 17394)]
0x08301866 in free_root (root=0x88af3c4, MyFlags=0) at my_alloc.c:239
239         old=next; next= next->next;
Current language:  auto; currently c
(gdb) print next
$1 = (struct st_used_mem *) 0x30736d73
(gdb) ptype next
type = struct st_used_mem {
    struct st_used_mem *next;
    unsigned int left;
    unsigned int size;
} *

3. The problem happens with a self built binary (debug/non-debug) and with an official binary downloaded from mysql.com (MySQL-server-4.0.18-0.i386.rpm - Standard)

4. The crash only happens when I do a "reset slave" before "load data from master".
[7 Mar 2004 11:06] Guilhem Bichot
Hi!
Glad to see you back, Alexander! I'll test with MySQL-server-4.0.18-0.i386.rpm in the next days (tomorrow probably). Meanwhile, could you please copy-paste your master's and slave's my.cnf (or if this is secret information, just email it to guilhem@mysql.com). I am looking for all info to reproduce the problem (which is a true bug, as I already had one report in the past).
Thank you!!
[15 Mar 2004 6:23] Guilhem Bichot
Hi,
I was just able to repeat the segfault, 3 times. Every time, the stack trace was partly different, but the segfault always occured in a free(); so it's probably a memory corruption which has delayed effects. I am now working on hunting this.
Thank you Alexander for your bug report!
[17 Mar 2004 1:41] Guilhem Bichot
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Fixed in ChangeSet@1.1759.2.1, 2004-03-17 10:35:03+01:00, guilhem@mysql.com.
It was writing to already-freed memory. Valgrind (http://valgrind.kde.org/) found it immediately :)
Here is a patch before 4.0.19 is out. If you apply it and it does not fix the bug at your site, please re-open the bug report.
--- 1.40/sql/repl_failsafe.cc   Thu Mar 11 16:23:30 2004
+++ 1.41/sql/repl_failsafe.cc   Wed Mar 17 10:35:00 2004
@@ -874,7 +874,7 @@
 
     cleanup_mysql_results(db_res, cur_table_res - 1, table_res);
 
-    // adjust position in the master
+    // adjust replication coordinates from the master
     if (master_status_res)
     {
       MYSQL_ROW row = mc_mysql_fetch_row(master_status_res);
@@ -887,10 +887,18 @@
       */
       if (row && row[0] && row[1])
       {
+        /*
+          If the slave's master info is not inited, we init it, then we write
+          the new coordinates to it. Must call init_master_info() *before*
+          setting active_mi, because init_master_info() sets active_mi with
+          defaults.
+        */
+        if (init_master_info(active_mi, master_info_file, relay_log_info_file, 0))
+          send_error(&thd->net, ER_MASTER_INFO);
        strmake(active_mi->master_log_name, row[0],
                sizeof(active_mi->master_log_name));
        active_mi->master_log_pos = strtoull(row[1], (char**) 0, 10);
-       // don't hit the magic number
+        // at least in recent versions, the condition below should be false
        if (active_mi->master_log_pos < BIN_LOG_HEADER_SIZE)
          active_mi->master_log_pos = BIN_LOG_HEADER_SIZE;
        active_mi->rli.pending = 0;

Thank you a lot for your bug report!