Bug #2922 | Crash (signal11) after "load data from master" | ||
---|---|---|---|
Submitted: | 23 Feb 2004 4:32 | Modified: | 17 Mar 2004 1:41 |
Reporter: | Alexander Stagun | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.0.18 | OS: | Linux (Linux x86) |
Assigned to: | Guilhem Bichot | CPU Architecture: | Any |
[23 Feb 2004 4:32]
Alexander Stagun
[26 Feb 2004 16:23]
Guilhem Bichot
Hi! Inspecting code revealed nothing, but the stack trace is clear: it's a bug. Do you have access to a debugger, like gdb (I see you are using Linux) ? If yes, here's what would be very helpful to me: pick our debug binary (from our website) or compile your own, with --with-debug as a 'configure' option), install it on your slave server; start mysqld of this debug binary. Then attach gdb to the slave's mysqld process, following these steps: - find the process id of mysqld ("show variables like 'pid_file'" will tell you were the PID file is, then in this file you will find the PID), - go into the directory where mysqld is, - type gdb ./mysqld THE_PID_YOU_READ - you may have to type 'c' until gdb tells you 'Continuing' - then open a mysql client and type the commands which produce the crash; normally gdb should display information to say at which program line the crash occurs. This would already be very helpful information for me. If this is impossible to do (I would understand it, of course), let me know. Thanks in advance! Guilhem
[28 Feb 2004 5:00]
Alexander Stagun
This is a gdb session from a 4.0.18 debug build: It is always the same source line (my_alloc.c:239). As always this happens, when issuing a "load data from master" after a "reset slave" command. It does not matter if I try to load only a single table or a database from the master. (gdb) c Continuing. [New Thread 9226 (LWP 8906)] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 9226 (LWP 8906)] 0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239 239 old=next; next= next->next; Current language: auto; currently c (gdb) Continuing. Program exited with code 01.
[28 Feb 2004 6:15]
Guilhem Bichot
Excellent! Can you please copy-paste the output of 'bt' in gdb after the crash, so that I know where this free_root() comes from ? Thanks a lot!
[28 Feb 2004 6:26]
Alexander Stagun
-> gdb backtrace: [New Thread 9226 (LWP 9721)] Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 9226 (LWP 9721)] 0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239 239 old=next; next= next->next; Current language: auto; currently c (gdb) bt #0 0x08301866 in free_root (root=0x88ad25c, MyFlags=0) at my_alloc.c:239 #1 0x081b7088 in mc_mysql_free_result(st_mysql_res*) (result=0x88ad240) at mini_client.cc:988 #2 0x081bac20 in load_master_data(THD*) (thd=0x88b2178) at repl_failsafe.cc:899 #3 0x081171c5 in mysql_execute_command() () at sql_parse.cc:1552 #4 0x0811aafb in mysql_parse(THD*, char*, unsigned) (thd=0x88b2178, inBuf=0x88c8838 "load data from master", length=21) at sql_parse.cc:2987 #5 0x08115e12 in dispatch_command(enum_server_command, THD*, char*, unsigned) (command=COM_QUERY, thd=0x88b2178, packet=0x88c07d9 "load data from master", packet_length=21) at sql_parse.cc:1072 #6 0x0811587e in do_command(THD*) (thd=0x88b2178) at sql_parse.cc:943 #7 0x08114e71 in handle_one_connection (arg=0x88b2178) at sql_parse.cc:727 #8 0x400771b0 in pthread_start_thread () from /lib/libpthread.so.0 #9 0x4007722f in pthread_start_thread_event () from /lib/libpthread.so.0 #10 0x40230e4a in thread_start () from /lib/libc.so.6
[28 Feb 2004 7:04]
Guilhem Bichot
I'm confused: what do you mean by "It does not matter if I try to load only a single table or a database from the master" ? I thought you were doing LOAD DATA FROM MASTER. What is the command you use to load a single table, and a database? Thank you. Btw, in the gdb session, I guess the segfault comes from the fact that 'next' is 0?
[28 Feb 2004 8:54]
Alexander Stagun
Sorry for the confusion. What I tried to say was, that it does not matter, if I restrict the tables to be downloaded via "load data from master" with the "replicate-*" options in my.cnf. So I get the same behaviour when using either "replicate-do-table" or "replicate-do-db".
[28 Feb 2004 14:42]
Guilhem Bichot
Did the problem rise after an upgrade? At segfault time, is 'next' equal to 0 ?
[29 Feb 2004 8:51]
Guilhem Bichot
Does the problem happen with a MySQL binary built by yourself or with the official ones (in the latter case, which binary exactly?)? In the first case, it would be great to test the bug again with the official binary. Thank you!
[29 Feb 2004 8:52]
Guilhem Bichot
Another question: does the crash happen when you only do LOAD DATA FROM MASTER and not RESET SLAVE before ?
[7 Mar 2004 7:56]
Alexander Stagun
Hi Guilhem, sorry for the delay, but here is some more information: 1. The problem did not rise after an upgrade. It happened with 4.0.12 and after the upgrade to 4.0.18, too. 2. The 'next' pointer is not 0 at crash time: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 9226 (LWP 17394)] 0x08301866 in free_root (root=0x88af3c4, MyFlags=0) at my_alloc.c:239 239 old=next; next= next->next; Current language: auto; currently c (gdb) print next $1 = (struct st_used_mem *) 0x30736d73 (gdb) ptype next type = struct st_used_mem { struct st_used_mem *next; unsigned int left; unsigned int size; } * 3. The problem happens with a self built binary (debug/non-debug) and with an official binary downloaded from mysql.com (MySQL-server-4.0.18-0.i386.rpm - Standard) 4. The crash only happens when I do a "reset slave" before "load data from master".
[7 Mar 2004 11:06]
Guilhem Bichot
Hi! Glad to see you back, Alexander! I'll test with MySQL-server-4.0.18-0.i386.rpm in the next days (tomorrow probably). Meanwhile, could you please copy-paste your master's and slave's my.cnf (or if this is secret information, just email it to guilhem@mysql.com). I am looking for all info to reproduce the problem (which is a true bug, as I already had one report in the past). Thank you!!
[15 Mar 2004 6:23]
Guilhem Bichot
Hi, I was just able to repeat the segfault, 3 times. Every time, the stack trace was partly different, but the segfault always occured in a free(); so it's probably a memory corruption which has delayed effects. I am now working on hunting this. Thank you Alexander for your bug report!
[17 Mar 2004 1:41]
Guilhem Bichot
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bugfix, yourself. More information about accessing the source trees is available at http://www.mysql.com/doc/en/Installing_source_tree.html Additional info: Fixed in ChangeSet@1.1759.2.1, 2004-03-17 10:35:03+01:00, guilhem@mysql.com. It was writing to already-freed memory. Valgrind (http://valgrind.kde.org/) found it immediately :) Here is a patch before 4.0.19 is out. If you apply it and it does not fix the bug at your site, please re-open the bug report. --- 1.40/sql/repl_failsafe.cc Thu Mar 11 16:23:30 2004 +++ 1.41/sql/repl_failsafe.cc Wed Mar 17 10:35:00 2004 @@ -874,7 +874,7 @@ cleanup_mysql_results(db_res, cur_table_res - 1, table_res); - // adjust position in the master + // adjust replication coordinates from the master if (master_status_res) { MYSQL_ROW row = mc_mysql_fetch_row(master_status_res); @@ -887,10 +887,18 @@ */ if (row && row[0] && row[1]) { + /* + If the slave's master info is not inited, we init it, then we write + the new coordinates to it. Must call init_master_info() *before* + setting active_mi, because init_master_info() sets active_mi with + defaults. + */ + if (init_master_info(active_mi, master_info_file, relay_log_info_file, 0)) + send_error(&thd->net, ER_MASTER_INFO); strmake(active_mi->master_log_name, row[0], sizeof(active_mi->master_log_name)); active_mi->master_log_pos = strtoull(row[1], (char**) 0, 10); - // don't hit the magic number + // at least in recent versions, the condition below should be false if (active_mi->master_log_pos < BIN_LOG_HEADER_SIZE) active_mi->master_log_pos = BIN_LOG_HEADER_SIZE; active_mi->rli.pending = 0; Thank you a lot for your bug report!