Bug #57773 | RQG: Unable to start server with new server module: Timeout on SPARC | ||
---|---|---|---|
Submitted: | 27 Oct 2010 14:37 | Modified: | 4 Nov 2010 15:23 |
Reporter: | John Embretsen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | Tools: Random Query Generator | Severity: | S1 (Critical) |
Version: | trunk | OS: | Solaris (SPARC) |
Assigned to: | John Embretsen | CPU Architecture: | Any |
Tags: | rqg_pb2 |
[27 Oct 2010 14:37]
John Embretsen
[28 Oct 2010 13:58]
John Embretsen
It seems like this is related to the --pid-file option of mysqld. When starting the server without this option I see the same error about database name in the error log, and the server starts properly. I will change the status of the bug once I have this verified. So far I have only been able to reproduce on Solaris 10 SPARC with Pushbuild builds. Upon server re-start using --pid-file=$PWD/pidfile.txt, I got a segmentation fault with stacktrace as follows: t@1 (l@1) terminated by signal SEGV (Segmentation Fault) 0xffffffff7e4da210: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill+0x18 ! 0xffffffff7e4da220 Current function is my_write_core 328 pthread_kill(pthread_self(), sig); (dbx) where current thread: t@1 [1] __lwp_kill(0x0, 0xb, 0xffffffff7e648100, 0xffffffff7e64753c, 0xffffffff7f500200, 0x4), at 0xffffffff7e4da210 =>[2] my_write_core(sig = 11), line 328 in "stacktrace.c" [3] handle_segfault(sig = 11), line 2507 in "mysqld.cc" [4] __sighndlr(0xb, 0x0, 0xffffffff7fffc2f0, 0x1001ce700, 0x0, 0x0), at 0xffffffff7e4d62e0 ---- called from signal handler with signal 11 (SIGSEGV) ------ [5] buf_block_align_instance(buf_pool = 0x102433738, ptr = 0xffffffff74434000 ""), line 2742 in "buf0buf.c" [6] buf_block_align(ptr = 0xffffffff74434000 ""), line 2771 in "buf0buf.c" [7] mtr_memo_contains_page(mtr = 0xffffffff7fffcac8, ptr = 0xffffffff74434000 "", type = 2U), line 394 in "mtr0mtr.c" [8] ibuf_bitmap_page_get_bits(page = 0xffffffff74434000 "", page_no = 7U, zip_size = 0, bit = 2U, mtr = 0xffffffff7fffcac8), line 686 in "ibuf0ibuf.c" [9] ibuf_merge_or_delete_for_page(block = 0xffffffff74000000, space = 0, page_no = 7U, zip_size = 0, update_ibuf_bitmap = 1U), line 4358 in "ibuf0ibuf.c" [10] buf_page_io_complete(bpage = 0xffffffff74000000), line 4389 in "buf0buf.c" [11] buf_read_page_low(err = 0xffffffff7fffd3c0, sync = 1U, mode = 132U, space = 0, zip_size = 0, unzip = 0, tablespace_version = 1LL, offset = 7U), line 158 in "buf0rea.c" [12] buf_read_page(space = 0, zip_size = 0, offset = 7U), line 190 in "buf0rea.c" [13] buf_page_get_gen(space = 0, zip_size = 0, offset = 7U, rw_latch = 2U, guess = (nil), mode = 10U, file = 0x100fd6420 "/export/home/tmp/je159969/bzr/server/mysql-trunk-bugfixing-pushbuild/storage/innobase/dict/dict0boot.c", line = 56U, mtr = 0xffffffff7fffd738), line 2966 in "buf0buf.c" [14] dict_hdr_get(mtr = 0xffffffff7fffd738), line 56 in "dict0boot.c" [15] dict_boot(), line 267 in "dict0boot.c" [16] innobase_start_or_create_for_mysql(), line 1640 in "srv0start.c" [17] innobase_init(p = 0x10155c0f0), line 2507 in "ha_innodb.cc" [18] ha_initialize_handlerton(plugin = 0x101d96440), line 464 in "handler.cc" [19] plugin_initialize(plugin = 0x101d96440), line 1041 in "sql_plugin.cc" [20] plugin_init(argc = 0x10152562c, argv = 0x10155b6b0, flags = 0), line 1327 in "sql_plugin.cc" [21] init_server_components(), line 4026 in "mysqld.cc" [22] mysqld_main(argc = 9, argv = 0x10155b6b0), line 4592 in "mysqld.cc" [23] main(argc = 9, argv = 0xffffffff7ffff068), line 24 in "main.cc"
[28 Oct 2010 16:15]
John Embretsen
It turns out that the issues mentioned earlier were not the real issues behind the server start failures in Pushbuild. The root cause is that some recent code change has made the server startup process significatly slower, especially on the Solaris SPARC machines used in Pushbuild (these machines are generally slower than the Linux and Windows machines). What seems to take the most time is the step "InnoDB: Doublewrite buffer not found: creating new" Further analysis is required to pinpoint the revision that caused this slowdown, and to determine if Solaris SPARC is the only affected platform and if so why. Some investigation should also be made into the issues mentioned earlier, i.e. the error message on startup (which occurs regardless of the timeout) and the segmentation fault upon re-starting from a previously failed attempt. In the meantime I will increase the wait time used by the RQG's Server module, so that it will wait longer for the server to start.
[28 Oct 2010 16:41]
John Embretsen
Patch pushed to lp:randgen. Bernt, you may do post-push review ;) ------------------------------------------------------------ revno: 579 revision-id: john.embretsen@oracle.com-20101028163659-e0z5x3g12poz0ajm parent: john.embretsen@oracle.com-20101028161736-oz15ehqxnh7d45nl committer: John H. Embretsen <john.embretsen@oracle.com> branch nick: randgen-push timestamp: Thu 2010-10-28 18:36:59 +0200 message: Server: Increase wait time from 20 s to 120 s until timing out a startup attempt. Startup time may drag out in some cases (bug#57773). modified: lib/DBServer/MySQL/MySQLd.pm mysql.pm-20100211152339-5ptwue9ddkbole9o-2 ------------------------------------------------------------ revno: 578 revision-id: john.embretsen@oracle.com-20101028161736-oz15ehqxnh7d45nl parent: bernt.johnsen@sun.com-20101028094730-fveu19b2eacbv5ab committer: John H. Embretsen <john.embretsen@oracle.com> branch nick: randgen-push timestamp: Thu 2010-10-28 18:17:36 +0200 message: Server: Print total wait time when server startup fails on Unix. modified: lib/DBServer/MySQL/MySQLd.pm mysql.pm-20100211152339-5ptwue9ddkbole9o-2 Diff: === modified file 'lib/DBServer/MySQL/MySQLd.pm' --- lib/DBServer/MySQL/MySQLd.pm 2010-08-11 19:40:07 +0000 +++ lib/DBServer/MySQL/MySQLd.pm 2010-10-28 16:23:53 +0000 @@ -325,14 +325,15 @@ $self->[MYSQLD_AUXPID] = fork(); if ($self->[MYSQLD_AUXPID]) { ## Wait for the pid file to have been created + my $wait_time = 0.2; my $waits = 0; - while (!-f $self->pidfile && $waits < 100) { - Time::HiRes::sleep(0.2); + while (!-f $self->pidfile && $waits < 600) { + Time::HiRes::sleep($wait_time); $waits++; } if (!-f $self->pidfile) { sayFile($self->logfile); - croak("Could not start mysql server"); + croak("Could not start mysql server, waited ".($waits*$wait_time)." seconds for pid file"); } my $pidfile = $self->pidfile; my $pid = `cat \"$pidfile\"`;
[4 Nov 2010 15:23]
John Embretsen
The increase of the server start wait timeout seems to have worked well. We still have an issue of an RQG test failing sometimes due to this performance degradation (most likely), although for other things than server startup. The test is called rqg_partn_pruning, and involves partitioning. The cause of the slow-down seems to be the enabling of UNIV_DEBUG in debug builds. For details, see Bug#57981. Closing this bug.