MySQL Bugs: #57773: RQG: Unable to start server with new server module: Timeout on SPARC

Bug #57773	RQG: Unable to start server with new server module: Timeout on SPARC
Submitted:	27 Oct 2010 14:37	Modified:	4 Nov 2010 15:23
Reporter:	John Embretsen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	Tools: Random Query Generator	Severity:	S1 (Critical)
Version:	trunk	OS:	Solaris (SPARC)
Assigned to:	John Embretsen	CPU Architecture:	Any
Tags:	rqg_pb2

Description:
As of 2010-10-26, daily RQG test runs in trunk-bugfixing and next-mr-bugfixing branches fail on all Solaris platforms when using pb2gentest-new.pl or runall-new.pl, which uses RQG's own server management module rather than MTR to start the server.

Symptoms:

[ERROR] Invalid (old?) table or database name '-@'

in the server's error log, resulting in:

Could not start mysql server at runall-new.pl line 260
serverqa.rqg_partn_pruning [ fail ]
runall-new.pl failed with exit code 2 

How to repeat:
perl ./runall-new.pl --basedir=$CODE --grammar=conf/examples/example.yy --duration=20 --threads=1

The issue seems to be specific to Solaris.

It seems like this is related to the --pid-file option of mysqld. When starting the server without this option I see the same error about database name in the error log, and the server starts properly. I will change the status of the bug once I have this verified.

So far I have only been able to reproduce on Solaris 10 SPARC with Pushbuild builds.

Upon server re-start using --pid-file=$PWD/pidfile.txt, I got a segmentation fault with stacktrace as follows:

t@1 (l@1) terminated by signal SEGV (Segmentation Fault)
0xffffffff7e4da210: __lwp_kill+0x0008:	bcc,a,pt  %icc,__lwp_kill+0x18	! 0xffffffff7e4da220
Current function is my_write_core
  328     pthread_kill(pthread_self(), sig);
(dbx) where
current thread: t@1
  [1] __lwp_kill(0x0, 0xb, 0xffffffff7e648100, 0xffffffff7e64753c, 0xffffffff7f500200, 0x4), at 0xffffffff7e4da210 
=>[2] my_write_core(sig = 11), line 328 in "stacktrace.c"
  [3] handle_segfault(sig = 11), line 2507 in "mysqld.cc"
  [4] __sighndlr(0xb, 0x0, 0xffffffff7fffc2f0, 0x1001ce700, 0x0, 0x0), at 0xffffffff7e4d62e0 
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [5] buf_block_align_instance(buf_pool = 0x102433738, ptr = 0xffffffff74434000 ""), line 2742 in "buf0buf.c"
  [6] buf_block_align(ptr = 0xffffffff74434000 ""), line 2771 in "buf0buf.c"
  [7] mtr_memo_contains_page(mtr = 0xffffffff7fffcac8, ptr = 0xffffffff74434000 "", type = 2U), line 394 in "mtr0mtr.c"
  [8] ibuf_bitmap_page_get_bits(page = 0xffffffff74434000 "", page_no = 7U, zip_size = 0, bit = 2U, mtr = 0xffffffff7fffcac8), line 686 in "ibuf0ibuf.c"
  [9] ibuf_merge_or_delete_for_page(block = 0xffffffff74000000, space = 0, page_no = 7U, zip_size = 0, update_ibuf_bitmap = 1U), line 4358 in "ibuf0ibuf.c"
  [10] buf_page_io_complete(bpage = 0xffffffff74000000), line 4389 in "buf0buf.c"
  [11] buf_read_page_low(err = 0xffffffff7fffd3c0, sync = 1U, mode = 132U, space = 0, zip_size = 0, unzip = 0, tablespace_version = 1LL, offset = 7U), line 158 in "buf0rea.c"
  [12] buf_read_page(space = 0, zip_size = 0, offset = 7U), line 190 in "buf0rea.c"
  [13] buf_page_get_gen(space = 0, zip_size = 0, offset = 7U, rw_latch = 2U, guess = (nil), mode = 10U, file = 0x100fd6420 "/export/home/tmp/je159969/bzr/server/mysql-trunk-bugfixing-pushbuild/storage/innobase/dict/dict0boot.c", line = 56U, mtr = 0xffffffff7fffd738), line 2966 in "buf0buf.c"
  [14] dict_hdr_get(mtr = 0xffffffff7fffd738), line 56 in "dict0boot.c"
  [15] dict_boot(), line 267 in "dict0boot.c"
  [16] innobase_start_or_create_for_mysql(), line 1640 in "srv0start.c"
  [17] innobase_init(p = 0x10155c0f0), line 2507 in "ha_innodb.cc"
  [18] ha_initialize_handlerton(plugin = 0x101d96440), line 464 in "handler.cc"
  [19] plugin_initialize(plugin = 0x101d96440), line 1041 in "sql_plugin.cc"
  [20] plugin_init(argc = 0x10152562c, argv = 0x10155b6b0, flags = 0), line 1327 in "sql_plugin.cc"
  [21] init_server_components(), line 4026 in "mysqld.cc"
  [22] mysqld_main(argc = 9, argv = 0x10155b6b0), line 4592 in "mysqld.cc"
  [23] main(argc = 9, argv = 0xffffffff7ffff068), line 24 in "main.cc"

It turns out that the issues mentioned earlier were not the real issues behind the server start failures in Pushbuild.

The root cause is that some recent code change has made the server startup process significatly slower, especially on the Solaris SPARC machines used in Pushbuild (these machines are generally slower than the Linux and Windows machines).

What seems to take the most time is the step
"InnoDB: Doublewrite buffer not found: creating new"

Further analysis is required to pinpoint the revision that caused this slowdown, and to determine if Solaris SPARC is the only affected platform and if so why.

Some investigation should also be made into the issues mentioned earlier, i.e. the error message on startup (which occurs regardless of the timeout) and the segmentation fault upon re-starting from a previously failed attempt.

In the meantime I will increase the wait time used by the RQG's Server module, so that it will wait longer for the server to start.

Patch pushed to lp:randgen. Bernt, you may do post-push review ;)

------------------------------------------------------------
revno: 579
revision-id: john.embretsen@oracle.com-20101028163659-e0z5x3g12poz0ajm
parent: john.embretsen@oracle.com-20101028161736-oz15ehqxnh7d45nl
committer: John H. Embretsen <john.embretsen@oracle.com>
branch nick: randgen-push
timestamp: Thu 2010-10-28 18:36:59 +0200
message:
  Server: Increase wait time from 20 s to 120 s until timing out a startup attempt. Startup time may drag out in some cases (bug#57773).
modified:
  lib/DBServer/MySQL/MySQLd.pm   mysql.pm-20100211152339-5ptwue9ddkbole9o-2
------------------------------------------------------------
revno: 578
revision-id: john.embretsen@oracle.com-20101028161736-oz15ehqxnh7d45nl
parent: bernt.johnsen@sun.com-20101028094730-fveu19b2eacbv5ab
committer: John H. Embretsen <john.embretsen@oracle.com>
branch nick: randgen-push
timestamp: Thu 2010-10-28 18:17:36 +0200
message:
  Server: Print total wait time when server startup fails on Unix.
modified:
  lib/DBServer/MySQL/MySQLd.pm   mysql.pm-20100211152339-5ptwue9ddkbole9o-2

Diff:

=== modified file 'lib/DBServer/MySQL/MySQLd.pm'
--- lib/DBServer/MySQL/MySQLd.pm	2010-08-11 19:40:07 +0000
+++ lib/DBServer/MySQL/MySQLd.pm	2010-10-28 16:23:53 +0000
@@ -325,14 +325,15 @@
         $self->[MYSQLD_AUXPID] = fork();
         if ($self->[MYSQLD_AUXPID]) {
             ## Wait for the pid file to have been created
+            my $wait_time = 0.2;
             my $waits = 0;
-            while (!-f $self->pidfile && $waits < 100) {
-                Time::HiRes::sleep(0.2);
+            while (!-f $self->pidfile && $waits < 600) {
+                Time::HiRes::sleep($wait_time);
                 $waits++;
             }
             if (!-f $self->pidfile) {
                 sayFile($self->logfile);
-                croak("Could not start mysql server");
+                croak("Could not start mysql server, waited ".($waits*$wait_time)." seconds for pid file");
             }
             my $pidfile = $self->pidfile;
             my $pid = `cat \"$pidfile\"`;

The increase of the server start wait timeout seems to have worked well. 

We still have an issue of an RQG test failing sometimes due to this performance degradation (most likely), although for other things than server startup. The test is called rqg_partn_pruning, and involves partitioning.

The cause of the slow-down seems to be the enabling of UNIV_DEBUG in debug builds. For details, see Bug#57981.

Closing this bug.