MySQL Bugs: #41004: events_scheduling times out sporadically on pushbuild

Bug #41004	events_scheduling times out sporadically on pushbuild
Submitted:	25 Nov 2008 10:57	Modified:	8 Sep 2009 9:14
Reporter:	Sven Sandberg	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	5.1, 6.0	OS:	Any
Assigned to:	Georgi Kodinov	CPU Architecture:	Any
Tags:	events_scheduling, pushbuild, sporadic, test failure

Description:
events_scheduling times out sporadically in pushbuild, all trees. Here is a failure from 5.1-rpl where there is some debug info:

main.events_scheduling                   [ fail ]  timeout after 15 minutes

Test case timeout after 15 minute(s)

 == /dev/shm/var-n_mix-120/tmp/analyze-timeout-mysqld.1.err ==
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info
544	root	localhost	events_test	Query	902	Waiting scheduler to stop	SET GLOBAL event_scheduler=0
545	event_scheduler	localhost	NULL	Killed	902	Waiting on empty queue	NULL
546	root	localhost	NULL	Query	0	NULL	SHOW PROCESSLIST

 - saving '/dev/shm/var-n_mix-120/log/main.events_scheduling/' to '/dev/shm/var-n_mix-120/log/main.events_scheduling/'

Retrying test, attempt(2/3)...

How to repeat:
https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-5.1-rpl&order=82 sles10-ia64-a-1/n_mix
xref: http://tinyurl.com/5jsez7

Some estimation
---------------
  There are 9 places where include/wait_condition.inc
  is sourced. Worst case is that we need in every case
  nearly $wait_timeout seconds (default of 30 seconds
  is never overwritten).
  ==> 270 seconds
    +  30 seconds for execution of the other stuff
    -----
      300 seconds
  which is far away of the usual 900 seconds testcase
  timeout.
  So it looks like the problem is outside of the existing
  wait routines.

analyze-timeout-mysqld.1.err shows a process with info
like "%SET GLOBAL event_scheduler=0".
This command is only in line 14 of the script.

Replay script:
--------------
--disable_query_log
let $num = 10000;
while ($num)
{
   SET GLOBAL event_scheduler=1;
   SET GLOBAL event_scheduler=0;
   dec $num;
}
--enable_query_log
--echo # End

Expected results:
-----------------
# End

How to replay:
--------------
./mysql-test-run.pl --mem \
  `perl -e 'for (1 .. 100) { print "<script> "}'`
A parallel compile increases the likelihood to get
the problem. You do not need to wait till the
testcase timeout of 900 seconds has passed.
If one test has not passed after ~ 10 seconds
and several executions of
   wc -l  var/log/master.log
do not show an increasing amount of line than
you have replayed the problem.

This is a bug within the events scheduler.
IMHO this bug is
- most probably not very critical for customers
  1. It is not extreme likely because it requires
     some load on the testing box.
  2. It is rather unlikely that somebody runs
     switch event_scheduler on / switch event_scheduler off
     without nearly no timespan between.
  3. The bug itself does not produce additional heavy
     problems like for example excessive CPU load.
  4. Of course the events_scheduling is somehow out
     of order till next server shutdown/restart
  5. I tried from another connection
     CREATE TABLE, INSERT and they worked
     Output of "SELECT * FROM information_schema.processlist" +-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
| ID    | USER            | HOST      | DB   | COMMAND | TIME | STATE                     | INFO                                         |
+-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
| 18991 | root            | localhost | NULL | Query   |    0 | executing                 | SELECT * FROM information_schema.processlist |
| 18986 | event_scheduler | localhost | NULL | Killed  |  624 | Waiting on empty queue    | NULL                                         |
| 10002 | root            | localhost | test | Query   |  624 | Waiting scheduler to stop | SET GLOBAL event_scheduler=0                 |
+-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
     This fits very good to the reported effect.
- is very annoying when running our tests
  ~ 157 times testcase timeout when running this test
    during the last 15 months
  - there are other events tests which also fail often with
    testcase timeout although the reason might be sometimes
    a different one (unfortunate poll routines etc.)

Tried the latest 5.1-bugteam codebase. I was able to run the test case from Matthias 100 times without a timeout (through the perl line provided). And since the last pb occurrence was on 5 May 2009 I assume the problem is no longer repeatable in 5.1.