Bug #41004 events_scheduling times out sporadically on pushbuild
Submitted: 25 Nov 2008 11:57 Modified: 8 Sep 11:14
Reporter: Sven Sandberg
Status: Can't repeat
Category:Server Severity:S2 (Serious)
Version:5.1, 6.0 OS:Any
Assigned to: Georgi Kodinov Target Version:5.1+
Tags: pushbuild, sporadic, test failure, events_scheduling
Triage: Triaged: D3 (Medium)

[25 Nov 2008 11:57] Sven Sandberg
Description:
events_scheduling times out sporadically in pushbuild, all trees. Here is a failure from
5.1-rpl where there is some debug info:

main.events_scheduling                   [ fail ]  timeout after 15 minutes

Test case timeout after 15 minute(s)

 == /dev/shm/var-n_mix-120/tmp/analyze-timeout-mysqld.1.err ==
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info
544	root	localhost	events_test	Query	902	Waiting scheduler to stop	SET GLOBAL
event_scheduler=0
545	event_scheduler	localhost	NULL	Killed	902	Waiting on empty queue	NULL
546	root	localhost	NULL	Query	0	NULL	SHOW PROCESSLIST

 - saving '/dev/shm/var-n_mix-120/log/main.events_scheduling/' to
'/dev/shm/var-n_mix-120/log/main.events_scheduling/'

Retrying test, attempt(2/3)...

How to repeat:
https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=bzr_mysql-5.1-rpl&order=82
sles10-ia64-a-1/n_mix
xref: http://tinyurl.com/5jsez7
[8 Dec 2008 19:54] Matthias Leich
Some estimation
---------------
  There are 9 places where include/wait_condition.inc
  is sourced. Worst case is that we need in every case
  nearly $wait_timeout seconds (default of 30 seconds
  is never overwritten).
  ==> 270 seconds
    +  30 seconds for execution of the other stuff
    -----
      300 seconds
  which is far away of the usual 900 seconds testcase
  timeout.
  So it looks like the problem is outside of the existing
  wait routines.
[8 Dec 2008 22:41] Matthias Leich
analyze-timeout-mysqld.1.err shows a process with info
like "%SET GLOBAL event_scheduler=0".
This command is only in line 14 of the script.

Replay script:
--------------
--disable_query_log
let $num = 10000;
while ($num)
{
   SET GLOBAL event_scheduler=1;
   SET GLOBAL event_scheduler=0;
   dec $num;
}
--enable_query_log
--echo # End

Expected results:
-----------------
# End

How to replay:
--------------
./mysql-test-run.pl --mem \
  `perl -e 'for (1 .. 100) { print "<script> "}'`
A parallel compile increases the likelihood to get
the problem. You do not need to wait till the
testcase timeout of 900 seconds has passed.
If one test has not passed after ~ 10 seconds
and several executions of
   wc -l  var/log/master.log
do not show an increasing amount of line than
you have replayed the problem.

This is a bug within the events scheduler.
IMHO this bug is
- most probably not very critical for customers
  1. It is not extreme likely because it requires
     some load on the testing box.
  2. It is rather unlikely that somebody runs
     switch event_scheduler on / switch event_scheduler off
     without nearly no timespan between.
  3. The bug itself does not produce additional heavy
     problems like for example excessive CPU load.
  4. Of course the events_scheduling is somehow out
     of order till next server shutdown/restart
  5. I tried from another connection
     CREATE TABLE, INSERT and they worked
     Output of "SELECT * FROM information_schema.processlist"
+-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
| ID    | USER            | HOST      | DB   | COMMAND | TIME | STATE                    
| INFO                                         |
+-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
| 18991 | root            | localhost | NULL | Query   |    0 | executing                
| SELECT * FROM information_schema.processlist |
| 18986 | event_scheduler | localhost | NULL | Killed  |  624 | Waiting on empty queue   
| NULL                                         |
| 10002 | root            | localhost | test | Query   |  624 | Waiting scheduler to stop
| SET GLOBAL event_scheduler=0                 |
+-------+-----------------+-----------+------+---------+------+---------------------------+----------------------------------------------+
     This fits very good to the reported effect.
- is very annoying when running our tests
  ~ 157 times testcase timeout when running this test
    during the last 15 months
  - there are other events tests which also fail often with
    testcase timeout although the reason might be sometimes
    a different one (unfortunate poll routines etc.)
[8 Sep 11:14] Georgi Kodinov
Tried the latest 5.1-bugteam codebase. I was able to run the test case from Matthias 100
times without a timeout (through the perl line provided). And since the last pb
occurrence was on 5 May 2009 I assume the problem is no longer repeatable in 5.1.