Description:
Many tests are slow, causing sporadic failures on pushbuild due to timeouts.
Sporadic failures are difficult to debug and therefore cost a lot of developer time. Also, the more spurious failures we have, the more likely it is that real failures will be ignored.
How to repeat:
This shows the 50 tests that failed with a timeout most frequently from July 2008 to now:
$ echo '
use pushbuild;
SELECT sum(1) as _sum, SUBSTRING_INDEX(tfl_name, " ", 1) as _name
FROM test_failure
LEFT JOIN push
ON test_failure.tfl_psh_name = push.psh_name
WHERE psh_date >= "2008-07-01"
AND tfl_text like "%fail ] timeout%"
GROUP BY _name
ORDER BY _sum DESC
LIMIT 50' | mysql -P 8966 -ureadonly -hproduction.mysql.com
_sum _name
401 main.kill
234 main.completion_type_func
221 main.mysql_client_test
215 rpl.rpl_ssl1
210 rpl.rpl_ssl
136 main.innodb_autoinc_lock_mode_func
117 ndb_binlog.ndb_binlog_restore
106 rpl.rpl_heartbeat
80 main.query_prealloc_size_basic_64
79 ndb_team.ndb_autodiscover3
77 main.bootstrap
76 rpl_ndb.rpl_ndb_idempotent
73 main.information_schema
69 ndb.ndb_alter_table_online2
63 main.events_bugs
48 rpl.rpl_sporadic_master
47 rpl_ndb.rpl_ndb_multi
45 main.events_scheduling
45 main.maria3
42 main.mysqldump
35 falcon.falcon_bug_30124
35 falcon.falcon_bug_34351_A
30 main.transaction_prealloc_size_basic_64
29 falcon.falcon_bug_37080
28 ndb_binlog.ndb_binlog_multi
28 binlog.binlog_killed
27 rpl_ndb.rpl_ndb_row_001
26 ndb_binlog.ndb_binlog_ddl_multi
26 ndb_binlog.ndb_binlog_log_bin
26 rpl.rpl_row_max_relay_size
25 main.count_distinct3
23 main.innodb_max_dirty_pages_pct_func
19 ndb.ndb_condition_pushdown
19 rpl.rpl_stm_reset_slave
19 ndb.ndb_restore
19 main.event_scheduler_basic
18 main.locktrans_innodb
18 rpl_ndb.rpl_ndb_sync
17 main.subselect
17 ndb.ndb_index_unique
16 rpl_ndb.rpl_ndb_extraCol
16 ndb.ndb_blob_restore
14 ndb.ndb_insert
14 main.subselect_no_mat
13 ndb.ndb_dd_restore_compat
13 main.backup_db_grants
13 rpl_ndb.rpl_ndb_ddl
12 rpl.rpl_start_stop_slave
12 maria.maria3
12 main.index_merge_myisam
That's about 3000 failures (3003 in the 50 rows listed). The eight tests that failed more than 100 times account for 1640 of them, a little over half.
Suggested fix:
Augment mtr with a way to specify individual timeouts for tests.
Then raise the timeouts of the tests that account for 90% or so of the timeout failures.
Some possible approaches:
(1) Add a test language command to increase the timeout for a test. This has the advantage that include/wait_condition.inc, include/wait_slave_param.inc etc. can increase the test case timeout by the time they slept, so tests that inherently wait will automatically get longer timeouts.
(2) Instead of adding a test language command, we could just make the test's sleep function increase the timeout. (less flexible, but more automatic)
Open question: timeouts are enforced by mtr, which is agnostic of the test language, so I'm not sure how a test would communicate an increased timeout to it.
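The idea common to (1) and (2) can be sketched as a deadline that moves forward by the time a test deliberately spends waiting. This is only an illustration in Python, not mtr or mysqltest code: the class TestDeadline and its methods are hypothetical names, and the sketch ignores the question of how the test runner would learn about the extension.

```python
import time


class TestDeadline:
    """Hypothetical per-test timeout that grows by the time spent in explicit waits."""

    def __init__(self, base_timeout, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + base_timeout

    def extend(self, seconds):
        # Approach (1): an explicit command that pushes the deadline back.
        self._deadline += seconds

    def sleep(self, seconds, sleeper=time.sleep):
        # Approach (2): the sleep itself pushes the deadline back,
        # so time spent waiting does not count against the test.
        sleeper(seconds)
        self.extend(seconds)

    def expired(self):
        return self._clock() >= self._deadline


if __name__ == "__main__":
    deadline = TestDeadline(base_timeout=0.2)
    deadline.sleep(0.3)        # waits longer than the base timeout
    print(deadline.expired())  # the wait itself was credited back
```

A wait helper such as include/wait_condition.inc would, under approach (1), call the equivalent of extend() with the time it slept; under approach (2) nothing in the test would need to change.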