Description:
Many tests are slow and sporadically fail on pushbuild because they time out.
Sporadic failures are difficult to debug and therefore cost a lot of developer time. Also, the more spurious failures we have, the more likely it is that real failures will be ignored.
How to repeat:
This shows the 50 tests that failed with a timeout most frequently from July 2008 to now:
$ echo '
  use pushbuild;
  SELECT sum(1) as _sum, SUBSTRING_INDEX(tfl_name, " ", 1) as _name
    FROM test_failure
    LEFT JOIN push
      ON test_failure.tfl_psh_name = push.psh_name
      WHERE psh_date >= "2008-07-01"
        AND tfl_text like "%fail ]  timeout%"
    GROUP BY _name
    ORDER BY _sum DESC
    LIMIT 50' | mysql -P 8966 -ureadonly -hproduction.mysql.com
_sum	_name
401	main.kill
234	main.completion_type_func
221	main.mysql_client_test
215	rpl.rpl_ssl1
210	rpl.rpl_ssl
136	main.innodb_autoinc_lock_mode_func
117	ndb_binlog.ndb_binlog_restore
106	rpl.rpl_heartbeat
80	main.query_prealloc_size_basic_64
79	ndb_team.ndb_autodiscover3
77	main.bootstrap
76	rpl_ndb.rpl_ndb_idempotent
73	main.information_schema
69	ndb.ndb_alter_table_online2
63	main.events_bugs
48	rpl.rpl_sporadic_master
47	rpl_ndb.rpl_ndb_multi
45	main.events_scheduling
45	main.maria3
42	main.mysqldump
35	falcon.falcon_bug_30124
35	falcon.falcon_bug_34351_A
30	main.transaction_prealloc_size_basic_64
29	falcon.falcon_bug_37080
28	ndb_binlog.ndb_binlog_multi
28	binlog.binlog_killed
27	rpl_ndb.rpl_ndb_row_001
26	ndb_binlog.ndb_binlog_ddl_multi
26	ndb_binlog.ndb_binlog_log_bin
26	rpl.rpl_row_max_relay_size
25	main.count_distinct3
23	main.innodb_max_dirty_pages_pct_func
19	ndb.ndb_condition_pushdown
19	rpl.rpl_stm_reset_slave
19	ndb.ndb_restore
19	main.event_scheduler_basic
18	main.locktrans_innodb
18	rpl_ndb.rpl_ndb_sync
17	main.subselect
17	ndb.ndb_index_unique
16	rpl_ndb.rpl_ndb_extraCol
16	ndb.ndb_blob_restore
14	ndb.ndb_insert
14	main.subselect_no_mat
13	ndb.ndb_dd_restore_compat
13	main.backup_db_grants
13	rpl_ndb.rpl_ndb_ddl
12	rpl.rpl_start_stop_slave
12	maria.maria3
12	main.index_merge_myisam
That's about 3000 failures (3003 in the 50 rows above). The eight tests that failed more than 100 times account for 1640 of them, a little over half.
Suggested fix:
Augment mtr with a way to specify individual timeouts for tests.
Then raise the timeouts of the tests that account for 90% or so of the timeout failures.
Some possible approaches:
 (1) Add a test language command to increase the timeout for a test. This has the advantage that include/wait_condition.inc, include/wait_slave_param.inc, etc. can increase the test case timeout by the time they slept, so tests that inherently wait will automatically get a longer timeout.
 (2) Instead of adding a test language command, we could just make the test language's sleep function increase the timeout. (Less flexible, but more automatic.)
Not sure how to handle the fact that timeouts are controlled by mtr, which is agnostic of the test language.
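The idea shared by both approaches, extending a test's deadline by the time it deliberately spends waiting, can be modelled roughly as below. This is not mtr or mysqltest code: it is a Python model in which the names CaseDeadline and wait_condition, and all parameters, are hypothetical. It also keeps the timer and the wait loop in one process, so it does not answer the open question of how the test language would tell mtr about the extension.

```python
import time


class CaseDeadline:
    """Hypothetical per-test deadline that can be pushed back."""

    def __init__(self, base_timeout, clock=time.monotonic):
        self._clock = clock
        self.deadline = clock() + base_timeout

    def extend(self, seconds):
        # Approach (1): an explicit "increase timeout" command would call this.
        self.deadline += seconds

    def expired(self):
        return self._clock() > self.deadline


def wait_condition(deadline, condition, interval=0.1, max_wait=30.0,
                   sleep=time.sleep):
    """Poll `condition`, crediting every sleep back to the test's deadline.

    Models what an include file such as wait_condition.inc could do under
    approach (1), or what the sleep function itself could do under (2).
    Returns False if the condition did not become true within max_wait.
    """
    waited = 0.0
    while not condition():
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
        deadline.extend(interval)  # time spent waiting does not count
    return True
```

With this model a test that sleeps for five seconds while waiting for a slave to catch up ends with its deadline five seconds later than it started, so only the time spent doing real work counts against the base timeout. The max_wait bound keeps a test that waits forever from extending its deadline without limit.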
  
 
 