Description:
Many tests are slow, causing sporadic failures on pushbuild due to timeouts.
Sporadic failures are difficult to debug and therefore cost a lot of developer time. Also, the more spurious failures we have, the more likely it is that real failures will be ignored.
How to repeat:
This shows the 50 tests that failed with a timeout most frequently from July 2008 to now:
$ echo '
use pushbuild;
SELECT sum(1) as _sum, SUBSTRING_INDEX(tfl_name, " ", 1) as _name
FROM test_failure
LEFT JOIN push
ON test_failure.tfl_psh_name = push.psh_name
WHERE psh_date >= "2008-07-01"
AND tfl_text like "%fail ] timeout%"
GROUP BY _name
ORDER BY _sum DESC
LIMIT 50' | mysql -P 8966 -ureadonly -hproduction.mysql.com
_sum _name
401 main.kill
234 main.completion_type_func
221 main.mysql_client_test
215 rpl.rpl_ssl1
210 rpl.rpl_ssl
136 main.innodb_autoinc_lock_mode_func
117 ndb_binlog.ndb_binlog_restore
106 rpl.rpl_heartbeat
80 main.query_prealloc_size_basic_64
79 ndb_team.ndb_autodiscover3
77 main.bootstrap
76 rpl_ndb.rpl_ndb_idempotent
73 main.information_schema
69 ndb.ndb_alter_table_online2
63 main.events_bugs
48 rpl.rpl_sporadic_master
47 rpl_ndb.rpl_ndb_multi
45 main.events_scheduling
45 main.maria3
42 main.mysqldump
35 falcon.falcon_bug_30124
35 falcon.falcon_bug_34351_A
30 main.transaction_prealloc_size_basic_64
29 falcon.falcon_bug_37080
28 ndb_binlog.ndb_binlog_multi
28 binlog.binlog_killed
27 rpl_ndb.rpl_ndb_row_001
26 ndb_binlog.ndb_binlog_ddl_multi
26 ndb_binlog.ndb_binlog_log_bin
26 rpl.rpl_row_max_relay_size
25 main.count_distinct3
23 main.innodb_max_dirty_pages_pct_func
19 ndb.ndb_condition_pushdown
19 rpl.rpl_stm_reset_slave
19 ndb.ndb_restore
19 main.event_scheduler_basic
18 main.locktrans_innodb
18 rpl_ndb.rpl_ndb_sync
17 main.subselect
17 ndb.ndb_index_unique
16 rpl_ndb.rpl_ndb_extraCol
16 ndb.ndb_blob_restore
14 ndb.ndb_insert
14 main.subselect_no_mat
13 ndb.ndb_dd_restore_compat
13 main.backup_db_grants
13 rpl_ndb.rpl_ndb_ddl
12 rpl.rpl_start_stop_slave
12 maria.maria3
12 main.index_merge_myisam
That's about 3000 failures (3003 across the 50 tests listed). The eight tests that failed more than 100 times account for 1640 of them, slightly more than half.
Suggested fix:
Augment mtr with a way to specify individual timeouts for tests.
Then raise the timeouts of the tests that account for roughly 90% of the timeout failures.
Some possible approaches:
(1) Add a test language command that increases the timeout for a test. This has the advantage that include/wait_condition.inc, include/wait_slave_param.inc etc. can increase the test case timeout by the time they slept, so tests that inherently wait will automatically increase their timeouts.
(2) Instead of adding a test language command, we could just make the test's sleep function increase the timeout. (Less flexible, but more automatic.)
Open question: timeouts are controlled by mtr, which is agnostic of the test language, so I'm not sure how a test would pass the increase on to mtr.
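The idea shared by both approaches, crediting slept time back to the test's timeout, can be sketched as follows. This is only an illustration in Python, not mtr or test language code: TestDeadline, wait_condition and FakeClock are hypothetical names, and the sketch assumes the waiting code and the timeout check live in the same process, which is exactly what the open question above says is not the case today.

```python
import time


class TestDeadline:
    """Hypothetical per-test deadline that a test runner would check."""

    def __init__(self, timeout_s, now=time.monotonic):
        self._now = now
        self.deadline = now() + timeout_s

    def extend(self, seconds):
        # Push the deadline back, e.g. by the time spent sleeping.
        self.deadline += seconds

    def expired(self):
        return self._now() >= self.deadline


def wait_condition(deadline, condition, poll_s, max_wait_s, sleep=time.sleep):
    """Poll until condition() is true, extending the deadline by each sleep."""
    waited = 0.0
    while waited < max_wait_s:
        if condition():
            return True
        sleep(poll_s)
        waited += poll_s
        deadline.extend(poll_s)  # slept time does not count against the test
    return False


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


clock = FakeClock()
deadline = TestDeadline(10, now=clock.now)

# The condition becomes true only after 20 polls of one second each.
answers = iter([False] * 20 + [True])
ok = wait_condition(deadline, lambda: next(answers),
                    poll_s=1.0, max_wait_s=30.0, sleep=clock.sleep)

print(ok, clock.now(), deadline.deadline, deadline.expired())
```

In this run the test waits 20 seconds against a 10 second timeout, but the deadline moves to 30 seconds, so the test is not reported as timed out. The max_wait_s bound matters: without it, a condition that never becomes true would extend the deadline forever and a genuinely hung test would never time out.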