MySQL Bugs: #40148: Slow tests cause many sporadic failures on pushbuild

Bug #40148	Slow tests cause many sporadic failures on pushbuild
Submitted:	19 Oct 2008 17:34	Modified:	4 Nov 2018 17:58
Reporter:	Sven Sandberg
Status:	Unsupported
Category:	Tests: Server	Severity:	S7 (Test Cases)
Version:	5.1	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	pushbuild, sporadic, test failure, timeout

Description:
Many tests are slow, causing sporadic failures on pushbuild due to timeouts.

Sporadic failures are difficult to debug, hence cost a lot of developer time. Also, the more spurious failures we have, the more real failures will be ignored.

How to repeat:
The following query lists the 50 tests that failed with a timeout most frequently between 1 July 2008 and the time of this report:

$ echo '
  use pushbuild;
  SELECT sum(1) as _sum, SUBSTRING_INDEX(tfl_name, " ", 1) as _name
    FROM test_failure
    LEFT JOIN push
      ON test_failure.tfl_psh_name = push.psh_name
      WHERE psh_date >= "2008-07-01"
        AND tfl_text like "%fail ]  timeout%"
    GROUP BY _name
    ORDER BY _sum DESC
    LIMIT 50' | mysql -P 8966 -ureadonly -hproduction.mysql.com
_sum	_name
401	main.kill
234	main.completion_type_func
221	main.mysql_client_test
215	rpl.rpl_ssl1
210	rpl.rpl_ssl
136	main.innodb_autoinc_lock_mode_func
117	ndb_binlog.ndb_binlog_restore
106	rpl.rpl_heartbeat
80	main.query_prealloc_size_basic_64
79	ndb_team.ndb_autodiscover3
77	main.bootstrap
76	rpl_ndb.rpl_ndb_idempotent
73	main.information_schema
69	ndb.ndb_alter_table_online2
63	main.events_bugs
48	rpl.rpl_sporadic_master
47	rpl_ndb.rpl_ndb_multi
45	main.events_scheduling
45	main.maria3
42	main.mysqldump
35	falcon.falcon_bug_30124
35	falcon.falcon_bug_34351_A
30	main.transaction_prealloc_size_basic_64
29	falcon.falcon_bug_37080
28	ndb_binlog.ndb_binlog_multi
28	binlog.binlog_killed
27	rpl_ndb.rpl_ndb_row_001
26	ndb_binlog.ndb_binlog_ddl_multi
26	ndb_binlog.ndb_binlog_log_bin
26	rpl.rpl_row_max_relay_size
25	main.count_distinct3
23	main.innodb_max_dirty_pages_pct_func
19	ndb.ndb_condition_pushdown
19	rpl.rpl_stm_reset_slave
19	ndb.ndb_restore
19	main.event_scheduler_basic
18	main.locktrans_innodb
18	rpl_ndb.rpl_ndb_sync
17	main.subselect
17	ndb.ndb_index_unique
16	rpl_ndb.rpl_ndb_extraCol
16	ndb.ndb_blob_restore
14	ndb.ndb_insert
14	main.subselect_no_mat
13	ndb.ndb_dd_restore_compat
13	main.backup_db_grants
13	rpl_ndb.rpl_ndb_ddl
12	rpl.rpl_start_stop_slave
12	maria.maria3
12	main.index_merge_myisam

That's roughly 3000 failures (3003 across the 50 tests listed). The eight tests that failed more than 100 times account for 1640 of them, slightly more than half.

Suggested fix:
Extend mtr (mysql-test-run) so that a timeout can be specified per test.

Then raise the timeouts of the tests that account for roughly 90% of the timeout failures.

Some possible approaches:

 (1) Add a test language command that increases the timeout for a test. The advantage is that include/wait_condition.inc, include/wait_slave_param.inc, etc. can extend the test case timeout by the time they slept, so tests that inherently wait extend their own timeouts automatically.

 (2) Instead of adding a test language command, make the test language's sleep function extend the timeout. This is less flexible, but more automatic.

I'm not sure how to handle the fact that timeouts are enforced by mtr, which knows nothing about the test language.
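The two approaches can be sketched as follows. This is only an illustration in Python, not mtr or mysqltest code: the class ExtendableTimeout, its methods, and the FakeClock helper are all hypothetical names. It also assumes the timeout lives in the same process as the sleeping code, which sidesteps the open question of how the test language would tell mtr about the extension.

```python
import time


class ExtendableTimeout:
    """Hypothetical per-test timeout that deliberate waits can push back."""

    def __init__(self, budget_seconds, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._deadline = clock() + budget_seconds

    def extend(self, seconds):
        # Approach (1): an explicit "increase the timeout" command.
        self._deadline += seconds

    def sleep(self, seconds):
        # Approach (2): sleeping extends the deadline automatically by the
        # time actually spent asleep.
        start = self._clock()
        self._sleep(seconds)
        self.extend(self._clock() - start)

    def expired(self):
        return self._clock() >= self._deadline


class FakeClock:
    """Stand-in clock so the demo below runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
timeout = ExtendableTimeout(10, clock=clock, sleep=clock.sleep)
timeout.sleep(30)            # waiting 30s moves the deadline from 10s to 40s
print(timeout.expired())     # the 10s budget for real work is still intact
clock.sleep(10)              # 10s of non-waiting work uses up the budget
print(timeout.expired())
```

With this design the budget only counts time the test spends doing work, so a test that waits a long time for a slave or a condition does not time out merely because the host is slow to reach that state.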

Old report, no longer relevant as such.