Description:
When many rules are scheduled against many instances and events are never closed, the purge of rule_eval_results can become very large, because the changes to each open event are tracked over time. The DELETE that takes a long time is:
DELETE FROM rule_eval_result_vars
USING rule_eval_result_vars, rule_eval_results
WHERE rule_eval_result_vars.result_id = rule_eval_results.result_id
AND alarm_id in (14, 70, 109, 112, 123, 126, 154, 168, 182, 196, 210, 224, 237, 238, 252, 293, 294, 308, 312, 321, 322, 350, 354, 364, 390, 406, 420, 434, 438, 448, 452, 462, 466, 476, 490, 517, 518, 531, 532, 546, 564, 574, 588, 602, 616, 644, 658, 672, 742, 798, 812, 840, 854, 868, 882, 886, 896, 910, 924, 938, 952, 966, 980, 994, 1008, 1027, 1036, 1041, 1050, 1064, 1077, 1078, 1092, 1096, 1106, 1120, 1124, 1134, 1148, 118 <snip long list with thousands of alarm_ids>
The large IN list in the WHERE clause of this DELETE can take several hours to run on a busy MEM instance and results in excessive I/O.
How to repeat:
Monitor a large number of instances and do not close any events. Watch as the alarm_id IN (list) grows and the DELETE takes longer and longer.
Suggested fix:
* Lower the number of alarm_ids that can be in the IN list at a time.
* Warn customers more clearly about the need to close open events.
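
To illustrate the first suggestion, here is a minimal sketch of splitting the alarm_id list into smaller batches so that each DELETE carries a bounded IN list. This is not how MEM builds the statement internally; the function name batched_deletes, the BATCH_SIZE value of 500, and the %s placeholder style (as used by common Python MySQL drivers) are all assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical batch size, chosen only for illustration.
BATCH_SIZE = 500

# Same DELETE as in the report, with the IN list left as a placeholder slot.
DELETE_TEMPLATE = (
    "DELETE FROM rule_eval_result_vars "
    "USING rule_eval_result_vars, rule_eval_results "
    "WHERE rule_eval_result_vars.result_id = rule_eval_results.result_id "
    "AND alarm_id IN ({placeholders})"
)


def batched_deletes(
    alarm_ids: Sequence[int], batch_size: int = BATCH_SIZE
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (sql, params) pairs, each covering at most batch_size alarm_ids."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(alarm_ids), batch_size):
        batch = list(alarm_ids[start:start + batch_size])
        # One %s placeholder per id, so the driver handles quoting.
        placeholders = ", ".join(["%s"] * len(batch))
        yield DELETE_TEMPLATE.format(placeholders=placeholders), batch
```

Each yielded pair could be passed to a driver's cursor.execute(sql, params) and committed separately, which keeps any single statement short at the cost of running more statements overall.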