MySQL Bugs: #106839: provide counters for error 013730 (wait

Bug #106839	provide counters for error 013730 (wait_timeout period exceeded)
Submitted:	25 Mar 2022 14:43	Modified:	30 Mar 2022 10:59
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Performance Schema	Severity:	S4 (Feature request)
Version:	8.0.28	OS:	Any
Assigned to:	Marc ALFF	CPU Architecture:	Any
Tags:	counter, Monitoring, wait_timeout

Description:

Sometimes I see in my logs:

2022-03-25T08:32:49.966330Z 1298480 [Note] [MY-013730] [Server] 'wait_timeout' period of 9999 seconds was exceeded for `some_user`@`%`. The idle time since last command was too long.

This generates quite a lot of noise on the server which is sort of ok but I don't believe there is a counter anywhere of the number of times this occurs or even (in  P_S) the number of times this occurs per username.

Many clients seems to be lazy and not check wait_timeout and so get disconnected unexpectedly, which means on their next query they'll get an error and need to reconnect so it would be good to monitor how many times this happens, ideally by username so the appropriate user or application can be notified that it is not behaving appropriately. (often the devs or users are not really aware of this).

How to repeat:
Trigger the issue and then look for metrics of how often it happens.

I see on one server I was looking at 22k entries just for today in the mysqld.log (log-error)

Suggested fix:
- Provide a single counter for this type of event (e.g. Wait_timeout_exceeded), or
- provide a counter per username for this type of event.

Hello Simon,

Thank you for the feature request!

regards,
Umesh

Instrumentation for errors is available already, since version 8.0.

See
dev.mysql.com/doc/refman/8.0/en/performance-schema-error-summary-tables.html

mysql> show tables like "%errors_summary%";
+-------------------------------------------------+
| Tables_in_performance_schema (%errors_summary%) |
+-------------------------------------------------+
| events_errors_summary_by_account_by_error       |
| events_errors_summary_by_host_by_error          |
| events_errors_summary_by_thread_by_error        |
| events_errors_summary_by_user_by_error          |
| events_errors_summary_global_by_error           |
+-------------------------------------------------+
5 rows in set (0.01 sec)

mysql> select * from events_errors_summary_global_by_error where ERROR_NUMBER
= 13730 \G
*************************** 1. row ***************************
     ERROR_NUMBER: 13730
       ERROR_NAME: ER_NET_WAIT_ERROR2
        SQL_STATE: HY000
 SUM_ERROR_RAISED: 0
SUM_ERROR_HANDLED: 0
       FIRST_SEEN: NULL
        LAST_SEEN: NULL
1 row in set (0.01 sec)

Closing,
already implemented by:
WL#8058 PERFORMANCE_SCHEMA, INSTRUMENT SERVER ERRORS
in version 8.0.0

This doesn't seem to work as well as I would like:

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(1000) */ count(*) from events_errors_summary_by_user_by_error ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_user_by_error ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(100000) */ count(*) from events_errors_summary_by_user_by_error ;
+----------+
| count(*) |
+----------+
|   311236 |
+----------+
1 row in set (1 min 22.35 sec)

I'm not sure what the query is doing but it concerns me it takes > ~10 seconds to run. Is this going to disrupt other users of the system? If so running it does not seem practical atm.

Seen on 8.0.28.

Comparing the other tables I see:

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_account_by_error  ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_host_by_error   ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_thread_by_error   ;
+----------+
| count(*) |
+----------+
|  1711016 |
+----------+
1 row in set (0.29 sec)

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_global_by_error   ;
+----------+
| count(*) |
+----------+
|     5035 |
+----------+
1 row in set (0.63 sec)

So clearly there's quite a bit of information there. Maybe good for me to investigate further but polling this (periodically) as things stand looks rather dangerous as I'm unsure on the impact it may trigger on current users of the system.  Optimising retrieval speeds of this data would be good.

For some context of this system it's doing about 30k qps of which ~500 are DML changes/sec. 900 connected users and 100 active threads.  The system is clearly busy but not overloaded in any way.

Hi Simon,

Thanks for sharing some details.

The performance issue seen with error summaries
relates to this bug:
  Bug#89379 summary_by_user tables get extremely slow with many users.

Please see additional comments there.

Regards,
-- Marc