Bug #106839 provide counters for error 013730 (wait_timeout period exceeded)
Submitted: 25 Mar 2022 14:43 Modified: 30 Mar 2022 10:59
Reporter: Simon Mudd (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Performance Schema Severity:S4 (Feature request)
Version:8.0.28 OS:Any
Assigned to: Marc ALFF CPU Architecture:Any
Tags: counter, Monitoring, wait_timeout

[25 Mar 2022 14:43] Simon Mudd
Description:

Sometimes I see in my logs:

2022-03-25T08:32:49.966330Z 1298480 [Note] [MY-013730] [Server] 'wait_timeout' period of 9999 seconds was exceeded for `some_user`@`%`. The idle time since last command was too long.

This generates quite a lot of noise on the server which is sort of ok but I don't believe there is a counter anywhere of the number of times this occurs or even (in  P_S) the number of times this occurs per username.

Many clients seems to be lazy and not check wait_timeout and so get disconnected unexpectedly, which means on their next query they'll get an error and need to reconnect so it would be good to monitor how many times this happens, ideally by username so the appropriate user or application can be notified that it is not behaving appropriately. (often the devs or users are not really aware of this).

How to repeat:
Trigger the issue and then look for metrics of how often it happens.

I see on one server I was looking at 22k entries just for today in the mysqld.log (log-error)

Suggested fix:
- Provide a single counter for this type of event (e.g. Wait_timeout_exceeded), or
- provide a counter per username for this type of event.
[25 Mar 2022 15:47] MySQL Verification Team
Hello Simon,

Thank you for the feature request!

regards,
Umesh
[28 Mar 2022 9:54] Marc ALFF
Instrumentation for errors is available already, since version 8.0.

See
dev.mysql.com/doc/refman/8.0/en/performance-schema-error-summary-tables.html

mysql> show tables like "%errors_summary%";
+-------------------------------------------------+
| Tables_in_performance_schema (%errors_summary%) |
+-------------------------------------------------+
| events_errors_summary_by_account_by_error       |
| events_errors_summary_by_host_by_error          |
| events_errors_summary_by_thread_by_error        |
| events_errors_summary_by_user_by_error          |
| events_errors_summary_global_by_error           |
+-------------------------------------------------+
5 rows in set (0.01 sec)

mysql> select * from events_errors_summary_global_by_error where ERROR_NUMBER
= 13730 \G
*************************** 1. row ***************************
     ERROR_NUMBER: 13730
       ERROR_NAME: ER_NET_WAIT_ERROR2
        SQL_STATE: HY000
 SUM_ERROR_RAISED: 0
SUM_ERROR_HANDLED: 0
       FIRST_SEEN: NULL
        LAST_SEEN: NULL
1 row in set (0.01 sec)
[28 Mar 2022 9:58] Marc ALFF
Closing,
already implemented by:
WL#8058 PERFORMANCE_SCHEMA, INSTRUMENT SERVER ERRORS
in version 8.0.0
[30 Mar 2022 10:59] Simon Mudd
This doesn't seem to work as well as I would like:

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(1000) */ count(*) from events_errors_summary_by_user_by_error ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_user_by_error ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(100000) */ count(*) from events_errors_summary_by_user_by_error ;
+----------+
| count(*) |
+----------+
|   311236 |
+----------+
1 row in set (1 min 22.35 sec)

I'm not sure what the query is doing but it concerns me it takes > ~10 seconds to run. Is this going to disrupt other users of the system? If so running it does not seem practical atm.

Seen on 8.0.28.

Comparing the other tables I see:

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_account_by_error  ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_host_by_error   ;
ERROR 1317 (70100): Query execution was interrupted
root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_by_thread_by_error   ;
+----------+
| count(*) |
+----------+
|  1711016 |
+----------+
1 row in set (0.29 sec)

root@busymaster [performance_schema]> select /*+ MAX_EXECUTION_TIME(10000) */ count(*) from events_errors_summary_global_by_error   ;
+----------+
| count(*) |
+----------+
|     5035 |
+----------+
1 row in set (0.63 sec)

So clearly there's quite a bit of information there. Maybe good for me to investigate further but polling this (periodically) as things stand looks rather dangerous as I'm unsure on the impact it may trigger on current users of the system.  Optimising retrieval speeds of this data would be good.

For some context of this system it's doing about 30k qps of which ~500 are DML changes/sec. 900 connected users and 100 active threads.  The system is clearly busy but not overloaded in any way.
[30 Mar 2022 13:47] Marc ALFF
Hi Simon,

Thanks for sharing some details.

The performance issue seen with error summaries
relates to this bug:
  Bug#89379 summary_by_user tables get extremely slow with many users.

Please see additional comments there.

Regards,
-- Marc