Bug #108769 unexpected epoll messages
Submitted: 13 Oct 2022 23:41 Modified: 14 Oct 2022 15:45
Reporter: Marcos Albe (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Router Severity:S3 (Non-critical)
Version:8.0.29 OS:Linux
Assigned to: CPU Architecture:x86

[13 Oct 2022 23:41] Marcos Albe
Description:
Log is full of these messages:
```
Jun 26 10:01:59 hostname mysqlrouter[3179808]: after_event_fired(1158, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 10:10:19 hostname mysqlrouter[3179808]: after_event_fired(1106, 00000000000000000000000000011000) not in after_event_fired(after_event_fired(11000000000000000000000000000000
Jun 26 10:10:19 hostname mysqlrouter[3179808]: after_event_fired(1106, 00000000000000000000000000011000) generic:33 1152Numerical argument out of domain
Jun 26 10:10:19 hostname mysqlrouter[3179808]: , 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 10:10:19 hostname mysqlrouter[3179808]: after_event_fired(1152, 00000000000000000000000000011000) 1149, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 10:10:19 hostname mysqlrouter[3179808]: generic:33 Numerical argument out of domain
Jun 26 10:10:19 hostname mysqlrouter[3179808]: after_event_fired(1149, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 10:55:06 hostname mysqlrouter[3179808]: after_event_fired(1122, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 10:55:06 hostname mysqlrouter[3179808]: after_event_fired(1122, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 11:16:11 hostname mysqlrouter[3179808]: after_event_fired(1102, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 11:16:11 hostname mysqlrouter[3179808]: after_event_fired(1102, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 11:55:51 hostname mysqlrouter[3179808]: after_event_fired(1133, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 11:55:51 hostname mysqlrouter[3179808]: after_event_fired(1133, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 12:53:45 hostname mysqlrouter[3179808]: after_event_fired(1116, after_event_fired(00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 12:53:45 hostname mysqlrouter[3179808]: after_event_fired(1116, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 12:53:45 hostname mysqlrouter[3179808]: 1010, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 12:53:45 hostname mysqlrouter[3179808]: after_event_fired(1010, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 13:20:10 hostname mysqlrouter[3179808]: after_event_fired(1162, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 13:20:10 hostname mysqlrouter[3179808]: after_event_fired(1162, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 13:26:27 hostname mysqlrouter[3179808]: after_event_fired(1134, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 13:26:27 hostname mysqlrouter[3179808]: after_event_fired(1134, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 14:21:20 hostname mysqlrouter[3179808]: after_event_fired(1136, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 14:21:20 hostname mysqlrouter[3179808]: after_event_fired(1136, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 15:58:20 hostname mysqlrouter[3179808]: after_event_fired(1154, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 15:58:20 hostname mysqlrouter[3179808]: after_event_fired(1154, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 16:12:57 hostname mysqlrouter[3179808]: after_event_fired(1139, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 16:12:57 hostname mysqlrouter[3179808]: after_event_fired(1139, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 16:37:23 hostname mysqlrouter[3179808]: after_event_fired(1096, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 16:37:23 hostname mysqlrouter[3179808]: after_event_fired(1096, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 22:40:38 hostname mysqlrouter[3179808]: after_event_fired(1218, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 22:40:38 hostname mysqlrouter[3179808]: after_event_fired(1218, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
Jun 26 23:00:22 hostname mysqlrouter[3179808]: after_event_fired(1200, 00000000000000000000000000011000) not in 11000000000000000000000000000000
Jun 26 23:00:22 hostname mysqlrouter[3179808]: after_event_fired(1200, 00000000000000000000000000011000) generic:33 Numerical argument out of domain
```
The epoll event mask is always the same in all error messages, only file descriptor number differs:
```
Sep 24 12:25:24 hostname mysqlrouter[3690615]: after_event_fired(285, 00000000000000000000000000011000) not in 11000000000000000000000000000000
```
 
Let's interpret both numbers (mask expected, event received):
```
$ cat test.cc ; g++ test.cc -o test && ./test
#include <iostream>
#include <bitset>
#include <sys/epoll.h>
int main() {
 std::cout
  << std::bitset<32>(EPOLLIN) << "	EPOLLIN
"
  << std::bitset<32>(EPOLLOUT) << "	EPOLLOUT
"
  << std::bitset<32>(EPOLLERR) << "	EPOLLERR
"
  << std::bitset<32>(EPOLLONESHOT) << "	EPOLLONESHOT
"
  << std::bitset<32>(EPOLLHUP) << "	EPOLLHUP
"
  << std::bitset<32>(EPOLLET) << "	EPOLLET
"
  << "00000000000000000000000000011000	Received from epoll
"
  << "11000000000000000000000000000000	Mask set for this particular file descriptor"
  << std::endl;
 return 0;
}
00000000000000000000000000000001 EPOLLIN
00000000000000000000000000000100 EPOLLOUT
00000000000000000000000000001000 EPOLLERR
01000000000000000000000000000000 EPOLLONESHOT
00000000000000000000000000010000 EPOLLHUP
10000000000000000000000000000000 EPOLLET
00000000000000000000000000011000 Received from epoll
11000000000000000000000000000000 Mask set for this particular file descriptor
```
Mask expected for the event is: EPOLLONESHOT combined with EPOLLET.
Technically, mysqlrouter disabled all events for this particular file descriptor.
What event kernel sends to mysqlrouter?  It sends EPOLLERR with EPOLLHUP
In https://elixir.bootlin.com/linux/v4.18/source/net/ipv4/tcp.c#L524 we can see
```
if (sk->sk_shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
  mask |= EPOLLHUP;
 /* This barrier is coupled with smp_wmb() in tcp_reset() */
 smp_rmb();
 if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
  mask |= EPOLLERR;
```
MySQL router expects the descriptor to have mask ONESHOT and ET; In https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.29/router/src/harness/tests/t... we can see:
```
  SCOPED_TRACE("// check fd-interest after remove");
  interest_res = io_svc.interest(fds.first);
  ASSERT_TRUE(interest_res);
  EXPECT_EQ(interest_res.value(), EPOLLET | EPOLLONESHOT);
```
Thus the problematic scenario is:
- The socket is closed
- mysqlrouter removes "interest" for fd
- kernel notifies (probably for the second time, regardless to ONESHOT and ET) 
The whole situation could be related to https://lwn.net/Articles/864947/
Credit of analysis goes to Nickolay Ihalainen.

How to repeat:
Unknown; Happens in customer environment and we are unable to record strace or equivalent for long enough to capture the epoll sequence of events

Suggested fix:
- Handle the unexpected kernel reply
- Make those lines debug only (they don't seem to add any valuable information for users, and the issue should be innocuous)
[14 Oct 2022 6:32] MySQL Verification Team
Hello Marcos,

Thank you for the report and feedback.

Thanks,
Umesh
[5 Dec 2022 13:43] Jan Kneschke
Posted by developer:
 
Thanks for reporting and analyzing the bug.

The bug is a duplicate of Bug#33674644