Description:
We have observed periodic crashes (Segmentation fault) on multiple instances of MySQL Router running on a production workload. After enabling and analysing core dumps (see attached file) it appears that the error arises in the connection pooling logic in combination with TLS/SSL handling.
We run MySQL Router in front of MySQL instances running with InnoDB group replication in a Kubernetes environment. We have seen the issue happening with different versions of MySQL Router (9.3.0 as well as 8.0.40).
The configuration used on the MySQL Router instances is the one generated by the bootstrap process; no other config values were modified (file attached for reference).
After reviewing the stack trace and the source files that might be involved, we created additional environments and tried different settings in an attempt to prevent the crash. Setting `DEFAULT.max_idle_server_connections=0` to disable connection pooling seems to have resolved the issue for us. Ideally, we would still like to make use of connection pooling.
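For reference, the workaround corresponds to a fragment like the one below in the Router configuration file. This is a sketch, assuming the option is placed in the `[DEFAULT]` section of the bootstrap-generated config (as the `DEFAULT.` prefix implies); all other generated sections are left unchanged.

```ini
[DEFAULT]
# Workaround: keep no idle server connections in the pool,
# which effectively disables connection pooling.
max_idle_server_connections=0
```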
Based on the observations described above, we strongly believe that this is a bug in the MySQL Router connection pooling logic.
How to repeat:
We have not found a clear, deterministic way to reproduce the issue. We wrote a script that spawns multiple connections in different ways to exercise connection pooling and alternates between connections with SSL enabled and disabled. With this setup, the issue was observed at least once or twice a day. On our production workloads, the issue happened more frequently on instances with higher load.
We have attached the PHP script used for the reproduction. It was deployed as a client pod in our Kubernetes environment. A Docker Hub image is also available as `champgoblem/mysql-segfault:latest`. The script can be started with the following command: `php ./main.php <HOST> 3306 <USER> <PASSWORD> <DB>`.
Suggested fix:
The error likely occurs in connection_pool.h (https://dev.mysql.com/doc/dev/mysql-server/9.3.0/connection__pool_8h_source.html), but we lack the context to properly assess the exact root cause.