Bug #81838 Socket lockfile is not cleaned up if write fails, leading to crashlooping.
Submitted: 13 Jun 2016 21:14
Modified: 2 Sep 2016 1:54
Reporter: David Gow
Status: Closed
Category: MySQL Server: Locking
Severity: S3 (Non-critical)
Version: 5.7.11, 5.7.13
OS: Ubuntu (Kernel 3.13.0-87-generic)
Assigned to:
CPU Architecture: Any

[13 Jun 2016 21:14] David Gow
Description:
If writing the pid to the mysql.sock.lock file fails (because, for example, the disk is full), the empty file remains. This prevents the server from starting up, because an empty (or otherwise invalid) lockfile is treated the same as a lockfile with a currently running pid: the server does not delete it, and shuts down.

Should the lockfile not be deleted if it cannot be correctly written to? This would prevent the server from crashlooping even after the issue which prevented the file from being written in the first place was fixed.

How to repeat:
Start the server with the unix socket path on a disk with no free space.
Note that a file (probably called mysql.sock.lock) exists, but is empty.

Note also that, even if space is freed on the disk in question, if this lockfile remains, the server will not start up.

Suggested fix:
unlink(2) the lockfile in Unix_socket::create_lockfile() (sql/conn_handler/socket_connection.cc) if the write() call fails (the error message here being "Could not write unix socket lock file %s errno %d.").
It'd probably be a good idea to similarly unlink(2) the file if fsync() or close() fails.
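
To illustrate the suggestion, here is a minimal sketch of the cleanup-on-failure pattern. It is not the actual body of Unix_socket::create_lockfile(); the function name create_lockfile_sketch, the 0600 mode, and the "pid plus newline" file format are assumptions made for the example. The only point it shows is that every failure after a successful open() leads to unlink(2).

```cpp
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#include <cstdio>
#include <string>

// Hypothetical helper (not MySQL code): create the lock file exclusively,
// write the pid into it, and remove the file again if any step fails.
// Returns true only if the pid was fully written, synced and closed.
bool create_lockfile_sketch(const std::string &path, pid_t pid) {
  // O_EXCL makes open() fail if the file already exists. In that case the
  // file is not ours, so it must not be unlinked here.
  int fd = open(path.c_str(), O_RDWR | O_CREAT | O_EXCL, 0600);
  if (fd < 0) return false;

  char buf[32];
  int len = std::snprintf(buf, sizeof(buf), "%d\n", static_cast<int>(pid));

  // A short write (for example on a full disk) counts as a failure.
  bool ok = len > 0 &&
            write(fd, buf, static_cast<size_t>(len)) ==
                static_cast<ssize_t>(len);
  if (ok && fsync(fd) != 0) ok = false;
  if (close(fd) != 0) ok = false;

  // The step this report asks for: do not leave an empty or partial
  // lock file behind when it could not be written correctly.
  if (!ok) unlink(path.c_str());
  return ok;
}
```

With this shape, a full disk at startup leaves no lockfile behind, so the next start attempt is not blocked by a stale empty file once space has been freed.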

It may also be worth having the server delete lockfiles and retry rather than failing if read() fails or the file is empty, but this is more subjective.
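
A rough sketch of what that more lenient check could look like, again with hypothetical names (lockfile_is_stale is not a MySQL function) and assuming the lockfile holds a decimal pid on its first line. An empty or unparsable file is reported as stale, and a parsable pid is probed with kill(pid, 0), which sends no signal and only tests whether the process exists.

```cpp
#include <signal.h>
#include <sys/types.h>

#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <fstream>
#include <limits>
#include <string>

// Hypothetical helper (not MySQL code): returns true if the lock file
// exists but cannot belong to a running process, so a caller could
// delete it and retry instead of refusing to start.
bool lockfile_is_stale(const std::string &path) {
  std::ifstream in(path);
  if (!in) return false;  // missing or unreadable: nothing to judge here

  std::string text;
  std::getline(in, text);
  if (text.empty()) return true;  // the empty-file case from this report

  errno = 0;
  char *end = nullptr;
  long pid = std::strtol(text.c_str(), &end, 10);
  bool trailing_junk =
      *end != '\0' && !std::isspace(static_cast<unsigned char>(*end));
  if (errno != 0 || end == text.c_str() || trailing_junk || pid <= 0 ||
      pid > std::numeric_limits<pid_t>::max()) {
    return true;  // content is not a usable pid
  }

  // Signal 0 performs only the existence and permission checks.
  if (kill(static_cast<pid_t>(pid), 0) == 0) return false;  // process alive
  // EPERM means the process exists but belongs to another user, so only
  // ESRCH ("no such process") is treated as stale.
  return errno == ESRCH;
}
```

A caller would unlink(2) the file and retry creation only when this returns true. Pid reuse can still make a dead server's lockfile look live, which is part of why this behaviour is a judgment call.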
[14 Jun 2016 8:17] MySQL Verification Team
Hello David Gow,

Thank you for the report and feedback!

Thanks,
Umesh
[2 Sep 2016 1:54] Paul DuBois
Posted by developer:
 
Noted in 8.0.1 changelog.

During startup, the server creates a lock file for the Unix socket
file (for example, mysql.sock.lock as a lock file for mysql.sock). If
the server failed to write the process ID to the lock file, it failed
to remove that file, which could cause subsequent server startups to
fail until the file was removed manually.