MySQL Bugs: #70308: RACE CONDITION CAN CAUSE MYSQLD TO REMOVE SOCKET FILE ERRANTLY

Bug #70308	RACE CONDITION CAN CAUSE MYSQLD TO REMOVE SOCKET FILE ERRANTLY
Submitted:	12 Sep 2013 6:36	Modified:	14 Jul 2016 20:18
Reporter:	Norvald Ryeng	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Command-line Clients	Severity:	S2 (Serious)
Version:	5.1.69	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
1) Start mysqld_safe (via the standard sysv init script)

A) mysqld/innodb begins a crash recovery or other startup steps, so the process takes X seconds before we actually create the unix socket file and bind to the tcp port. For example:
130507 12:15:38 InnoDB: Starting an apply batch of log records to the
database...
InnoDB: Progress in percents: 34 ... ...

2) While 1A is happening, we make a second call to start mysqld_safe (via the standard init script again)

A) InnoDB tries to get an exclusive RW lock on the ibdata1 file, but it cannot:
InnoDB: Unable to lock ./ibdata1, error: 11
InnoDB: Check that you do not already have another mysqld process
InnoDB: using the same InnoDB data or log files.

B) InnoDB continues to try until eventually giving up.

3) While 2A-2B is running, 1A completes successfully and binds to the tcp port and creates the socket file:
130507 12:15:38 InnoDB: Started; log sequence number 0 3035729
130507 12:15:38 [Note] Event Scheduler: Loaded 0 events
130507 12:15:38 [Note] /opt/mysql/bin/mysqld: ready for
connections.
Version: '5.5.33-enterprise-commercial-pro' socket: '/tmp/mysql.sock' port: 3306 MySQL Enterprise Server - Pro Edition (Commercial

4) Now 2B finishes and InnoDB gives up trying to get the lock. mysqld then also tries to bind to the tcp port and that fails:
130507 13:44:22 [ERROR] Plugin 'InnoDB' init function returned error.
130507 13:44:22 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
130507 13:44:22 [ERROR] Can't start server: Bind on TCP/IP port: Address already in use
130507 13:44:22 [ERROR] Do you already have another mysqld server running on port: 3306 ?

5) Now the problematic piece:

A) mysqld (the second instance) now does an abort and shutdown:
130507 13:44:22 [ERROR] Aborting
130507 13:44:22 [Note] /usr/local/mysql/bin/mysqld: Shutdown complete

B) As part of 5A, it seems to be making the shutdown() and closesocket() calls. This then *removes* the socket file (/tmp/mysql.sock) that was created by the first instance that successfully started.

So in the end, the first instance is running, but the socket file is gone. The underlying issue being that we don't check to see whether our PID was the one that created the pid file before removing it, thus allowing for the possibility of stomping on another instance.

How to repeat:
1)Start first instance: ./bin/mysqld_safe --no-defaults &
# without any delay
2)Start second instance: ./bin/mysqld_safe --no-defaults &

Suggested fix:
We need to alter the startup and shutdown process to accommodate for this potential scenario. Perhaps it's simply a matter of NOT doing an abort and shutdown if InnoDB fails to start, or if the TCP bind fails.
Or more generally, we just need to ensure that it was *our* PID that created the pid file, if not, then don't remove it.

This bug did not sync properly, but it was fixed as of 5.7.2. Here is the latest comment re: this bug:

"
Per email discussion with Srini, updated changelog entry to read:

        A race condition in the server could cause problems with the
        mysqld process ID file when startup was aborted. As part of the
        fix for this issue, mysqld_safe now creates its own PID file
        mysqld_safe.pid in the server's data directory.

Also noted behaviour change in the Manual's description of mysqld_safe.

No status change--bug was already closed.
"