Description:
# Problem
On a busy MySQL server, if a DDL statement is blocked waiting for a metadata lock, the lock queue glows, causing high CPU load and, in the worst case, mysqld process may hang.
DBAs often mitigate this issue by reducing lock_wait_timeout, but this causes the DDL statement itself to abort, making schema changes difficult to apply reliably.
Operators today work around this through some conbination of:
- Restricting DDL to maintenance windows (not always available)
- Online schema change tools (pt-osc, gh-ost, spirit), which add operational overhead and still have difficulty to execute DDL
- Manually killing individual contending sessions, which is futile when contention is workload-wide rather than caused by a specific session
The common operational reality is that DDL completion is sometimes more
important than uninterrupted session liveness for some production
workloads, but the server provides no option to prioritize DDL execution.
# Proposal
Add a new SESSION + GLOBAL boolean system variable `force_ddl_execution` (default OFF). When ON, if a DDL statement executed in the session hits ER_LOCK_WAIT_TIMEOUT while acquiring an MDL, the server kills all other user sessions (excluding the DDL issuer itself and system threads) and retries the lock acquisition once before reporting the timeout to the client.
The trigger requires three conditions simultaneously:
1. force_ddl_execution = ON in the requesting session,
2. the requesting thread is currently executing a DDL statement
3. MDL_context::acquire_lock returned MDL_wait::TIMEOUT.
# Consideration
This feature is aggressive by design. It kills sessions that the operator did not individually identify as blockers, because under sustained concurrency identification is not possible (MDL fast path does not record per-holder identity).
As a side note, TiDB(MySQL compatible Database) provides an option to prioritize DDL execution by disabling metadata locking with the `tidb_enable_metadata_lock` system variable. When this variable is set to OFF, TiDB does not coordinate DDL with concurrent DML through metadata locks. As a result, DDL can proceed without being blocked by metadata lock conflicts, while concurrent DML transactions that observe an outdated schema may be aborted, for example with an Information schema is changed error.
(https://docs.pingcap.com/tidb/stable/system-variables/#tidb_enable_metadata_lock-new-in-v6...)
How to repeat:
Proposed method works like below.
-- [session 1]
CREATE TABLE t1 (id INT PRIMARY KEY, v INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,10),(2,20),(3,30);
-- [session 2]
BEGIN;
SELECT COUNT(*), SUM(v) FROM t1; -- holds MDL_SHARED_READ
-- leave open
-- [session 1]
SET SESSION force_ddl_execution = ON;
SET SESSION lock_wait_timeout = 1;
ALTER TABLE t1 ADD COLUMN c INT;
-- wait MDL of session 1
-- when timeout occur, proposed function kills other threads
-- [session 2]
-- session is killed
Suggested fix:
(I'll send a patch)