Bug #118705 when call que_eval_sql, may crash.
Submitted: 23 Jul 8:06 Modified: 25 Jul 6:12
Reporter: jie xu Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:8.0 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[23 Jul 8:06] jie xu
Description:
I have added a custom system table inside InnoDB to collect some statistical information. This system table may be modified by multiple threads concurrently. When modifying the data in this table, I used the que_eval_sql interface. During actual production env, we observed that the instance occasionally crashes at the following location.

```que0que.cc

/** Run a query thread. Handles lock waits. */
void que_run_threads(que_thr_t *thr) /*!< in: query thread */
{
  ut_ad(!trx_mutex_own(thr_get_trx(thr)));

loop:
  ut_a(thr_get_trx(thr)->error_state == DB_SUCCESS); //crash here!!!!

  que_run_threads_low(thr);

  switch (thr->state) {
    case QUE_THR_RUNNING:
      /* There probably was a lock wait, but it already ended
      before we came here: continue running thr */

      goto loop;

    case QUE_THR_LOCK_WAIT:
    ...

```

We were able to reproduce this issue and eventually identified the root cause of the crash.

Assume there are two transactions, T1 and T2, modifying the same row concurrently. T1 has already acquired the lock, while T2 is waiting for the lock. It is possible that the following situation occurs:

T2 happens to reach the end of que_run_threads_low. Since T1 holds the lock, T2 is in a lock wait state, and thus the thread's (thr) state is QUE_THR_LOCK_WAIT, and the transaction's (trx) error_state is DB_LOCK_WAIT. At this point, T2 has already released the mutex of this transaction at the end of que_run_threads_low.

Then, T1 starts to commit and finds that there is a transaction (T2) waiting for the lock. It therefore acquires the lock on T2's transaction and enters que_thr_end_lock_wait. Inside this function, since T1’s wait_thr is T2’s thr, it calls que_thr_move_to_run_state, which changes the state of T2’s thr state from QUE_THR_LOCK_WAIT to QUE_THR_RUNNING.

Now, T2's thr is in state QUE_THR_RUNNING, but its trx->error_state is still DB_LOCK_WAIT.

Later, in que_run_threads, the code hits the case QUE_THR_RUNNING branch and executes goto loop;. At this point, the following assertion is triggered:
ut_a(thr_get_trx(thr)->error_state == DB_SUCCESS);

This leads to a crash, because the trx->error_state is still DB_LOCK_WAIT.

How to repeat:
When using the que_eval_sql interface to concurrently modify the same table from multiple threads, the transactions from different threads may need to trigger a lock wait due to concurrent access.

Suggested fix:
In the QUE_THR_RUNNING branch of the que_run_threads function, add a check for the transaction's (trx) error_state. If it is not equal to DB_SUCCESS, explicitly reset it to DB_SUCCESS to prevent assertion failures during thread resumption.
[24 Jul 12:47] MySQL Verification Team
Thank you for the report. What is the exact version you are experiencing this with?
[25 Jul 6:12] jie xu
We are using MySQL 8.0.18, but I have checked the latest code and this issue still exists.