Bug #62401 errors during DDL crash InnoDB
Submitted: 10 Sep 2011 0:44 Modified: 4 Jun 2012 19:02
Reporter: Mark Callaghan
Status: Closed
Category: MySQL Server: InnoDB Plugin storage engine Severity: S2 (Serious)
Version: 5.1.52 OS: Any
Assigned to: Sunny Bains CPU Architecture: Any
Tags: crash, DDL, error, innodb

[10 Sep 2011 0:44] Mark Callaghan
Description:
It is easy to crash InnoDB:
1) run a DDL statement
2) make that statement fail because there are too many (1023) undo slots in use

Given that InnoDB does not use the error injection facility in mtr (mysql-test-run), I will guess that there are no deterministic error injection tests for it. Code inspection shows a few too many assert (ut_a) statements.

The stack for my crash is below. There is no query text because of another bug in 5.1 that prevented the query text from being dumped in most crashes.

0x7be9d2 que_eval_sql + 290
0x7c7aa1 row_merge_drop_index + 113
0x7c7b74 row_merge_drop_indexes + 68
0x786111 _ZN11ha_innobase9add_indexEP8st_tableP6st_keyj + 2689
0x6f13be _Z17mysql_alter_tableP3THDPcS1_P24st_ha_create_informationP10TABLE_LISTP10Alter_infojP8st_orderb + 6078
0x5f5a00 _Z21mysql_execute_commandP3THDPy + 11664
0x5faa23 _Z11mysql_parseP3THDPcjPPKcPy + 803
0x5fbf5c _Z16dispatch_command19enum_server_commandP3THDPcj + 5276
0x5fc5b3 _Z10do_commandP3THD + 275
0x5ebeaa handle_one_connection + 1994
0x375ae062f7 _end + 1465968927
0x375a6d1e3d _end + 1458414693

How to repeat:
I think I hit this assert. Note that when que_eval_sql returns something other than DB_SUCCESS then trx->error_state is assigned that value.

que_eval_sql(
/*=========*/
        pars_info_t*    info,   /*!< in: info struct, or NULL */
        const char*     sql,    /*!< in: SQL string */
        ibool           reserve_dict_mutex,
                                /*!< in: if TRUE, acquire/release
                                dict_sys->mutex around call to pars_sql. */
        trx_t*          trx)    /*!< in: trx */
{
        que_thr_t*      thr;
        que_t*          graph;

        ut_a(trx->error_state == DB_SUCCESS);
...

        return(trx->error_state);
}

row_merge_drop_index has this assert that worries me, but I don't think it caused a crash in this case:

        err = que_eval_sql(info, str1, FALSE, trx);

        ut_a(err == DB_SUCCESS);

What I think happened is that trx->error_state != DB_SUCCESS on entry to the row_merge_drop_indexes call. Then row_merge_drop_indexes calls row_merge_drop_index, which calls que_eval_sql. Since trx->error_state != DB_SUCCESS on entry to que_eval_sql, the assert at the top of that function is raised.
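
The suspected sequence can be modeled outside the server. The sketch below is not InnoDB code: all `mock_` names are hypothetical stand-ins, the error values are arbitrary, and the entry assertion in que_eval_sql is modeled as a boolean result so the violation can be observed without aborting the process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for InnoDB error codes; the values are arbitrary. */
enum mock_err { MOCK_DB_SUCCESS = 0, MOCK_DB_TOO_MANY_TRX = 1 };

/* Hypothetical stand-in for trx_t, reduced to the one field that matters. */
typedef struct {
    enum mock_err error_state;
} mock_trx_t;

/* Models the entry check of que_eval_sql: returns false at the point
 * where the real code would fail ut_a(trx->error_state == DB_SUCCESS). */
static bool mock_que_eval_sql(const mock_trx_t *trx)
{
    assert(trx != NULL);
    return trx->error_state == MOCK_DB_SUCCESS;
}

/* Models row_merge_drop_index: it calls the evaluator without first
 * inspecting or resetting the transaction's error state. */
static bool mock_drop_index(const mock_trx_t *trx)
{
    return mock_que_eval_sql(trx);
}
```

With error_state left at a non-success value, mock_drop_index returns false, which corresponds to the point where the real ut_a would fire. With a clean state it returns true.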

This code in ha_innobase::add_index (the caller of row_merge_drop_indexes in the stack above) calls row_merge_drop_indexes on an error. I assume that trx->error_state can be something other than DB_SUCCESS at that point, which would explain this assert.

                if (!new_primary) {
                        error = row_merge_rename_indexes(trx, indexed_table);

                        if (error != DB_SUCCESS) {
                                row_merge_drop_indexes(trx, indexed_table,
                                                       index, num_created);
                        }

Later in the same function (ha_innobase::add_index), there is code that resets trx->error_state before calling row_merge_drop_indexes:

        default:
                trx->error_state = DB_SUCCESS;

                if (new_primary) {
                        if (indexed_table != innodb_table) {
                                row_merge_drop_table(trx, indexed_table);
                        }
                } else {
                        if (!dict_locked) {
                                row_mysql_lock_data_dictionary(trx);
                                dict_locked = TRUE;
                        }

                        row_merge_drop_indexes(trx, indexed_table,
                                               index, num_created);
                }

Suggested fix:
1) Add error injection tests for InnoDB to mtr
2) Clear trx->error_state in error handlers before calling functions that call que_eval_sql
3) Confirm that it is OK to assert that the return value from que_eval_sql == DB_SUCCESS
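
Suggestion 2 can be sketched as follows. This is a minimal model under stated assumptions, not the actual fix: every `fix_` name is hypothetical, the error values are arbitrary, and the cleanup step reports a violated precondition through its return value instead of asserting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for InnoDB error codes; the values are arbitrary. */
enum fix_err { FIX_DB_SUCCESS = 0, FIX_DB_TOO_MANY_TRX = 1 };

/* Hypothetical stand-in for trx_t. */
typedef struct {
    enum fix_err error_state;
} fix_trx_t;

/* Cleanup step that, like que_eval_sql, needs a clean error state on
 * entry. Returns 0 if the precondition holds and -1 if it does not. */
static int fix_cleanup(const fix_trx_t *trx)
{
    assert(trx != NULL);
    return trx->error_state == FIX_DB_SUCCESS ? 0 : -1;
}

/* Error handler: remember the original error, clear the state, run the
 * cleanup, and hand the original error back to the caller. */
static enum fix_err fix_handle_ddl_error(fix_trx_t *trx, int *cleanup_rc)
{
    enum fix_err original = trx->error_state;

    trx->error_state = FIX_DB_SUCCESS;
    *cleanup_rc = fix_cleanup(trx);

    return original;
}
```

The handler saves the original error before clearing it, so the caller can still report why the DDL statement failed even though the cleanup ran with a clean state.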

Crashes during DDL are very painful because DDL isn't atomic in MySQL, so the MySQL and InnoDB dictionaries can easily get out of sync, which requires manual intervention. They are also painful because DDL operations are frequently very long running, and nobody wants to repeat a long-running DDL statement.
[13 Feb 2012 22:55] John Russell
Added to changelog for 5.1.62, 5.5.22, 5.6.5: 

A DDL operation for an InnoDB table could cause a busy MySQL server
to halt with an assertion error:

InnoDB: Failing assertion: trx->error_state == DB_SUCCESS

The error occurred if the DDL operation was run while all 1023 undo
slots were in use by concurrent transactions. This error was less
likely to occur in MySQL 5.5 and 5.6, because raising the number of
InnoDB undo slots increased the number of simultaneous transactions
(corresponding to the number of undo slots) from 1K to 128K.
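
The 1K and 128K figures follow from simple arithmetic, assuming (per the MySQL 5.5 documentation rather than anything stated in this report) that 5.5 provides 128 rollback segments of 1023 undo slots each, where 5.1 has a single one. The names below are hypothetical and only illustrate the calculation.

```c
#include <assert.h>

/* Assumed figures: 1023 undo slots per rollback segment, one rollback
 * segment in 5.1, and 128 rollback segments in 5.5. */
enum {
    UNDO_SLOTS_PER_RSEG = 1023,
    UNDO_RSEGS_5_1 = 1,
    UNDO_RSEGS_5_5 = 128
};

/* Upper bound on concurrent transactions that each hold one undo slot. */
static unsigned long undo_max_write_trx(unsigned long n_rsegs)
{
    assert(n_rsegs > 0);
    return n_rsegs * UNDO_SLOTS_PER_RSEG;
}
```

Under these assumptions, undo_max_write_trx(UNDO_RSEGS_5_1) is 1023 and undo_max_write_trx(UNDO_RSEGS_5_5) is 130944, which the changelog rounds to 1K and 128K.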
[26 Apr 2012 22:55] Mark Callaghan
I am happy that you backported this to 5.1, but the backport does not include any tests.
[27 Apr 2012 0:05] Sunny Bains
The test exists in the 5.6 branch. 5.1 doesn't have the error injection framework.
[4 Jun 2012 19:02] Mark Callaghan
Added tests that use debug-only my.cnf variables
http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3833
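
For readers unfamiliar with the approach, a debug-only switch for error injection can look roughly like the sketch below. It is not the code from the revision linked above: the flag, the function, and the error names are all hypothetical, and a real server would set the flag from a debug-only configuration variable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical debug-only switch; false in normal operation. */
static bool inject_undo_slot_failure = false;

/* Hypothetical stand-ins for InnoDB error codes; the values are arbitrary. */
enum inj_err { INJ_DB_SUCCESS = 0, INJ_DB_TOO_MANY_TRX = 1 };

/* Models assigning an undo slot. When the switch is on, the call fails
 * deterministically even though free slots remain, so a test can reach
 * the error path without 1023 concurrent transactions. */
static enum inj_err inj_assign_undo_slot(int slots_in_use, int max_slots)
{
    assert(max_slots > 0);

    if (inject_undo_slot_failure) {
        return INJ_DB_TOO_MANY_TRX;
    }

    return slots_in_use < max_slots ? INJ_DB_SUCCESS : INJ_DB_TOO_MANY_TRX;
}
```

A test would turn the switch on, run the DDL statement, and check that the statement fails with an error instead of crashing the server.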