Bug #97523 Infinite loop on query end re TrxInInnoDB::wait()
Submitted: 6 Nov 2019 21:02 Modified: 7 Dec 2019 16:55
Reporter: Trey Raymond Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.7.26 OS:Any
Assigned to: CPU Architecture:Any
Tags: hang, loop, mutex

[6 Nov 2019 21:02] Trey Raymond
Description:
This thread is stuck in an unchanging state of 'closing tables' for almost a day:

      Command: Query
         Time: 75777
        State: closing tables
         Info: select straight_join
                        concat("insert into ",if(p.database='',"db_tmp_all","db_tmp")," (host,db,user,",group_concat(a.column separator ','),") VALUES ('",var_host,"','",substitute_database_name(p.database),"','",u.username,"',",group_concat("'Y'" separator ','),");")
                    from users u
                        join user_privileges_database p on p.user_id=u.id
                        join available_privileges_database a on p.privilege=a.privilege
                    where u.id=var_user
                        and find_in_set(var_version,a.version)>0
                        and substitute_database_name(p.database) is not null
                    group by p.database
    Rows_sent: 46926
Rows_examined: 181625

From perf schema, it's not doing much, but count on wait/synch/mutex/innodb/trx_mutex is rapidly rising.  1257354 as of that 75K seconds.

For context:
- the statement is inside a stored procedure
- the substitute_database_name is an sql function, NO SQL + DETERMINISTIC, just some string operations
- there is no explicit transaction, the statement is AC-NL-RO and should be running in that context

Stack trace for the thread will be pasted below, too long for form.  Some vars:

(gdb) p (trx->in_innodb & TRX_FORCE_ROLLBACK)>0
$12 = true
(gdb) p/x trx->in_innodb
$13 = 0xffffffff
(gdb) p/x TRX_FORCE_ROLLBACK
$14 = 0x80000000

Notes:
- Per mysql, the transaction is not started re innodb status, and last state committed from perf schema, nothing active for this thread.
- The thread is stuck in an infinite loop unlocking and relocking the mutex with a 100ms sleep, here: https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/include/trx0trx.h#L1553-L1...
- It is unkillable in this state, only restarting the server can clear it
- AC-NL-RO statements should never be in this code path at all per https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/trx/trx0trx.cc#L1957
- 'in_innodb' has all 32 bits set to 1 here, which should not be possible.
-- most of them are used as a counter, ref:
--- https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/include/trx0trx.h#L916-L91...
--- https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/include/trx0trx.h#L1496
-- the high bits are flags, ref https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/include/trx0types.h#L63-L7...
- with all bits set to 1, the while condition will always return true, per https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/include/trx0trx.h#L1536-L1...

How to repeat:
This is tough.  It recurs somewhat regularly, and on more than one server, but I can't yet trigger a recurrence.  I need to work around this in production (going to move the loop around that query to an explicit RO trx, should hide the problem) but we've collected a lot of info here.

Suggested fix:
this raises a few points:
- how did an ac-nl-ro trx get to this state in the first place?  something with SP handling different from normal selects?  or issue parsing the function?
- why is trx->in_innodb overflowed?  is it safe to use as a counter+bitmap combo, and if so how could that get so high?
- is the while supposed to be checking bit TRX_FORCE_ROLLBACK, instead of TRX_FORCE_ROLLBACK_ASYNC ?
[6 Nov 2019 21:02] Trey Raymond
Thread 4 (Thread 0x7fde860ff700 (LWP 212096)):
#0  0x00007ffb6799200d in nanosleep () from /lib64/libpthread.so.0
#1  0x00000000010a33f0 in os_thread_sleep (tm=<value optimized out>)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/os/os0thread.cc:303
#2  0x00000000010254d2 in wait (trx=0x7ffa66801a90, disable=<value optimized out>)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/include/trx0trx.h:1815
#3  TrxInInnoDB::enter (trx=0x7ffa66801a90, disable=<value optimized out>)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/include/trx0trx.h:1735
#4  0x000000000101c25a in TrxInInnoDB (this=0x7f85a1cf2230, thd=0x7f85ba070e00, lock_type=2)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/include/trx0trx.h:1620
#5  update_thd (this=0x7f85a1cf2230, thd=0x7f85ba070e00, lock_type=2)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/handler/ha_innodb.cc:3361
#6  ha_innobase::external_lock (this=0x7f85a1cf2230, thd=0x7f85ba070e00, lock_type=2)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/innobase/handler/ha_innodb.cc:17159
#7  0x0000000000806b5a in handler::ha_external_lock (this=0x7f85a1cf2230, thd=0x7f85ba070e00, lock_type=2)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/handler.cc:8444

#8  0x0000000000e1c415 in unlock_external (thd=0x7f85ba070e00, sql_lock=0x7f8585786d90)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/lock.cc:667
#9  mysql_unlock_tables (thd=0x7f85ba070e00, sql_lock=0x7f8585786d90) at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/lock.cc:416
#10 0x0000000000c9408f in close_thread_tables (thd=0x7f85ba070e00) at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_base.cc:1732
#11 0x0000000000ceb428 in mysql_execute_command (thd=0x7f85ba070e00, first_level=false)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_parse.cc:5326
#12 0x0000000000c6a53d in sp_instr_stmt::exec_core (this=<value optimized out>, thd=<value optimized out>, nextp=0x7fde860fd228)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_instr.cc:1057
#13 0x0000000000c6aa6c in sp_lex_instr::reset_lex_and_exec_core (this=0x7f85852fcfd8, thd=0x7f85ba070e00, nextp=0x7fde860fd228, open_tables=false)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_instr.cc:446
#14 0x0000000000c6c544 in sp_lex_instr::validate_lex_and_execute_core (this=0x7f85852fcfd8, thd=0x7f85ba070e00, nextp=0x7fde860fd228, open_tables=false)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_instr.cc:748
#15 0x0000000000c6d2c8 in sp_instr_stmt::execute (this=0x7f85852fcfd8, thd=0x7f85ba070e00, nextp=0x7fde860fd228)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_instr.cc:944
#16 0x0000000000c65925 in sp_head::execute (this=0x7f8583cfb030, thd=0x7f85ba070e00, merge_da_on_success=true)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_head.cc:790
#17 0x0000000000c664fb in sp_head::execute_procedure (this=0x7f8583cfb030, thd=<value optimized out>, args=0x7f85ba0735f8)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sp_head.cc:1526
---Type <return> to continue, or q <return> to quit---
#18 0x0000000000ceccd0 in mysql_execute_command (thd=0x7f85ba070e00, first_level=true)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_parse.cc:4835
#19 0x0000000000cf074d in mysql_parse (thd=0x7f85ba070e00, parser_state=<value optimized out>, update_userstat=false)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_parse.cc:5905
#20 0x0000000000cf1fb8 in dispatch_command (thd=0x7f85ba070e00, com_data=0x7fde860fec30, command=COM_QUERY)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_parse.cc:1532
#21 0x0000000000cf2964 in do_command (thd=0x7f85ba070e00) at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_parse.cc:1053
#22 0x0000000000dcf1a4 in handle_connection (arg=<value optimized out>)
    at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/conn_handler/connection_handler_per_thread.cc:318
#23 0x0000000000f3e761 in pfs_spawn_thread (arg=0x7f87478c1a20) at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/storage/perfschema/pfs.cc:2190
#24 0x00007ffb6798aaa1 in start_thread () from /lib64/libpthread.so.0
#25 0x00007ffb65c81bdd in clone () from /lib64/libc.so.6
[7 Nov 2019 16:55] MySQL Verification Team
Thank you for the bug report.
#10 0x0000000000c9408f in close_thread_tables (thd=0x7f85ba070e00) at /var/builds/workspace/155674-v3-component/BUILD_CONTAINER/rhel6/label/DOCKER-LOW/app_root/x86_64-linux-gcc/percona-server-5.7.26-29/sql/sql_base.cc:1732

This bug system is for MySQL Oracle product, Percona server isn't supported here please use binary/source for MySQL Oracle. Thanks.
[8 Dec 2019 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[27 Jan 2021 5:59] MySQL Verification Team
Please try 5.7.32 where I hope it's fixed.  
See more info on https://bugs.mysql.com/bug.php?id=88757
[27 Jan 2021 9:53] zhai weixiang
I think this is possiblely duplicate of the bug#99643 I reported last year
[27 Jan 2021 10:18] MySQL Verification Team
so this is the fix i hope solved this,  but not sure. (we haven't had any new reports on versions containing this fix).

https://github.com/mysql/mysql-server/search?q=30594501&type=commits