Bug #92975 SQL thread stuck in init state waiting for asynchronous rollback to complete
Submitted: 27 Oct 2018 1:04 Modified: 22 Dec 2018 13:22
Reporter: Jaime Lichauco Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.19, 5.7.21 OS:Any
Assigned to: Bogdan Kecman CPU Architecture:Any
Tags: asynchronous, init, replication, rollback

[27 Oct 2018 1:04] Jaime Lichauco
Description:
We have seen recent occurrences of replication hanging indefinitely on slaves running MySQL 5.7.19 and 5.7.21. The process list shows the system user thread with command Connect and state init.

In this case we have automation that attempts to run STOP SLAVE if it detects a possible replication stall. Unfortunately attempts at running STOP SLAVE are blocked leading to a pile-up of STOP SLAVE requests. Running pt-pmp against the database shows the following stack trace for the SQL thread.

nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:279),wait(trx0trx.h:1557),enter(trx0trx.h:1557),TrxInInnoDB(trx0trx.h:1557),ha_innobase::store_lock(trx0trx.h:1557),ha_innopart::store_lock(ha_innopart.cc:4142),get_lock_data(lock.cc:743),mysql_lock_tables(lock.cc:743),lock_tables(sql_base.cc:6632),mysql_update(sql_update.cc:366),Sql_cmd_update::try_single_table_update(sql_update.cc:2881),Sql_cmd_update::execute(sql_update.cc:3008),mysql_execute_command(sql_parse.cc:4969),mysql_parse(sql_parse.cc:5723),Query_log_event::do_apply_event(log_event.cc:4711),Log_event::apply_event(log_event.cc:3319),apply_event_and_update_pos(rpl_slave.cc:4713),exec_relay_log_event(rpl_slave.cc:5240),handle_slave_sql(rpl_slave.cc:5240),pfs_spawn_thread(pfs.cc:2190),start_thread(libpthread.so.0),clone(libc.so.6)

From reviewing the InnoDB code it appears the SQL thread is in waiting for an asynchronous rollback to complete.

        /*
        @return true if it is a forced rollback, asynchronously */
        static bool is_forced_rollback(const trx_t* trx)
        {
                ut_ad(trx_mutex_own(trx));

                return((trx->in_innodb & TRX_FORCE_ROLLBACK)) > 0;
        }

        /**
        Wait for the asynchronous rollback to complete, if it is in progress */
        static void wait(trx_t* trx)
        {
                ut_ad(trx_mutex_own(trx));

                ulint   loop_count = 0;
                /* start with optimistic sleep time - 20 micro seconds. */
                ulint   sleep_time = 20;

                while (is_forced_rollback(trx)) {

                        /* Wait for the async rollback to complete */

                        trx_mutex_exit(trx);

                        loop_count++;
                        /* If the wait is long, don't hog the cpu. */
                        if (loop_count < 100) {
                                /* 20 microseconds */
                                sleep_time = 20;
                        } else if (loop_count < 1000) {
                                /* 1 millisecond */
                                sleep_time = 1000;
                        } else {
                                /* 100 milliseconds */
                                sleep_time = 100000;
                        }

                        os_thread_sleep(sleep_time);

                        trx_mutex_enter(trx);
                }
        }

However checking other MySQL and InnoDB metrics (e.g. SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX) show no other transaction in a rollback state.

Is it possible that some other thread incorrectly modified trx->in_innodb for the SQL thread?

How to repeat:
To be determined.

Suggested fix:
Restart mysqld
[7 Nov 2018 12:25] sudheer gadipathi
We are facing this issue. Is there any update from the MySQL team?
[22 Nov 2018 13:22] Bogdan Kecman
Hi,

I cannot reproduce this, not with 5.7.24 nor with 8.0.13!

Can you upload - from the affected system, next time it happens:
 - both config files
 - both log files
 - binlog
 - relay log
 - show master status (from master)
 - show slave status (from slave)

thanks
Bogdan
[23 Dec 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[28 Feb 8:26] Zhenghu Wen
We are facing this issue.in 5.7.20

Linux rdsrds-4-mobi-check-online1524624309467 3.2.73 #1 SMP Tue Feb 16 16:06:26 CST 2016 x86_64 GNU/Linux

Thread 15 (Thread 0x7f205b3a0700 (LWP 23175)):
#0  0x00007f22857dc91d in nanosleep () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000001158090 in os_thread_sleep (tm=<optimized out>) at /home/hzwenzhh/mysql-5.7.20-v1a/storage/innobase/os/os0thread.cc:279
#2  0x00000000010ec430 in TrxInInnoDB::wait (trx=trx@entry=0x7f227a999b10) at /home/hzwenzhh/mysql-5.7.20-v1a/storage/innobase/include/trx0trx.h:1557
#3  0x00000000010dfac3 in TrxInInnoDB::enter (trx=0x7f227a999b10, disable=<optimized out>) at /home/hzwenzhh/mysql-5.7.20-v1a/storage/innobase/include/trx0trx.h:1477
#4  0x00000000010e37d0 in innobase_xa_prepare (hton=0x3a8bf40, thd=0x1188ba10, prepare_trx=true) at /home/hzwenzhh/mysql-5.7.20-v1a/storage/innobase/handler/ha_innodb.cc:17032
#5  0x000000000098f7e2 in ha_prepare (thd=0x1188ba10) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/handler.cc:1468
#6  0x0000000000e8f79c in Sql_cmd_xa_prepare::trans_xa_prepare (this=<optimized out>, thd=thd@entry=0x1188ba10) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/xa.cc:794
#7  0x0000000000e90448 in Sql_cmd_xa_prepare::execute (this=<optimized out>, thd=0x1188ba10) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/xa.cc:835
#8  0x0000000000f6c47e in XA_prepare_log_event::do_commit (this=0x7f20449be090, thd=0x1188ba10) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/log_event.cc:7773
#9  0x0000000000f68f73 in Xid_apply_log_event::do_apply_event_worker (this=0x7f20449be178, w=0x14b04820) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/log_event.cc:7451
#10 0x0000000000fc772b in slave_worker_exec_job_group (worker=0x14b04820, rli=0x118924d0) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/rpl_rli_pdb.cc:2594
#11 0x0000000000faccdb in handle_slave_worker (arg=arg@entry=0x14b04820) at /home/hzwenzhh/mysql-5.7.20-v1a/sql/rpl_slave.cc:6143
#12 0x00000000010591b1 in pfs_spawn_thread (arg=0x11df68b0) at /home/hzwenzhh/mysql-5.7.20-v1a/storage/perfschema/pfs.cc:2190
#13 0x00007f22857d4b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007f228453afbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x0000000000000000 in ?? ()