MySQL Bugs: #73816: MySQL instance stalling “doing SYNC index”

Bug #73816	MySQL instance stalling “doing SYNC index”
Submitted:	5 Sep 2014 12:56	Modified:	23 Feb 2016 19:28
Reporter:	Denis Jedig	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Locking	Severity:	S2 (Serious)
Version:	5.6.20	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	DML, fulltext, insert, mutex, stall, UPDATE

Description:
In an instance of ours, where we are using FULLTEXT indexes on InnoDB tables, we see occasional blocked DML operations on any table using any engine (InnoDB as well as MyISAM) for the duration of ~2 minutes. The SHOW INNODB STATUS reveals that the oldest running transaction is an INSERT into a table with a FULLTEXT index:

---TRANSACTION 162269409, ACTIVE 122 sec doing SYNC index
6 lock struct(s), heap size 1184, 0 row lock(s), undo log entries 19942
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_1" trx id 162269409 lock mode IX
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_2" trx id 162269409 lock mode IX
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_3" trx id 162269409 lock mode IX
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_4" trx id 162269409 lock mode IX
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_5" trx id 162269409 lock mode IX
TABLE LOCK table "vw"."FTS_000000000000224a_00000000000036b9_INDEX_6" trx id 162269409 lock mode IX
---TRANSACTION 162269408, ACTIVE (PREPARED) 122 sec committing
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s), undo log entries 1
MySQL thread id 165998, OS thread handle 0x7fe0e239c700, query id 91208956 192.168.10.153 root query end
INSERT INTO files (id_files, filename, content) VALUES ('f19e63340fad44841580c0371bc51434', '1237716_File_70380a686effd6b66592bb5eeb3d9b06.doc', '[...]
TABLE LOCK table `vw`.`files` trx id 162269408 lock mode IX

From the logs it seems like the "undo log entries" number for "doing SYNC index" is advancing at ~150/s until it reaches 20,000, at which point the operation is done and the stall is resolved.

The FTS size of this specific table is quite impressive:
# du -c FTS_000000000000224a_00000000000036b9_*
614404  FTS_000000000000224a_00000000000036b9_INDEX_1.ibd
2478084 FTS_000000000000224a_00000000000036b9_INDEX_2.ibd
1576964 FTS_000000000000224a_00000000000036b9_INDEX_3.ibd
1630212 FTS_000000000000224a_00000000000036b9_INDEX_4.ibd
1978372 FTS_000000000000224a_00000000000036b9_INDEX_5.ibd
1159172 FTS_000000000000224a_00000000000036b9_INDEX_6.ibd
9437208 total

although the issue is also triggered by tables with significantly less massive FTS data size like this one:
# du -c FTS_0000000000002467_0000000000003a21_INDEX*
49156   FTS_0000000000002467_0000000000003a21_INDEX_1.ibd
225284  FTS_0000000000002467_0000000000003a21_INDEX_2.ibd
147460  FTS_0000000000002467_0000000000003a21_INDEX_3.ibd
135172  FTS_0000000000002467_0000000000003a21_INDEX_4.ibd
155652  FTS_0000000000002467_0000000000003a21_INDEX_5.ibd
106500  FTS_0000000000002467_0000000000003a21_INDEX_6.ibd
819224  total

The time of the stall in those cases is roughly the same, too. For the duration of the stall, the disk does take some random read as well as write I/O.

How to repeat:
No idea if this is easily reproducible. For details on this particular setup see http://dba.stackexchange.com/questions/75439/mysql-instance-stalling-doing-sync-index

Suggested fix:
It looks like some kind of lengthy index maintenance operation  is running for the FULLTEXT index upon insertion of an entry which keeps holding a lock for a critical mutex, which again is involved in any DML operation.

MyISAM FULLTEXT index inserts seem unaffected, so converting all tables containing larger (>100M) FULLTEXT indexes could be considered a workaround.

This is a rather critical issue. It's causing 2 minute locks on all tables.

I have some further evidence for this. This is happening about once in 24h on our machine. When it happens, we always see the following mutex waits:

--Thread 91109132288 has waited at fts0que.cc line 1287 for 59.00 seconds the semaphore:
X-lock on RW-latch at 0x152695fc68 created in file fts0fts.cc line 645
a writer (thread id 90488269824) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /wrkdirs/usr/ports/databases/mysql56-server/work/mysql-5.6.21/storage/innobase/fts/fts0fts.cc line 4453
--Thread 90943929344 has waited at fts0que.cc line 1287 for 49.00 seconds the semaphore:
X-lock on RW-latch at 0x152695fc68 created in file fts0fts.cc line 645
a writer (thread id 90488269824) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /wrkdirs/usr/ports/databases/mysql56-server/work/mysql-5.6.21/storage/innobase/fts/fts0fts.cc line 4453
--Thread 90487953408 has waited at fts0que.cc line 1287 for 48.00 seconds the semaphore:
X-lock on RW-latch at 0x152695fc68 created in file fts0fts.cc line 645
a writer (thread id 90488269824) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /wrkdirs/usr/ports/databases/mysql56-server/work/mysql-5.6.21/storage/innobase/fts/fts0fts.cc line 4453
...

Hence, it's a &cache->lock wait in fts_query_intersect() function here:

		/* Search the cache for a matching word first. */

		rw_lock_x_lock(&cache->lock);

With our configuration it takes ~70s to complete and while it is running, there's hundreds of write queries waiting in a 'query end' state. Not sure though which mutex they are waiting on as I didn't have a chance to run PMP on it.

And here's the thread doing the sync (and holding the lock):

---TRANSACTION 16597707497, ACTIVE 64 sec doing SYNC index
6 lock struct(s), heap size 1184, 0 row lock(s), undo log entries 18706
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_1" trx id 16597707497 lock mode IX
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_2" trx id 16597707497 lock mode IX
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_3" trx id 16597707497 lock mode IX
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_4" trx id 16597707497 lock mode IX
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_5" trx id 16597707497 lock mode IX
TABLE LOCK table "db"."FTS_0000000000001483_000000000000356e_INDEX_6" trx id 16597707497 lock mode IX

MySQL 5.6.21, running on FreeBSD 10.1-RELEASE with the following innodb ft config:

| innodb_ft_aux_table                      |                                  |
| innodb_ft_cache_size                     | 8000000                          |
| innodb_ft_enable_diag_print              | OFF                              |
| innodb_ft_enable_stopword                | ON                               |
| innodb_ft_max_token_size                 | 84                               |
| innodb_ft_min_token_size                 | 3                                |
| innodb_ft_num_word_optimize              | 2000                             |
| innodb_ft_result_cache_limit             | 2000000000                       |
| innodb_ft_server_stopword_table          |                                  |
| innodb_ft_sort_pll_degree                | 2                                |
| innodb_ft_total_cache_size               | 640000000                        |
| innodb_ft_user_stopword_table            | path/search_stopwords |

I was able to reproduce this by loading a 1GB size FTS indexed table on one thread and running inserts into an unrelated table on the other thread. With replication disabled, unrelated inserts are running just fine. However, when I enable replication, inserts are blocked while the FTS SYNC is in progress. Here's PMP sample from when this is happening. I've used community mysql 5.6.28 for this:

9 blocked threads (generated with mysqlslap --auto-generate-sql-load-type=write):

      9 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1162),Stage_manager::enroll_for(mysql_thread.h:1162),MYSQL_BIN_LOG::change_stage(binlog.cc:6783),MYSQL_BIN_LOG::ordered_commit(binlog.cc:7187),MYSQL_BIN_LOG::commit(binlog.cc:6500),ha_commit_trans(handler.cc:1436),trans_commit(transaction.cc:228),mysql_execute_command(sql_parse.cc:4280),mysql_parse(sql_parse.cc:6386),dispatch_command(sql_parse.cc:1340),do_command(sql_parse.cc:1037),do_handle_one_connection(sql_connect.cc:982),handle_one_connection(sql_connect.cc:898),pfs_spawn_thread(pfs.cc:1860),start_thread,clone

And here's the SYNC thread and FTS table load thread:

      1 __lll_lock_wait,_L_lock_1081(libpthread.so.0),__GI___pthread_mutex_lock,inline_mysql_mutex_lock(mysql_thread.h:688),MYSQL_BIN_LOG::change_stage(mysql_thread.h:688),MYSQL_BIN_LOG::ordered_commit(binlog.cc:7187),MYSQL_BIN_LOG::commit(binlog.cc:6500),ha_commit_trans(handler.cc:1436),trans_commit(transaction.cc:228),mysql_execute_command(sql_parse.cc:4280),mysql_parse(sql_parse.cc:6386),dispatch_command(sql_parse.cc:1340),do_command(sql_parse.cc:1037),do_handle_one_connection(sql_connect.cc:982),handle_one_connection(sql_connect.cc:898),pfs_spawn_thread(pfs.cc:1860),start_thread,clone
      1 fsync,os_file_fsync(os0file.cc:2225),os_file_flush_func(os0file.cc:2225),fil_flush(os0file.ic:352),buf_dblwr_write_single_page(buf0dblwr.cc:1164),buf_flush_write_block_low(buf0flu.cc:959),buf_flush_page(buf0flu.cc:959),buf_flush_single_page_from_LRU(buf0flu.cc:1992),buf_LRU_get_free_block(buf0lru.cc:1367),buf_page_create(buf0buf.cc:3866),fsp_page_create(fsp0fsp.cc:1296),fseg_alloc_free_page_low(fsp0fsp.cc:2570),fseg_alloc_free_page_general(fsp0fsp.cc:2639),btr_page_alloc(btr0btr.cc:1141),btr_page_split_and_insert(btr0btr.cc:1141),btr_cur_pessimistic_insert(btr0cur.cc:1604),row_ins_clust_index_entry_low(row0ins.cc:2492),row_ins_clust_index_entry(row0ins.cc:2890),row_ins(row0ins.cc:2967),row_ins_step(row0ins.cc:2967),que_thr_step(que0que.cc:1059),que_run_threads_low(que0que.cc:1059),que_run_threads(que0que.cc:1059),fts_eval_sql(fts0sql.cc:286),fts_write_node(fts0fts.cc:3877),fts_sync_write_words(fts0fts.cc:4014),fts_sync_index(fts0fts.cc:4014),fts_sync(fts0fts.cc:4014),fts_add_doc_by_id(fts0fts.cc:3565),fts_add(fts0fts.cc:3565),fts_modify(fts0fts.cc:3028),fts_commit_table(fts0fts.cc:3028),fts_commit(fts0fts.cc:3028),trx_commit_low(trx0trx.cc:1352),trx_commit(trx0trx.cc:1416),trx_commit_for_mysql(trx0trx.cc:1416),innobase_commit_low(ha_innodb.cc:3400),innobase_commit(ha_innodb.cc:3400),ha_commit_low(handler.cc:1493),MYSQL_BIN_LOG::process_commit_stage_queue(binlog.cc:6678),MYSQL_BIN_LOG::ordered_commit(binlog.cc:7198),MYSQL_BIN_LOG::commit(binlog.cc:6500),ha_commit_trans(handler.cc:1436),trans_commit_stmt(transaction.cc:434),mysql_execute_command(sql_parse.cc:5037),mysql_parse(sql_parse.cc:6386),dispatch_command(sql_parse.cc:1340),do_command(sql_parse.cc:1037),do_handle_one_connection(sql_connect.cc:982),handle_one_connection(sql_connect.cc:898),pfs_spawn_thread(pfs.cc:1860),start_thread,clone

The simple workaround is to switch off binlog_order_commits:

SET global binlog_order_commits = off;

I'm still not sure how binlog_order_commits works though, so can't guarantee this is safe.

Just to follow-up, I have made the same tests on 5.7.10-log MySQL Community Server (GPL) and got the same results, hence upgrading to MySQL 5.7 isn't going to solve this. Here's backtraces from 5.7.10:

Threads waiting to commit:

      8 pthread_cond_wait,native_cond_wait(thr_cond.h:136),my_cond_wait(thr_cond.h:136),inline_mysql_cond_wait(thr_cond.h:136),Stage_manager::enroll_for(thr_cond.h:136),MYSQL_BIN_LOG::change_stage(binlog.cc:8376),MYSQL_BIN_LOG::ordered_commit(binlog.cc:8764),MYSQL_BIN_LOG::commit(binlog.cc:8061),ha_commit_trans(handler.cc:1689),trans_commit(transaction.cc:239),mysql_execute_command(sql_parse.cc:4206),mysql_parse(sql_parse.cc:5508),dispatch_command(sql_parse.cc:1425),do_command(sql_parse.cc:993),handle_connection(connection_handler_per_thread.cc:301),pfs_spawn_thread(pfs.cc:2192),start_thread,clone

And the FTS loading and SYNC:

      1 __lll_lock_wait,_L_lock_1081(libpthread.so.0),__GI___pthread_mutex_lock,native_mutex_lock(thr_mutex.h:84),my_mutex_lock(thr_mutex.h:84),inline_mysql_mutex_lock(thr_mutex.h:84),MYSQL_BIN_LOG::change_stage(thr_mutex.h:84),MYSQL_BIN_LOG::ordered_commit(binlog.cc:8888),MYSQL_BIN_LOG::commit(binlog.cc:8061),ha_commit_trans(handler.cc:1689),trans_commit(transaction.cc:239),mysql_execute_command(sql_parse.cc:4206),mysql_parse(sql_parse.cc:5508),dispatch_command(sql_parse.cc:1425),do_command(sql_parse.cc:993),handle_connection(connection_handler_per_thread.cc:301),pfs_spawn_thread(pfs.cc:2192),start_thread,clone
      1 fsync,os_file_fsync_posix(os0file.cc:2872),os_file_flush_func(os0file.cc:2872),pfs_os_file_flush_func(os0file.ic:402),fil_flush(os0file.ic:402),buf_dblwr_write_single_page(buf0dblwr.cc:1282),buf_flush_write_block_low(buf0flu.cc:1096),buf_flush_page(buf0flu.cc:1096),buf_flush_single_page_from_LRU(buf0flu.cc:2217),buf_LRU_get_free_block(buf0lru.cc:1401),buf_page_init_for_read(buf0buf.cc:5188),buf_read_page_low(buf0rea.cc:149),buf_read_page(buf0rea.cc:406),buf_page_get_gen(buf0buf.cc:4239),btr_cur_search_to_nth_level(btr0cur.cc:1107),btr_pcur_open_low(btr0pcur.ic:465),row_ins_clust_index_entry_low(btr0pcur.ic:465),row_ins_clust_index_entry(row0ins.cc:3158),row_ins_index_entry(row0ins.cc:3288),row_ins_index_entry_step(row0ins.cc:3288),row_ins(row0ins.cc:3288),row_ins_step(row0ins.cc:3288),que_thr_step(que0que.cc:1031),que_run_threads_low(que0que.cc:1031),que_run_threads(que0que.cc:1031),fts_eval_sql(fts0sql.cc:245),fts_write_node(fts0fts.cc:3958),fts_sync_write_words(fts0fts.cc:4099),fts_sync_index(fts0fts.cc:4099),fts_sync(fts0fts.cc:4534),fts_add_doc_by_id(fts0fts.cc:3644),fts_add(fts0fts.cc:2989),fts_commit_table(fts0fts.cc:2989),fts_commit(fts0fts.cc:2989),trx_commit_low(trx0trx.cc:2113),trx_commit(trx0trx.cc:2205),trx_commit_for_mysql(trx0trx.cc:2424),innobase_commit_low(ha_innodb.cc:3881),innobase_commit(ha_innodb.cc:4040),ha_commit_low(handler.cc:1794),MYSQL_BIN_LOG::process_commit_stage_queue(binlog.cc:8260),MYSQL_BIN_LOG::ordered_commit(binlog.cc:8918),MYSQL_BIN_LOG::commit(binlog.cc:8061),ha_commit_trans(handler.cc:1689),trans_commit_stmt(transaction.cc:458),mysql_execute_command(sql_parse.cc:4946),mysql_parse(sql_parse.cc:5508),dispatch_command(sql_parse.cc:1425),do_command(sql_parse.cc:993),handle_connection(connection_handler_per_thread.cc:301),pfs_spawn_thread(pfs.cc:2192),start_thread,clone

Similarly, disabling binlog_order_commits solves this here too. So looks like they're all waiting on &LOCK_commit otherwise.

take a look at the problem discussed here:
http://www.speedemy.com/mysql-troubleshooting-crash-course/

Posted by developer:
 
There are 2 potential fix of this bug:

1. While running sync, not to hold fts cache lock, as we will eliminate duplicates later anyway. Only when we delete the rbt tree in memory we will hold the lock

2. Do the sync in a background optimizer thread

As so many people seem to be suffering badly (and still actively) from this bug, I would like to repost a workaround found by Daniel Golding and posted at http://dba.stackexchange.com/a/108259/5861:

set global binlog_order_commits = 0;
[...]
Explanation:

InnoDB full-text uses a cache (by default 8M in size) containing changes that need to be applied to the actual full-text index on disk.

Once the cache fills up, a couple of transactions are created to perform the work of merging the data that was contained in the cache - this tends to be a large amount of random IO, so unless your entire full-text index can be loaded into the buffer pool, it's a long and slow transaction.

With binlog_order_commits set to true, all transactions containing inserts and updates, started after the long-running fts_sync_index transaction, must wait until it has completed before they can commit.

This is only an issue if binary logging is enabled

Note that disabling binlog_order_commits will not fix the issue fully. FTS queries against InnoDB tables with FTS index will still block, but at least in this case unrelated queries will not be affected.

I have added a few more details on this in the following post: http://www.speedemy.com/mysql-troubleshooting-crash-course/

@Aurimas Nice. Looks like a fair bunch of work has been done to produce the writeup. 

What irritates me is that it took the MySQL dev team more than 15 months just to acknowledge the issue.

fix

Attachment: bug73816.patch (text/x-patch), 17.84 KiB.

We are working on this fix, please give us your feedback if it solves your problem.

new patch

Attachment: bug73816-v2.patch (text/x-patch), 25.56 KiB.

patch v3

Attachment: bug73816-v3.patch (text/x-patch), 18.14 KiB.

Posted by developer:
 
Fixed as of the upcoming 5.6.30, 5.7.12, 5.8.0 release, and here's the changelog entry:

MySQL stalled when syncing the InnoDB full-text index cache.

Thank you for the bug report.

Bug 22996488 - CRASH IN FTS_SYNC_INDEX WHEN DOING DDL IN A LOOP

the fix here caused a regression that can lead to crashes.
folks should rather wait for 5.6.31.