Bug #78277 InnoDB deadlock, thread stuck on kernel calls from transparent page compression
Submitted: 30 Aug 2015 16:57 Modified: 22 Oct 2015 10:53
Reporter: Mark Callaghan Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.7.8 OS:CentOS
Assigned to: CPU Architecture:Any

[30 Aug 2015 16:57] Mark Callaghan
Description:
I am using a host with CentOS 6.6, XFS and kernel 3.10.53 enhanced with patches by my employer. I ran linkbench long enough to load 85G of data with InnoDB transparent page compression. Prior to that load I used it for a few days to do many iterations of loading ~20G of data.  The 20G loads succeeded. The 85G load should really have been ~400G load, but things went bad. 

1) vmstat shows swap space is being used. I have tests running for the same workload using MySql 5.6.26 and 5.7.8. I only see swap space in use for 5.7.8 with transparent page compression. I wonder if this is because of the extra memory required for filesystem metadata when hole-punch is used after almost every page write.

2) There are many errors listed in dmesg from XFS. These only occur on the host using transparent page compression:
[  +1.999379] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

3) There were long lock waits reported for InnoDB threads. This is probably caused by #4 below.

4) mysqld got stuck during the 85G load. It took ~15 minutes to attach gdb and what I see is many thread stuck on calls into the kernel including one thread stuck on a call to fallocate64 called from os_file_punch_hole.

At this point I will reboot the server, drop all files and restart the test.

I will liveblog my fun at http://smalldatum.blogspot.com/2015/08/first-day-with-innodb-transparent-page.html

How to repeat:
Using linkbench with XFS and InnoDB transparent page compression.

Suggested fix:
Don't use transparent page compression.
[30 Aug 2015 17:45] Mark Callaghan
I have two servers with CentOS 6.6 and Linux kernel 3.10.53. Both had InnoDB deadlocks from long lock waits. One didn't get stuck so the background InnoDB thread killed the server and I didn't get thread stacks. That was during a DROP DATABASE command and the InnoDB dictionary and the filesystem were out of sync after recovery -- some ibd files were missing.
[31 Aug 2015 1:46] Mark Callaghan
Reboot #2 has arrived for the host with CentOS 6.6 and kernel 3.10.53. Did kill -9 on mysqld and the process is still there in state Z after 40 minutes. Delete the database directory but space is still in use according to df. I assume that space won't be released until the zombie process goes away. I give up on trying to use transparent compression on this host.
[4 Sep 2015 0:50] Mark Callaghan
On the host for which the feature seemed to work there are still many problems.
http://smalldatum.blogspot.com/2015/09/second-day-with-innodb-transparent-page.html
[8 Sep 2015 18:16] Mark Callaghan
A test where DROP TABLE is 43X slower with transparent page compression
http://smalldatum.blogspot.com/2015/09/third-day-with-innodb-transparent-page.html
[1 Oct 2015 15:13] Mark Callaghan
For the setup described in http://bugs.mysql.com/bug.php?id=78672

I use recent ubuntu (14.04) with XFS and DROP DATABASE takes 247 seconds after running linkbench for ~14 hours. The database is 3 non-partitioned tables with ~10G of data. Had I run this for a larger database there would have been a long lock wait and InnoDB would have killed mysqld.

For ext4 with old-style compression and a 14G database it take 1 second for DROP DATABASE.

I am trying to get a similar test for ext4 but bug 78672 gets in the way.
[1 Oct 2015 16:02] Mark Callaghan
For ext4 I did DROP TABLE immediately after the load step. So this wasn't subject to 14 hours of page writes with holepunch per write. The database was ~11G before the drop and the drop took 175 seconds. This is all because unlink() is slow after holepunch. The problem reproduces on very system I have tested, and the response I get after each discussion is that it works, just on something I have yet to try. Maybe ReiserFS?
[2 Oct 2015 2:53] Mark Callaghan
When unlink() during DROP DATABASE is slow then the InnoDB dict mutex is locked and other InnoDB DDL is blocked for however many minutes the unlink takes. Here is PMP output showing where the threads are stuck.

unlink,os_file_delete_func,pfs_os_file_delete_func,fil_delete_tablespace,row_drop_single_table_tablespace,row_drop_table_for_mysql,ha_innobase::delete_table,ha_delete_table,mysql_rm_table_no_locks,mysql_rm_db,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone

pthread_cond_wait@@GLIBC_2.3.2,wait,os_event::wait_low,os_event_wait_low,sync_array_wait_event,wait,spin_and_wait,enter,PolicyMutex<TTASEventMutex<GenericPolicy>,dict_table_open_on_name,create_table_info_t::create_table_update_dict,ha_innobase::create,ha_create_table,rea_create_table,create_table_impl,mysql_create_table_no_lock,mysql_create_table,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone
[2 Oct 2015 2:58] Mark Callaghan
And inserts into existing InnoDB tables are blocked for the duration of the unlink() call. 
   unlink,os_file_delete_func,pfs_os_file_delete_func,fil_delete_tablespace,row_drop_single_table_tablespace,row_drop_table_for_mysql,ha_innobase::delete_table,ha_delete_table,mysql_rm_table_no_locks,mysql_rm_db,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,c

pthread_cond_wait@@GLIBC_2.3.2,wait,os_event::wait_low,os_event_wait_low,sync_array_wait_event,wait,spin_and_wait,enter,PolicyMutex<TTASEventMutex<GenericPolicy>,dict_table_open_on_name,ha_innobase::open,handler::ha_open,open_table_from_share,open_table,open_and_process_table,open_tables,open_tables_for_query,Sql_cmd_insert::mysql_insert,Sql_cmd_insert::execute,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_connection,pfs_spawn_thread,start_thread,clone
[2 Oct 2015 15:02] Mark Callaghan
More fun with a database that was ~50GB on disk. Slow DROP TABLE and slow DROP DATABASE are not good.

time bin/mysql -uroot -ppw -e 'drop database linkdb'

real    14m23.933s
...

--- And then I repeated the test and attempted an insert into an innodb table in a database that wasn't being dropped after the slow drop table started ...

2015-10-02T14:58:49.198385Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
[2 Oct 2015 15:25] Mark Callaghan
Rebooted my host with XFS after the test to reclaim kernel memory from XFS metadata and the filesystem cam back corrupt. Maybe too much holepunching.
[8 Oct 2015 21:42] Sunny Bains
I did some testing on 5.7.9. and could not reproduce the problem. I also compared the performance of old compression, transparent page compression and no compression on  XFS

v5.7.8 calls punch hole on every write and that could be one source of this problem:

Configuration:

Linux supra06 3.8.13-98.2.2.el7uek.x86_64 #2 SMP Tue Sep 15 16:23:09 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux

Same config as mentioned here:  http://mysqlserverteam.com/innodb-transparent-page-compression/

The test is a short one, will run a longer test, probably 24hrs over the weekend.

linkbench -D requesters=64 -Dmaxid1=1000000001 -c config/MyConfig.properties -D requests=500000000 -D maxtime=1800 -r

Without compression:

INFO 2015-10-08 11:12:43,451 [main]: REQUEST PHASE COMPLETED. 125664086 requests done in 1802 seconds. Requests/second = 69727

mysql> select name, file_size/(1024*1024) as size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+------------+
| name              | size_in_mb |
+-------------------+------------+
| linkdb/linktable  | 88860.0000 |
| linkdb/counttable |  7528.0000 |
| linkdb/nodetable  | 32208.0000 |
+-------------------+------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;                                                   
Query OK, 3 rows affected (8.00 sec)

With transparent compression:

INFO 2015-10-08 12:48:17,163 [main]: REQUEST PHASE COMPLETED. 159619892 requests done in 1802 seconds. Requests/second = 88561

mysql> select name, ((file_size-allocated_size)*100)/(file_size+1) as compressed_pct, allocated_size/(1024*1024) as allocated_size_in_mb, file_size/(1024*1024) as file_size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+----------------------+-----------------+
| name              | compressed_pct | allocated_size_in_mb | file_size_in_mb |
+-------------------+----------------+----------------------+-----------------+
| linkdb/linktable  |        46.2316 |           48378.6484 |      89976.0000 |
| linkdb/counttable |        40.7257 |            4680.3008 |       7896.0000 |
| linkdb/nodetable  |        21.6917 |           25425.1406 |      32468.0000 |
+-------------------+----------------+----------------------+-----------------+
3 rows in set (0.00 sec)

The Req/s is actually much  better with transparent compression enabled. This matches EXT4 BTW. I haven't had time to figure out exactly why.

However, drop table does take *much* longer on XFS (compressed).

mysql> drop database linkdb;
Query OK, 3 rows affected (3 min 35.49 sec)

As a comparison, the old compression results are:

mysql> select name, file_size/(1024*1024) as size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+------------+
| name              | size_in_mb |
+-------------------+------------+
| linkdb/linktable  | 43480.0000 |
| linkdb/counttable |  4512.0000 |
| linkdb/nodetable  | 20684.0000 |
+-------------------+------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;
Query OK, 3 rows affected (5.12 sec)

INFO 2015-10-08 08:47:37,565 [main]: REQUEST PHASE COMPLETED. 33765045 requests done in 1802 seconds. Requests/second = 18735

Conclusion:
Old compression performance is 4.5x worse than the transparent page compression
Old compression file size only a little bit better than transparent page compression.
Block on drop table needs to be fixed. Probably rename the file and use a background
thread to unlink that file. Which will help the general case too.
A 24hr test will hopefully confirm these results.
[9 Oct 2015 22:04] Mark Callaghan
The primary problem is unlink() taking too long. You reproduced it.
[13 Oct 2015 4:19] Sunny Bains
With a background thread the database drop  time is now down to:

mysql> select name, ((file_size-allocated_size)*100)/file_size as compressed_pct, allocated_size/(1024*1024) as allocated_size_in_mb, file_size/(1024*1024) as file_size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+----------------------+-----------------+
| name              | compressed_pct | allocated_size_in_mb | file_size_in_mb |
+-------------------+----------------+----------------------+-----------------+
| linkdb/linktable  |        45.4606 |           44213.9844 |      81068.0000 |
| linkdb/counttable |        44.3847 |            3105.5586 |       5584.0000 |
| linkdb/nodetable  |        24.4819 |           21320.2734 |      28232.0000 |
+-------------------+----------------+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;                                                   
Query OK, 3 rows affected (29.00 sec)

Next, figure out where the 29 sec overhead is coming from.
[14 Oct 2015 22:30] Sunny Bains
ysql> select name, ((file_size-allocated_size)*100)/file_size as compressed_pct, allocated_size/(1024*1024) as allocated_size_in_mb, file_size/(1024*1024) as file_size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+----------------------+-----------------+
| name              | compressed_pct | allocated_size_in_mb | file_size_in_mb |
+-------------------+----------------+----------------------+-----------------+
| linkdb/linktable  |        45.4656 |           44199.0742 |      81048.0000 |
| linkdb/counttable |        44.3970 |            3107.0977 |       5588.0000 |
| linkdb/nodetable  |        24.4180 |           21401.8047 |      28316.0000 |
+-------------------+----------------+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;
Query OK, 3 rows affected (3.06 sec)

With the background drop table we can side step the issue.
[19 Oct 2015 6:38] Sunny Bains
mysql> select name,((file_size-allocated_size)*100)/file_size as compressed_pct, allocated_size/(1024*1024) as allocated_size_in_mb,file_size/(1024*1024) as file_size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+----------------------+-----------------+
| name              | compressed_pct | allocated_size_in_mb | file_size_in_mb |
+-------------------+----------------+----------------------+-----------------+
| linkdb/linktable  |        45.3182 |           44143.5625 |      80728.0000 |
| linkdb/counttable |        43.9180 |            3109.1875 |       5544.0000 |
| linkdb/nodetable  |        23.6898 |           16400.5977 |      21492.0000 |
+-------------------+----------------+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;
Query OK, 3 rows affected (22.09 sec)

On NVMFS without any background queue. XFS just happens to be really bad at it.
[19 Oct 2015 15:56] Mark Callaghan
Clearly this is a problem with XFS and not that InnoDB is misusing a feature. 

Slow on XFS, slow on ext4, btrfs core team tells me it will be slow there. But we can celebrate that it isn't slow on NVMFS - closed source, not GA, can't even find out where to buy it, not aware of anyone running it.
[19 Oct 2015 22:44] Sunny Bains
I can't comment on how many users NVMFS has :-). But the driver we got we were told is GA and production ready. I have no information beyond that.

The drop time goes something like this:

With punch hole, EXT4 is 2x worse and XFS is 3x worse than EXT4. This suggests that XFS needs  to be at least as fast as EXT4.

EXT4 no compression:

mysql> select name, file_size/(1024*1024) as size_in_mb from information_schema.INNO
+-------------------+------------+
| name              | size_in_mb |
+-------------------+------------+
| linkdb/linktable  | 80716.0000 |
| linkdb/counttable |  5544.0000 |
| linkdb/nodetable  | 27924.0000 |
+-------------------+------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;
Query OK, 3 rows affected (26.13 sec)
mysql> select name, ((file_size-allocated_size)*100)/file_size as compressed_pct, allocated_size/(1024*1024) as size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+------------+
| name              | compressed_pct | size_in_mb |
+-------------------+----------------+------------+
| linkdb/linktable  |        45.3242 | 44149.6367 |
| linkdb/counttable |        43.9315 |  3108.4375 |
| linkdb/nodetable  |        24.7472 | 21004.5703 |
+-------------------+----------------+------------+
3 rows in set (0.00 sec)

mysql> drop database linkdb;
Query OK, 3 rows affected (50.54 sec)
[20 Oct 2015 18:55] Mark Callaghan
Why does XFS need to match ext4 on this? A likely response from the XFS team is that InnoDB is misusing this feature. But it probably is better for you to ask them than to rely on my speculation.
[20 Oct 2015 23:31] Sunny Bains
Because it demonstrates that the performance can be improved.  There is no theoretical limit on this.
[21 Oct 2015 20:39] Sunny Bains
For completeness, this is the latest SanDisk driver on the same HW and host, this driver is not GA.
The results are very good and worth sharing.

mysql> select name,((file_size-(allocated_size - 1024))*100)/file_size as compressed_pct, allocated_size/(1024*1024) as allocated_size_in_mb,file_size/(1024*1024) as file_size_in_mb from information_schema.INNODB_SYS_TABLESPACES WHERE name like 'linkdb%';
+-------------------+----------------+----------------------+-----------------+
| name              | compressed_pct | allocated_size_in_mb | file_size_in_mb |
+-------------------+----------------+----------------------+-----------------+
| linkdb/linktable  |        57.1966 |           40005.7769 |      93464.0000 |
| linkdb/counttable |        58.5372 |            3729.9961 |       8996.0000 |
| linkdb/nodetable  |        34.7377 |           21732.3477 |      33300.0000 |
+-------------------+----------------+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> drop database linked;
Query OK, 3 rows affected (9.78 sec)
[22 Oct 2015 10:53] Mark Callaghan
Great, but the only people who talk about using NVMFS with InnoDB are the MariaDB and MySQL core teams.