Bug #76627 MySQL/InnoDB mix buffered and direct IO
Submitted: 9 Apr 2015 3:44 Modified: 25 May 2015 13:51
Reporter: yin ye Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.5.24, 5.6.24 OS:Linux (CentOS6.5)
Assigned to: CPU Architecture:Any

[9 Apr 2015 3:44] yin ye
Description:
Hi,
I found one performance problem when I run MySQL on XFS. More details please see here(http://oss.sgi.com/pipermail/xfs/2015-April/041290.html).

At last, I found that InnoDB mix buffered and direct IO when set innodb_flush_method=O_DIRECT. This will cause bad performance on XFS.

[pid 13478] open("./test/big_tb.ibd", O_RDONLY) = 37
[pid 13478] pread(37, "W\346\203@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\37c\225\263\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384
[pid 13478] close(37)                   = 0
[pid 13478] open("./test/big_tb.ibd", O_RDWR) = 37
[pid 13478] fcntl(37, F_SETFL, O_RDONLY|O_DIRECT) = 0
[pid 13478] fcntl(37, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid 13478] pread(37, "\350\301\270\271\0\0\0\3\377\377\377\377\377\377\377\377\0\0\0\v\37c\225\263E\277\0\0\0\0\0\0"..., 16384, 49152) = 16384
[pid 13478] pread(37, "e\251|m\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\31\245\243\0\5\0\0\0\0\0\0"..., 16384, 16384) = 16384‍

As we can see, InnoDB will open data file twice when open table. And the first open file without O_DIRECT flag will generate cached pages.

From the open(2)
man page, in the section about O_DIRECT:

	"Applications should avoid mixing O_DIRECT and normal I/O to
	the same file, and especially to overlapping byte regions in
	the  same file.   Even  when  the filesystem correctly
	handles the coherency issues in this situation, overall I/O
	throughput is likely to be slower than using either mode
	alone.  Likewise, applications should avoid mixing mmap(2)
	of files with direct I/O to the same files."

As we can see, InnoDB violates the rules here.

How to repeat:
strace -f -e trace=open,close,read,pread,io_submit,fcntl /usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf ...

Suggested fix:
There are two options:
(1)set O_DIRECT flag at the first open file
(2)invalidate
the cached pages by using POSIX_FADV_DONTNEED before doing direct
IO.
[13 May 2015 13:52] Marko Mäkelä
I suspect that the first open/read/close is because InnoDB is reading the first page of the *.ibd file to determine the space_id. If this is the case, then this whole open/read/close would have been removed in MySQL 5.7.5, which included WL#7142: http://mysqlserverteam.com/innodb-crash-recovery-improvements-in-mysql-5-7/

In any case, the I/O is not overlapping with the main I/O. But it could pollute the file system cache, like you are suggesting.
[19 May 2015 12:23] Umesh Shastry
Hello yin ye,

Thank you for the report.
Observed similar with 5.6.24.

Thank you Marko for the details.

Thanks,
Umesh
[19 May 2015 12:23] Umesh Shastry
// 5.6.24 
[ushastry@ushastry Downloads]$ cat /etc/*release
CentOS Linux release 7.1.1503 (Core) 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.1.1503 (Core) 
CentOS Linux release 7.1.1503 (Core) 

[ushastry@ushastry Downloads]$ df -T /var/lib/mysql/
Filesystem              Type 1K-blocks     Used Available Use% Mounted on
/dev/mapper/centos-root xfs   23065596 10573868  12491728  46% /

// Session 1

mysql> create table t1(id int not null auto_increment primary key);
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values();
Query OK, 1 row affected (0.03 sec)

mysql> insert into t1 values();
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 values();
Query OK, 1 row affected (0.02 sec)

mysql> insert into t1 values();
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 values();
Query OK, 1 row affected (0.02 sec)

mysql> select * from t1;
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
+----+
5 rows in set (0.00 sec)

mysql> \q
Bye
[root@ushastry Downloads]# service mysql restart
Redirecting to /bin/systemctl restart  mysql.service
[root@ushastry Downloads]# mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.24 MySQL Community Server (GPL)

Copyright (c) 2000, 2015, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 
mysql> select * from t1;
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
+----+
5 rows in set (0.00 sec)

mysql> 

// Session 2

[root@ushastry Downloads]# ps aux|grep mysqld
mysql    12744  0.1  0.0 115216  1672 ?        Ss   06:21   0:00 /bin/sh /usr/bin/mysqld_safe
mysql    12919  1.0  3.2 762216 110084 ?       Sl   06:21   0:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
root     12959  0.0  0.0 112640   960 pts/1    R+   06:21   0:00 grep --color=auto mysqld
[root@ushastry Downloads]# strace -f -e trace=open,close,read,pread,io_submit,fcntl -p 12919
Process 12919 attached with 22 threads
[pid 12919] fcntl(12, F_GETFL)          = 0x2 (flags O_RDWR)
[pid 12919] fcntl(12, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 12919] fcntl(12, F_SETFL, O_RDWR)  = 0
[pid 12945] open("./test/db.opt", O_RDONLY) = 32
[pid 12945] read(32, "default-character-set=latin1\ndef"..., 65) = 65
[pid 12945] close(32)                   = 0
[pid 12945] close(32)                   = 0
[pid 12945] open("./mysql/db.opt", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 12945] open("./performance_schema/db.opt", O_RDONLY) = 32
[pid 12945] read(32, "default-character-set=utf8\ndefau"..., 61) = 61
[pid 12945] close(32)                   = 0
[pid 12945] open("/proc/sys/vm/overcommit_memory", O_RDONLY|O_CLOEXEC) = 32
[pid 12945] read(32, "0", 1)            = 1
[pid 12945] close(32)                   = 0
[pid 12945] close(32)                   = 0
[pid 12945] open("./test/t1.frm", O_RDONLY) = 32
[pid 12945] read(32, "\376\1\t\f\3\0\0\20\1\0\0000\0\0i\1\5\0\0\0\0\0\0\0\0\0\0\2!\0\10\0"..., 64) = 64
[pid 12945] read(32, "//\0\0 \0\0", 7)  = 7
[pid 12945] read(32, "l\1\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 288) = 288
[pid 12945] read(32, "\1\1\0\0\n\0\0\0\4\0\1\0\0\0\1\200\2\0\0\33\0\4\0\377PRIMARY\377"..., 33) = 33
[pid 12945] pread(32, "\0\0\6\0InnoDB\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0\0", 26, 4462) = 26
[pid 12945] pread(32, "\377\0\0\0\0", 5, 4457) = 5
[pid 12945] read(32, "6\0\2\1\2\24)                         "..., 76) = 76
[pid 12945] close(32)                   = 0
[pid 12945] open("./test/t1.ibd", O_RDONLY) = 32
[pid 12945] pread(32, "d,[\270\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\31s\221\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384
[pid 12945] close(32)                   = 0
[pid 12945] open("./test/t1.ibd", O_RDWR) = 32
[pid 12945] fcntl(32, F_SETFL, O_RDONLY|O_DIRECT) = 0
[pid 12945] fcntl(32, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid 12945] pread(32, "\21\310\366{\0\0\0\3\377\377\377\377\377\377\377\377\0\0\0\0\0\31\227\5E\277\0\0\0\0\0\0"..., 16384, 49152) = 16384
[pid 12945] pread(32, "s&\322v\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\31d\353\0\5\0\0\0\0\0\0"..., 16384, 16384) = 16384
[19 May 2015 12:24] Umesh Shastry
######## 5.7.7 (same host as 5.6.24)
- Schema creation

[pid  2483] open("./test/t1.frm", O_RDWR|O_CREAT|O_TRUNC, 0640) = 34
[pid  2483] open("./test/", O_RDONLY)   = 35
[pid  2483] close(35)                   = 0
[pid  2483] close(34)                   = 0
[pid  2483] open("./test/t1.frm", O_RDONLY) = 34
[pid  2483] read(34, "\376\1\t\f\3\0\0\20\1\0\0000\0\0i\1\5\0\0\0\0\0\0\0\0\0\0\2!\0\10\0"..., 64) = 64
[pid  2483] read(34, "//\0\0 \0\0", 7)  = 7
[pid  2483] read(34, "l\1\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 288) = 288
[pid  2483] read(34, "\1\1\0\0\n\0\0\0\4\0\1\0\0\0\1\200\2\0\0\33\0\4\0\377PRIMARY\377"..., 33) = 33
[pid  2483] pread(34, "\0\0\6\0InnoDB\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0\0", 26, 4462) = 26
[pid  2483] pread(34, "\377\0\0\0\0", 5, 4457) = 5
[pid  2483] read(34, "6\0\2\1\2\24)                         "..., 76) = 76
[pid  2483] close(34)                   = 0
[pid  2483] open("./test/t1.ibd", O_RDWR|O_CREAT|O_EXCL, 0640) = 34
[pid  2483] fcntl(34, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid  2483] close(34)                   = 0
[pid  2483] open("./test/t1.ibd", O_RDWR) = 34
[pid  2483] fcntl(34, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid  2483] open("./test/t1.frm", O_RDONLY) = 35
[pid  2483] read(35, "\376\1\t\f\3\0\0\20\1\0\0000\0\0i\1\5\0\0\0\0\0\0\0\0\0\0\2!\0\10\0"..., 64) = 64
[pid  2483] read(35, "//\0\0 \0\0", 7)  = 7
[pid  2483] read(35, "l\1\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 288) = 288
[pid  2483] read(35, "\1\1\0\0\n\0\0\0\4\0\1\0\0\0\1\200\2\0\0\33\0\4\0\377PRIMARY\377"..., 33) = 33
[pid  2483] pread(35, "\0\0\6\0InnoDB\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0\0", 26, 4462) = 26
[pid  2483] pread(35, "\377\0\0\0\0", 5, 4457) = 5
[pid  2483] read(35, "6\0\2\1\2\24)                         "..., 76) = 76
[pid  2483] close(35)  = 

-- restart server, try to fetch row again

[root@ushastry Downloads]# strace -f -e trace=open,close,read,pread,io_submit,fcntl -p 2537
Process 2537 attached with 24 threads
Process 2580 attached
[pid  2580] open("/dev/urandom", O_RDONLY) = 46
[pid  2580] read(46, "\31\336\324\35cd\252\310JWIZ\325P\246\36\246\33\346&\351p\33\233\7\247y}\31\nI\271", 32) = 32
[pid  2580] close(46)                   = 0
[pid  2580] open("./test/db.opt", O_RDONLY) = 46
[pid  2580] read(46, "default-character-set=latin1\ndef"..., 65) = 65
[pid  2580] close(46)                   = 0
[pid  2580] close(46)                   = 0
[pid  2580] open("./mysql/db.opt", O_RDONLY) = 46
[pid  2580] read(46, "default-character-set=latin1\ndef"..., 65) = 65
[pid  2580] close(46)                   = 0
[pid  2580] open("./performance_schema/db.opt", O_RDONLY) = 46
[pid  2580] read(46, "default-character-set=utf8\ndefau"..., 61) = 61
[pid  2580] close(46)                   = 0
[pid  2580] open("./sys/db.opt", O_RDONLY) = 46
[pid  2580] read(46, "default-character-set=utf8\ndefau"..., 61) = 61
[pid  2580] close(46)                   = 0
[pid  2580] open("/proc/sys/vm/overcommit_memory", O_RDONLY|O_CLOEXEC) = 46
[pid  2580] read(46, "0", 1)            = 1
[pid  2580] close(46)                   = 0
[pid  2580] close(46)                   = 0
[pid  2580] open("./test/t1.frm", O_RDONLY) = 46
[pid  2580] read(46, "\376\1\t\f\3\0\0\20\1\0\0000\0\0i\1\5\0\0\0\0\0\0\0\0\0\0\2!\0\10\0"..., 64) = 64
[pid  2580] read(46, "//\0\0 \0\0", 7)  = 7
[pid  2580] read(46, "l\1\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 288) = 288
[pid  2580] read(46, "\1\1\0\0\n\0\0\0\4\0\1\0\0\0\1\200\2\0\0\33\0\4\0\377PRIMARY\377"..., 33) = 33
[pid  2580] pread(46, "\0\0\6\0InnoDB\0\0\0\0\0\0\n\0\0\0\0\0\0\0\0\0", 26, 4462) = 26
[pid  2580] pread(46, "\377\0\0\0\0", 5, 4457) = 5
[pid  2580] read(46, "6\0\2\1\2\24)                         "..., 76) = 76
[pid  2580] close(46)                   = 0
[25 May 2015 13:51] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.5.45, 5.6.26, 5.7.8, 5.8.0 release, and here's the changelog entry:

On Unix-like platforms, the
"os_file_create_simple_no_error_handling_func" and "os_file_create_func"
functions opened files in different modes when "innodb_flush_method" was
set to "O_DIRECT". 

Thank you for the bug report.
[29 Sep 2015 7:10] Laurynas Biveinis
commit b4daac21f52ced96c11632b83445111c0acede56
Author: Bin Su <bin.x.su@oracle.com>
Date:   Thu May 21 11:52:17 2015 +0800

    Bug#21113036 - MYSQL/INNODB MIX BUFFERED AND DIRECT IO
    
    As man page of open(2) suggested, we should open the same file in the same
    mode, to have better performance. For some data files, we will first call
    os_file_create_simple_no_error_handling_func() to open them, and then call
    os_file_create_func() again. We have to make sure if DIRECT IO is specified,
    these two functions should both open file with O_DIRECT.
    
    Reviewed-by: Sunny Bains <sunny.bains@oracle.com>
    RB: 8981
[29 Sep 2015 7:41] Laurynas Biveinis
I don't see this fix on 5.5.45. Filed bug 78616.