Bug #14986 should support direct I/O on a per-file basis (InnoDB log+data) on Solaris
Submitted: 16 Nov 2005 14:05 Modified: 12 Sep 2007 6:01
Reporter: Nils Goroll Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:4.1, 5.0, 5.1 OS:Solaris (Solaris)
Assigned to: Vasil Dimov CPU Architecture:Any
Tags: Contribution
Triage: D5 (Feature request)

[16 Nov 2005 14:05] Nils Goroll
Description:
With InnoDB, I/O should be issued directly to storage, rather than through
the buffer cache for the following reasons:

- InnoDB is usually configured with a large buffer pool, so additional
  buffering in the page cache is unnecessary and does not yield
  any increase in performance (except for the rare case where
  the buffer pool is just a small fraction of the system memory +
  a lot of system memory is free + it is not necessary or possible
  (32bit mysql build) to increase the innodb buffer pool

- Going through the buffer cache means an additional significant
  CPU-overhead

- Using direct I/O makes it unnecessary to call fsync() to finalise
  a transaction as all data is always written directly with no
  additional buffering - therefor using direct i/o increases reliability

Without this feature, it is still possible to use direct I/O by placing
InnoDB data and log on raw devices or on a seperate filesystem
mounted with forcedirectio (UFS) or convosync=direct (VxFS),
BUT:

- this solution is suboptimal as you have to reserve an additonal
  filesystem

- it is error prone as running executables/libraries off a direct i/o
  filesystem is not supportet and can lead to application hangs
  (when a read only page is feed/stolen by the page scanner,
   it will never get paged back in again if the backing store is
   Direct I/O)

- Supporting Direct I/O would allow MySQL to make sure that it
  is only enabled where useful (InnoDB data/log, not on MyISAM,
  not on log-files (error-log/slow-log)).

How to repeat:
n/a

Suggested fix:
Support the config option innodb_flush_method=O_DIRECT also on solaris by implementing
the following:

- Unless it is a raw device, call directio(filedescriptor, DIRECTIO_ON)
  on any InnoDB log/data file immediately after opening it

- For log/data files with DIRECTIO_ON, there is no need to call fsync() 
  at all.

- innodb_flush_log_at_trx_commit=1 should be the preferred flush
  option for Direct I/O - the overhead of flushing is much lower with
  direct I/O than with buffered I/O

manpage: see directio(3C)
[16 Nov 2005 14:43] Sinisa Milivojevic
HI!

Thank you for taking time to write to us.

First of all, it is mandatory that data are saved in memory, for many reasons, such as normal functioining of URL cache etc ...

Yes, both directio mounting and raw devices are supported by InnoDB.

I guess that you are trying to tell us that Solaris supports some third method, a method that requires specailised API for Solaris only.

In that case, please provide us with info on which versions of Solaris and CPU are supported as well as some link to the documentation.
[16 Nov 2005 15:41] Nils Goroll
> In that case, please provide us with info on which versions of Solaris
> and CPU are supported as well as some link to the documentation.

The directio() call is supportet on all Solaris Versions >= 2.6
(sparc/x86/amd - platform independend).

documentation: See section "suggested fix":

- Unless it is a raw device, call directio(filedescriptor, DIRECTIO_ON)

documentation: man -s 3c directio

manpage is available online at 

http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr48?q=directio&a=view
[16 Nov 2005 15:44] Nils Goroll
btw:

> First of all, it is mandatory that data are saved in memory, for many
> reasons, such as normal functioining of URL cache etc ...

this has got nothing to do with the fact that data needs to be saved
in memory. This suggestion is about a way to optimise writing data
to storage.
[4 Mar 2007 14:21] River Tarnell
Patch to enable directio() for InnoDB data files on Solaris (mysql-5.1.15-20070124)

Attachment: innodb-directio.diff (application/octet-stream, text), 482 bytes.

[10 Mar 2007 18:40] Mike Connell
Will this patch be part of the standard release? 

Will it be turned on by:  innodb_flush_method=O_DIRECT ?

In the meantime, do any other settings for innodb_flush_method
in Solaris work almost as well?

Thanks,

Mike
[28 Mar 2007 15:12] Nils Goroll
nice little patch, THANK YOU. I will try to allocate some time to test it, but I cant promise.
[29 Mar 2007 13:47] Heikki Tuuri
Thank you. Vasil has now written a patch that fixes http://bugs.mysql.com/bug.php?id=26662 and adds directio() in Solaris as well.

Regards,

Heikki
[29 Mar 2007 13:51] Heikki Tuuri
directio() in Solaris will work like O_DIRECT in Linux. InnoDB will not use unbuffered I/O for ib_logfiles. The reason is that avoiding double buffering is not that important for ib_logfiles, which are much smaller than data files.
[29 Mar 2007 14:31] River Tarnell
AIUI, the main reason to use direct-i/o on Solaris, rather than avoiding buffering, is that it avoids UFS single-writer lock (which prevents two threads to write to a file at the same time).  not sure if that makes a difference to whether log files should have it.
[29 Mar 2007 14:48] Heikki Tuuri
River,

thank you, interesting and surprising if Solaris blocks writes to the same file by two threads at the same time.

ib_logfiles are written to by only one thread at a time. Also, I think the same holds mostly for data files, since the writes go through the InnoDB 'doublewrite buffer'.

InnoDB flushes about 128 pages from the buffer pool at a time. In OS-buffered I/O, the subsequent fsync() call makes the OS to physically write the data to the disk surface or to a non-volatile disk cache. In unbuffered I/O, I think the data goes directly to the disk cache, and the fsync() call flushes the disk cache if the disk cache is volatile. This suggests that the parallelization of disk writes happens in the fsync() call, by the OS or by the disk hardware. The parallelization of file write calls is not that important.

Regards,

Heikki
[29 Mar 2007 17:32] Nils Goroll
Heikki/River,

* We do want to avoid buffering for a couple of reasons:
  - efficient memory usage
  - less overhead
  - consistency (also achieved by fsync()ing

  I think we all agree on this.

* Besides the UFS single writer lock, AFAIK, fsflush is
  also single threaded on solaris, which is one more reason
  why we want directio (and, ideally, async I/O, see my
  RFE Bug#14987, but I know I am asking for mich)

Another side note: a failing call to directio() really should
never be fatal. Even ZFS does not support directio yet, so all
this work is really only for UFS at this time.

Nils
[29 Mar 2007 20:56] Vasil Dimov
A patch has been committed as part of the fix of Bug#26662. See the comment there from [29 Mar 22:42].

Generally - directio() is called immediately after open() (if innodb_flush_method = O_DIRECT). In case of an error a diagnostic message is printed and operation continues.
[20 Apr 2007 9:55] Vasil Dimov
currently reads:

If O_DIRECT is specified (available on some GNU/Linux versions), InnoDB uses O_DIRECT to open the data files

it should be changed to something like:

If O_DIRECT is specified (available on some GNU/Linux versions, FreeBSD and Solaris), InnoDB uses O_DIRECT (or directio() on Solaris) to open the data files
[12 Sep 2007 6:01] Vasil Dimov
Patch for this bug has made it into 5.0.42 and 5.1.18-beta, see Bug#26662