Bug #47055 unconditional exit(1) on ERROR_WORKING_SET_QUOTA 1453 (0x5AD) for InnoDB backend
Submitted: 2 Sep 10:11 Modified: 12 Nov 5:00
Reporter: George Danchev
Status: Closed
Category:Server: InnoDB Severity:S2 (Serious)
Version:5.0.51b OS:Microsoft Windows
Assigned to: Satya B Target Version:
Triage: Triaged: D2 (Serious)

[2 Sep 10:11] George Danchev
Description:
We use MySQL v 5.0.51b  and libmysql v.5.0.19 linked with our applications under Windows
2003.
We have a database with multiple InnoDB and MyISAM tables.

The following problem occurs on several of our server which appears to be related to the
InnoDB tables.
Here is an excerpt from MySQL error log:

090624 20:35:04  InnoDB: Operating system error number 1453 in a file operation.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/operating-system-error-codes.html
InnoDB: File name .\ibdata1
InnoDB: File operation call: 'aio write'.
InnoDB: Cannot continue operation.
InnoDB: Log scan progressed past the checkpoint lsn 0 3192273582
090624 20:38:52  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 0 3192289463
090624 20:38:53  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 

53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 51270062, file name binlog_slave.000685
InnoDB: Last MySQL binlog file position 0 48504357, file name
D:\REPLICATION\LOGS_BIN_MYSQL\binlog_slave.000524
090624 20:38:54  InnoDB: Started; log sequence number 0 3192289463
090624 20:38:54 [Note] Recovering after a crash using
D:/REPLICATION/LOGS_BIN_MYSQL/binlog_slave
090624 20:38:54 [Note] Starting crash recovery...
090624 20:38:54 [Note] Crash recovery finished.
090624 20:38:55 [Note] D:\MySQL_sever\bin\mysqld-nt: ready for connections.
Version: '5.0.51b-community-nt-log'  socket: ''  port: 3306  MySQL Community Edition
(GPL)

We looked at the source of MySQL 5.0.51b and realized that the following code from
innobase/os/os0file.c is involved:
a function named os_file_handle_error_cond_exit() is being called by the
os_file_get_last_error() and which calls a Windows specific function GetLastError().
It  happens that the latter function returns ERROR_WORKING_SET_QUOTA 1453 (0x5AD) -
Insufficient quota to complete the requested service [1], which leads to 

unconditional exit(1) later within os_file_handle_error_cond_exit(). 

Our questions being:

1) Is that return code of ERROR_WORKING_SET_QUOTA 1453 (0x5AD) bogus as returned by the
operating system, since we do not even use quotas?
2) Why the rest of possible return values of GetLastError() are not checked as well, but
an unconditional exit(1) is preffered?

From the log file we can see File operation call: 'aio write', which might be a failed
asynchronous write file operation.

We also found similar problem in MSSQL when the ERROR_WORKING_SET_QUOTA 1453 (0x5AD) is
returned [2]. 

Citing from the above document:
-------------------------------------
Resource-based retries
SQL Server 2000 performs read retries only when beginning the read operations fails and
returns an operating system error of ERROR_WORKING_SET_QUOTA (1453) 

or ERROR_NO_SYSTEM_RESOURCES (1450). Unlike the SQL Server 2005 enhancements, SQL Server
2000 does not try any other form of read retry other than sort 

failures.
When the error occurs, the SQL Server worker yields for 100ms and tries the read
operation again. This loop continues until the I/O is successfully issued. A 

simplified example demonstrates this behavior.

WHILE(		FALSE == ReadFile() 
    && (1450 == GetLastError() ||  1453 == GetLastError()) 
)
{
    Yield(100);
}

SQL Server 2005 maintains the same logic when it is trying to start a read operation. 
-------------------------------------

Should MySQL insist on writing to the file, until GetLastError() reports an successful
operation ? We also had a look at MySQL 6.0 code and it is roughtly the same error
handling.

[1] http://msdn.microsoft.com/en-us/library/ms681385(VS.85).aspx
[2]
http://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/SQLIOBas...

How to repeat:
It is probably related to the InnoDB under high load under Windows.

Suggested fix:
Just looking at the code in innobase/os/os0file.c it should be clear that either my OS
returns bad error code or either the MySQL daemon does not check them all and behave
appropriately.
[3 Sep 18:06] Sveta Smirnova
Thank you for the report.

> 1) Is that return code of ERROR_WORKING_SET_QUOTA 1453 (0x5AD) bogus as returned by the
operating system, since we do not even use quotas?

We can't know this, because this can be some problem in your environment. Please check OS
log files. Also version 5.0.51 is old and many bugs were fixed since. It makes sense to
upgrade to current version 5.0.85.

> 2) Why the rest of possible return values of GetLastError() are not checked as well,
but
an unconditional exit(1) is preffered?

As I understand this is not possible to check every single OS error, because always would
be unexpected things. There is a sense to do it only for particular errors which can be
fixed while server is running.
[4 Sep 10:13] George Danchev
> Thank you for the report.

> > 1) Is that return code of ERROR_WORKING_SET_QUOTA 1453 (0x5AD) bogus as 
> > returned by the operating system, since we do not even use quotas?

> We can't know this, because this can be some problem in your environment.  
> Please check OS log files. Also version 5.0.51 is old and many bugs 
> were fixed since. 

As I already wrote, we faced that problem on several of our servers. 
There are some more bugs reported as ours, though not in your bug tracking system:

http://forums.mysql.com/read.php?132,237235,237235
http://forums.theplanet.com/index.php?showtopic=85485

> It makes sense to upgrade to current version 5.0.85.

The relevant code found in os/os0file.c is the same. It is the same even in 6.0.

> > 2) Why the rest of possible return values of GetLastError() are not checked 
> > as well, but an unconditional exit(1) is preffered?

> As I understand this is not possible to check every single OS error, because > always
would be unexpected things. There is a sense to do it only for 
> particular errors which can be fixed while server is running.

How it is possible for MSSQL to deal with that, but MySQL can not? Please, have a deeper
look at the example I provided in my previous message.
[4 Sep 10:59] Sveta Smirnova
Thank you for the feedback.

> As I already wrote, we faced that problem on several of our servers. 

MySQL just uses OS error code. This is OS what prevent io operation. This is why I ask to
check OS error log.

> The relevant code found in os/os0file.c is the same. It is the same even in 6.0.

If this is MySQL bug real problem is not in error handling, but wrong operation before
error occurs. It is not clear from description if this is the case, but can be
occasionally solved.

> How it is possible for MSSQL to deal with that, but MySQL can not? Please, have a
deeper
look at the example I provided in my previous message.

I can verify this report as feature request "Please handle ERROR_WORKING_SET_QUOTA
error". Let us know if this is your only concern regarding to this situation.
[4 Sep 11:56] George Danchev
> MySQL just uses OS error code. This is OS what prevent io operation. This is > why I ask
to check OS error log.

Okay, that makes sense. We will be waiting for the problem to occur, and then grab Event
Viewer | System and Application logs.

> > The relevant code found in os/os0file.c is the same. It is the same even in > >6.0.

> If this is MySQL bug real problem is not in error handling, but wrong 
> operation before error occurs. It is not clear from description if 
> this is the case, but can be occasionally solved.

It occures upon os_file_write() operation. If the 'write operation' fails, then why not
just sleep() for a while and try again.

> > How it is possible for MSSQL to deal with that, but MySQL can not? Please, 
> >have a deeper look at the example I provided in my previous message.

> I can verify this report as feature request "Please handle 
> ERROR_WORKING_SET_QUOTA error".
> Let us know if this is your only concern regarding to this situation.

It makes no big difference how you classified the bug-report, when your SQL daemon
performs exit(1) and leave the clients speechless. Yes, it is a request to handle
ERROR_WORKING_SET_QUOTA and all possible return codes of GetLastError(), since they are
already a stable set of codes available at:

http://msdn.microsoft.com/en-us/library/ms681381(VS.85).aspx

I know that it is not always possible to recover, but the codes MySQL currently handle is
just far from a sensible set of possible failures. 

Otherwise, either we change the daemon or the OS as well. Sad, but true.
[4 Sep 20:34] Sveta Smirnova
Thank you for the feedback.

Verified as feature request "Please handle ERROR_WORKING_SET_QUOTA more smart, for
example, like reporter suggested in the initial description".
[14 Oct 16:39] Bugs System
Pushed into 5.1.41 (revid:joro@sun.com-20091014143611-cphb0enjlx6lpat1) (version source
revid:satya.bn@sun.com-20091009140218-24h3v55dgsxgs609) (merge vers: 5.1.40) (pib:13)
[22 Oct 8:36] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091022063126-l0qzirh9xyhp0bpc) (version
source revid:alik@sun.com-20091019135554-s1pvptt6i750lfhv) (merge vers: 6.0.14-alpha)
(pib:13)
[22 Oct 9:09] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091022060553-znkmxm0g0gm6ckvw) (version
source revid:alik@sun.com-20091019131022-2o2ymjfjjoraq833) (merge vers: 5.5.0-beta)
(pib:13)
[3 Nov 0:11] Calvin Sun
Before the fix, when having a failed IO operation with return code of
ERROR_WORKING_SET_QUOTA from the Windows operating system, InnoDB will intentionally
crash the server. Now, InnoDB will sleep for 100ms and retry the failed operation.
[12 Nov 5:00] Paul DuBois
Noted in 5.1.41, 5.5.0, 6.0.14 changelogs.

On WIndows, when a failed I/O operation occurred with return code of
ERROR_WORKING_SET_QUOTA, InnoDB intentionally crashed the server. Now
InnoDB sleeps for 100ms and retries the failed operation.