MySQL Bugs: #40075: Non-descriptive error message when restoring on Falcon with open transactions

Bug #40075	Non-descriptive error message when restoring on Falcon with open transactions
Submitted:	16 Oct 2008 11:28	Modified:	7 Nov 2008 10:23
Reporter:	Øystein Grøvlen	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Backup	Severity:	S3 (Non-critical)
Version:	6.0	OS:	Any
Assigned to:	Ingo Strüwing	CPU Architecture:	Any

Description:
If I restore table in Falcon while there are transactions with uncommitted changes, restore will fail.  This is as expected, but I feel the error message is not of much help in explaining the situation.  The error one gets is:

1676: Could not restore database `test`

With additional warning:

Error 1007 Can't create database 'test'; database exists

How to repeat:
The following test script illustrate this:

--enable_warnings
use test;
create table t (s1 int) engine=falcon;
backup database test to 'test.bak';
set @@autocommit=0;
insert into t values (0);

connect (backup,localhost,root,,);

restore from 'test.bak';

Suggested fix:
It seems that it is not detected that dropping the database fails because of the open transactions.  This needs to be detected and reported.

It may be that BUG#33258 "Misleading Error when DROP TABLE operation while a transaction is open" is related to this issue.

Thank you for the bug report. Verified as described.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58156

2727 Ingo Struewing	2008-11-07
      Bug#40075 - Non-descriptive error message when restoring
                  on Falcon with open transactions
      
      Not to be pushed. This is a demo patch. It shows, what could be changed
      to receive a better error message in this case.
      
      This patch reverts a couple of design decisions. It should not be pushed
      without approval from the architecture team. For more information see
      the bug report.
      
      As you can see, the changes do not always improve the error messages.
      In backup_views it became worse. Definitely.
      In ndb_single_user, it is arguable.
      In restore_obstacles it became better.

This is not a bug. It works as designed.

If you really want it changed, please ask our architects to revise the
design decisions. The below text may help to understand which decisions
are required.

The storage engine, Falcon in this case, returns an error code from
handler::ha_delete table(), which tells that the table cannot be deleted
because of ongoing transactions on this table
(ER_LOCK_OR_ACTIVE_TRANSACTION). This error code is blocked (cleared) at
multiple places and/or replaced by less informative error messages.

Even if one or the other blockade could be seen as a "mistake" or "bug",
the mere existence of multiple blockades suggests that this is an
explicit design decision.

The problem with MySQL server error reporting is the diversity of error
reporting facilities. Several fuctional blocks have their own
facilities. Some use object oriented methods, others are plain
fuctional. Some build on thers, while some are independent. This might
come from the fact that there are at least two paradigms for error
reporting:

Paradigm #1: The first time, an error is detected, do immediately report
it with as much detail as possible. This will mostly be quite low level
information. But in most cases it names the real cause of the problem.
Higher level functions in the call hierarchy can add information that
tells the user, which operation was going on when the error happened.

Paradigm #2: The highest level function reports that it cannot do its
job. This should give the most easy to understand message to the user.
More detailed information could be added on request.

Examinig the code and talking with MySQL developers reveals that
different opinions exist, which paradigm to use. No wonder that the
error reporting works pretty heterogeneous in MySQL.

The inital error reporting facility seems to be influenced by paradigm #1.
There exists a "warning stack". The first report entered on this stack
will be sent as the error message to the user, if the statement results
in a non-zero return value. The other "stack" entries can be retrieved
by the "SHOW WARNINGS" statement. Since this facility is implemented in
the networking module, which sends results to the user, most, if not all
other error reporing facilities ar based on this one.

Developers, who prefer paradigm #2, need to block the error messages
that have been reported by lower layers, and replace them with their own
message. This blocking happens by clearing the "warning stack"
(thd->clear_error(), or somewhat less "clean"
thd->main_da.reset_diagnostics_area()). After this, a new error message
becomes first in the stack and will thus be reported to the user.
Unfortunately, this approach wipes out all stacked error reports.
Friends of paradigm #2 probably would like to be able to enter new
messages as number one on the "stack". But we don't have this ability.

Back to the reported problem of RESTORE. These are the obstacles that
prevent the engine's error code to come to the user's attention:

1. The global function ha_delete_table() is a wrapper for
handler::ha_delete_table(). Its purpose is to provide
handler::ha_delete_table() with a TABLE object and to suppress error
reports from the engine. ha_delete_table() has a parameter
"generate_warning". If this is true, an error code from the engine is
pushed as a warning on the warning stack. If it is false, no error
handling happens at all. Since the engine does not push to the warning
stack itself, no report is done at this stage. A comment in
ha_delete_table() suggests that it is explicitly not wanted that an
error is reported if the deletion of a table fails in the engine. IMHO
this is very well justified, if the error code from the engine says that
the table does not exist in the engine. But every other error means that
the table could not be deleted and continues to exist. IMHO this is
vital to be reported. But since the code is very explicit, this should
not be changed without approval from the architects.

2. In the DROP TABLE code (mysql_rm_table_part2()), we call
ha_delete_table() with generate_warning=FALSE. So we do not even want to
see warnings if a table could not be deleted. However, we collect tables
with problems (other than table does not exist) in a list and later
report this list together with the error code ER_BAD_TABLE_ERROR, which
translates to "Unknown table ...". While generate_warning=FALSE is a
design decision, the error text could be fixed by remembering if any
error other than non-existence happened. If so a new message like
"Cannot drop ..." could supersede the other message. However, this would
be unnecessary if we decide to fix 1. In that case all other errors
would have been reported already.

3. In the code that tries to restore a database
(DatabaseObj::do_execute()), we do first try to drop the same database,
but ignore its return code. We just try to create it anew. This wipes
out the above error and replaces it with something like "Canot create
...". I guess the implementor expected nothing but a "does not exist"
error, and felt safe to ignore it. But since non-existence is masked out
on the lower layer already, this is probably just a mistake and can be
fixed.

4. In the backup/restore kernel (bcat_create_item()), we check the
return code from the [re-]creation of the database object. If it failed,
we report an error like "Could not restore database". Since DROP
DATABASE and CREATE DATABASE are executed in sub-statements, the
query_id becomes incremented. And now comes a design detail of the
original error reporting facility, that I do not understand. THD
remembers the query_id of the first error in the warning stack
(thd->warn_id). If a new error is reported with a different query_id,
the stack is cleared, wiping out all former error reports. This looks
like a lapse into paradigm #2. If the architecs decide that this shall
stay as it is, we could still fix the obstacle by reporting a new error
only if no error exists or query_id is unchanged (if (thd->is_error() &&
(thd->query_id != thd->warn_id)) in bcat_create_item().

5. In the backup/restore kernel (Backup_restore_ctx::do_restore()),
after handling/[re-]creating all meta data, we explicitly clear the
diagnostics area (wipe out all error reports). This has been added
recently with revision
oystein.grovlen@sun.com-20081014120856-0k6zhjkekl2xcybb for
BUG#39089/WL#4384 (Fix missing error handling in backup code). So this
is a design decision by the backup team and could be revised by them.

The above patch shows how an implementation could look like if these
design descisions are changed.