MySQL Bugs: #34612: CSV storage engine does not honor time

Bug #34612	CSV storage engine does not honor time_zone
Submitted:	15 Feb 2008 19:49	Modified:	25 Sep 2009 16:59
Reporter:	Tim Clark	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: CSV	Severity:	S2 (Serious)
Version:	5.0,5.1.22	OS:	Other (i5/OS v5r4 PASE)
Assigned to:		CPU Architecture:	Any
Tags:	csv, time, time_zone

Description:
The CSV storage engine is not honoring the time_zone system variable when doing INSERTs.

How to repeat:
create table t1 (ts timestamp) engine=csv;
set time_zone='+00:00';
insert into t1 (ts) values ('2003-03-30 02:30:00');
set time_zone='+10:30';
insert into t1 (ts) values ('2003-03-30 02:30:00');
set time_zone='-10:00';
insert into t1 (ts) values ('2003-03-30 02:30:00');
select * from t1;

Expected:

+---------------------+   
| ts                  |   
+---------------------+   
| 2003-03-29 16:30:00 |   
| 2003-03-29 06:00:00 |   
| 2003-03-30 02:30:00 |   
+---------------------+   
3 rows in set (0.01 sec)  

Actual:

+---------------------+   
| ts                  |   
+---------------------+   
| 2003-03-30 02:30:00 |   
| 2003-03-30 02:30:00 |   
| 2003-03-30 02:30:00 |   
+---------------------+   
3 rows in set (0.01 sec)

verified on support I5os-03.mysql.com
Server version: 5.0.46-enterprise-gpl MySQL Enterprise Server (GPL)

verified on support I5
Server version: 5.1.23-rc MySQL Community Server (GPL)

This problems effects backup log tables to (see Bug#37146).
As far as I can tell, what is happening is that the same code (field.cc) is used for outputting the timestamp to the CSV file as for sending it to the client.  Hence, timestamps are converted to local time when they are written to the CSV file.  Similarly, when timestamps are read, they are assumed to be in local time in file, converted to UTC (based on current local time) internally, and then converted back to local time when sent to the client.

I ask for this bug to be retriaged since a fix for this is needed before the SR60RC tagged Bug#37146 can be fixed.

This also affects the IBMDB2I storage engine.

Experience shows that for a CSV table, INSERT treats a TIMESTAMP column as DATETIME i.e. it stores in it the time and date the passed data verbatim. For example, when UTC is 13:45, and my session's time zone is "Japan" (where it's 22:45), NOW() returns 22:45, INSERT NOW() leads to a CSV table whose content (as displayed by "cat") is:
"2009-06-13 22:45:18"
Then, this is what will be returned by SELECT, no matter the time zone at the time of SELECT; for example if I switch my session to Europe/Moscow (when it's 17:45), SELECT shows 22:45. MyISAM, InnoDB, Archive, Memory don't have this bug.
Insertion into TIMESTAMP of CSV table inserts what is passed verbatim into the CSV file, and SELECT returns the CSV file's content unchanged (ignoring time_zone of SELECTer).
This is in contradiction with
http://dev.mysql.com/doc/refman/6.0/en/time-zone-support.html
which says "Values for TIMESTAMP columns are converted from the current time zone to UTC for storage, and from UTC to the current time zone for retrieval.".
More importantly it is in contradiction with the goal of TIMESTAMP (carving in the row a time-zone-independent time "number of seconds since 1970 UTC").

This bug is also visible in the mysql.general_log CSV table:
"2009-06-13 17:43:55","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 17:44:08","root[root] @ localhost []",2,1,"Query","set time_zone='Japan'"
"2009-06-13 22:44:09","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 22:44:24","root[root] @ localhost []",2,1,"Query","create table u(a timestamp, b datetime) engine=myisam"
"2009-06-13 22:44:35","root[root] @ localhost []",2,1,"Quit",""
"2009-06-13 17:44:35","root[root] @ localhost []",3,1,"Connect","root@localhost on test"
We see that when a session changed its time zone to Japan, this changed the time recorded in the log table. Then another session using Europe/Moscow led to a previous time being logged. One would expect an always increasing time in a log, which is how the file-based general log behaves by the way.
If we want to fix this, we can do as the manual says: convert time to UTC when storing to CSV, convert back when reading. However, this has drawbacks:
1) CSV is special because people may display the table's content directly with system tools like "cat". So far this returned the same output as a SELECT. So if we convert to UTC when storing, SELECT and "cat" would differ. This may break existing applications. This may also surprise users who are happily operating in a single timezone (assume a departemental database used in a single country, not a multinational thing): they would see unexpected time when displaying the CSV file.
2) users who upgrade MySQL from before-the-fix mysqld to after-the-fix mysqld, and have a pre-existing CSV table filled before the upgrade: unless they previously operated in UTC, what SELECT returns is going to change after upgrade (as post-upgrade MySQL interprets the content of the CSV file as UTC, whereas it was in fact localtime). So we need a prominent upgrade note, and an upgrade path for users: they need to be able to tell "this old CSV table has timestamp column in timezone X"; one way could be that they do:
 mysqldump old table using old mysqld
 start new mysqld
 set time_zone=X;
 execute output of mysqldump
Given that different old CSV tables may have been filled with data from different time zones, the value of X could be different for each table, so such upgrade looks too delicate for mysql_upgrade to automatically do. It probably needs to be a manual procedure.
A command-line option could be provided "local_time_in_csv_tables", which forces the new mysqld to use the old behaviour, so that users are not forced to execute this upgrade procedure immediately when upgrading, but can postpone it to later. Then this option would be deprecated in the next version, and removed later. It does not sound like a lot of effort to implement this option. It can also be added as behaviour of the existing --old option, but I believe this option is controlling too much behaviours already (and thus the user lacks granularity with it).
As the fix would cause such surprises and backward incompatibilities, I am going to check with Architects first.

for the record the ibmdb2i bug is http://bugs.mysql.com/bug.php?id=43343

I'd either extend ROW_FORMAT, or something similar to let you create the sort of behavior you are asking for. I wouldn't change the default, as someone pointed out the behavior as described works well for people who are looking at the files in a raw form. Also, a number of users tend to copy files in, and their expectations have been done based on the current behavior.

On a side note, I've had a request for this sort of behavior in Archive as well. I've not finished the new version yet which allows for field reading, but when I do something like this will probably be added as well.

Something more from Brian "in recent versions of CSV it has a meta file with a version number in it. You could pack your info about format in that. The lack of a presence of such a file can tell you to use defaults. "

leaving architects a few days to comment on this.

Comment from Shawn Green:

I do not agree that the proposed fix will have as severe an impact as Guilhem supposes. Let me break it down:

1) Users taking advantage of the human readable nature of CSV files will see different dates after the change than what they are used to seeing. -

This is true but those users were relying on a buggy behavior to remain consistent. Once we fix TIMESTAMP fields in CSV to store as UTC and not as local timezone then we become consistent with the rest of our engines and with our documentation. I think it will be simple to do a one time "adjustment" to these values after you update to a version that is operating correctly.

The proposed option flag to allow for the "old" behavior will be necessary while we deprecate the old behavior (two major releases). Once we deprecate the old behavior, we can also remove the flag. It is possible that a user could make a feature request to keep the option (timestamps+CSV will store as local TZ) and not deprecate it but I doubt that it will be approved.

The workaround is to use the DATETIME storage type to get local timezone data stored in the CSV file.

2) User migration will be difficult and confusing. -

I disagree. With proper documented warning (INCOMPATIBLE CHANGE), the new "use old behavior" option flag, and the simple nature of the CSV format along with the wide variety of options for datetime conversions I see migration to the newer (correctly behaving) version to be fairly straightforward.

Now, if we wanted to automate this CSV+TIMESTAMP migration within mysqlupgrade then that would also be very nice but it would be a very separate issue to worry about.

Another bug which likely has the same cause: the value of time_zone at the moment when CHECKSUM TABLE is run influences what CHECKSUM TABLE returns. Example:

create table t1(a timestamp) engine=csv;
insert into t1 values("2001-01-01 11:00:00");

select * from t1;
checksum table t1;
# switch to other time zone
set time_zone="Japan";
select @@time_zone;
select * from t1;
checksum table t1;
drop table t1;

With engine=csv, the first and second CHECKSUM TABLE return different values (1531669096 then 1579932560), though both SELECTs return the same. With engine=myisam, no problem (1579932560 both). With engine=csv and 'a' being "datetime not null", no problem (1249892621 both).

1) the bug is bigger than expected, see my post of
[30 Jun 21:56] (CHECKSUM TABLE depends on session's time_zone, ugh).

2) Brian suggested extending the ROW_FORMAT clause of CREATE TABLE and
making the behaviour of TIMESTAMP in CSV dependent on ROW_FORMAT.
ROW_FORMAT is currently one of DEFAULT FIXED DYNAMIC REDUNDANT COMPACT
which are all about how the row is stored physically on disk (compact,
not compact, fixed-size etc); that is a SQL-invisible detail and does
not affect the content reported by SELECT. Whereas the TIMESTAMP
problem here has to do with content reported by SELECT. So ROW_FORMAT
does not look like the most appropriate way.

3) The current behaviour of TIMESTAMP in CSV regarding what is stored,
is the normal semantics of DATETIME (i.e. date and time stored in the
timezone used at INSERT time, verbatim, and returned to reader
verbatim, independently of the reader's timezone). So users willing to
have this special behaviour can change their column to DATETIME.

4) The current behaviour of TIMESTAMP in CSV is unique to this engine,
and in contradiction with the manual's definition of TIMESTAMP. Given
that DATETIME exists, it sounds unnecessary to keep the current
behaviour of TIMESTAMP in CSV long-term.

5) Given previous points, my proposal is to change the *default* to
be: store TIMESTAMP in UTC in the CSV file, and return it to SELECT
after adjustment to the reader's timezone, like for other
engines. This should fix all bad cases reported in the bug report
(including CHECKSUM TABLE).

6) There remains upgrade problems and current use cases to tackle with
care, see following points.

7) A pre-upgrade departemental database might be limited to a single
timezone in practice, and users/scripts may be reading the table with
"cat" and expect the same content as SELECT (i.e. local time). After
upgrading, newly inserted TIMESTAMP values are stored in UTC in the
file, scripts breaks. The upgrade procedure to fix this is that for
each TIMESTAMP column where the old behaviour (SELECT==file) is
desired, run:
ALTER TABLE tbl MODIFY col DATETIME NOT NULL;
(this must be run on *pre-upgrade* server). And fixing the
application if it was relying on auto-setting of TIMESTAMP when
INSERT/UPDATE specifies a NULL value (DATETIME does not have auto-set
properties, so NOW() should be explicitely used when
inserting/updating DATETIME).
After that, the user has a DATETIME column with DATETIME semantics,
with correct time content, and SELECT==file.

8) For columns which are not upgraded like above, i.e. real correct
TIMESTAMP behaviour is desired, there is a different procedure. Say
local time is UTC+2. A pre-upgrade old CSV table contained TIMESTAMP
in local time (say 08AM). SELECTing it with a post-upgrade server will
show 10AM which is wrong (the file's content 08AM will be wrongly
interpreted as UTC).
For each column where the user wants to switch to real correct
TIMESTAMP behaviour, she or he should do:
ALTER TABLE tbl MODIFY col DATETIME NOT NULL;
(this must be run on *pre-upgrade* server)
and then
ALTER TABLE tbl MODIFY col TIMESTAMP;
(this must be run on *post-upgrade* server).
After that the user has a TIMESTAMP column with normal TIMESTAMP
semantics and correct time content.
Alternative which would also work:
 * run mysqldump on pre-upgrade server
 * in its output, replace
   SET TIME_ZONE='+00:00'
   with
   SET TIME_ZONE=the timezone used for that column (UTC+2)
   (or use "mysqldump --skip-tz-utc" and add the line above to the
   output's start).
 * pipe this into post-upgrade server.

9) If we don't want to force users to execute the upgrade procedures of
(7)(8) immediately, we can provide an option
--csv-timestamp-in-local-time which keeps the old behaviour, and
deprecate that option later.

I pinged architects; let's give them until the end of the week for replying, after which I'll make a WL entry and go the administrative circuit to get feedback.

Guilhem,
With regard to your comment:
"4) The current behaviour of TIMESTAMP in CSV is unique to this engine..."
This behavior also exists in IBMDB2I and Federated and probably any engine which relies on Field_timestamp::val_str() to obtain the timestamp field value. As a result, I think the problem and the solution both need to account for a scope that is wider than CSV.

Hello Tim, thanks for the comments.
Regarding ibmdb2i:
- I don't have access to DB2 so cannot test any fix there, but I believe that the fix I suggested for CSV could work for DB2 too: instead of using val_str() (in db2i_conversion.cc I guess), we should emit a date and time string in UTC (for example, my local time is "2009-07-02 14:53:38", but what should be sent to DB2 is the UTC one "2009-07-02 12:53:38").
- the upgrade procedures of (7) and (8) likely apply too (points in my previous post); I say "likely" because without testing I cannot be sure
Regarding the federated engine:
- there is the local server (which does not have data) and the remote server (which has data), it's the local server which uses the federated engine so which needs some fixing. The ingredients of the fix will be the same as for CSV: the federated engine should send the timestamp expressed in the proper timezone. The upgrade procedures will be based on (7)(8).

Guilhem,
I believe that your upgrade solution in (7) and (8) will work with IBMDB2I

With regard to the actual fix: would it be possible to fix this in val_str() and store() without requiring each individual storage engine to implement a fix? Since the val_* functions are the documented method for an engine to extract field data from a row, it seems like a general solution there would make sense and eliminate a lot of code duplication among engines. At the very least, perhaps we could have a static function in handler.h/cc to help with the conversion and to avoid code duplication among engines?

Hello Tim. Agree with the helper function so that we don't have to duplicate code. Regarding fixing val_str() itself I don't know; this function creates what is returned by SELECT, so is probably expected to output the timestamp in human-readable form in the session's timezone. So I think it does its job; what's wrong is that this value is interpreted as an UTC value by some engines.

Guilhem,
I agree with you that we need to fix the default behavior to follow the standard set by MySQL for other engines.

I agree we should provide an upgrade procedure for the users, lots of your suggestions in item 7) make sense. I think concrete details can be ironed
out in the code review or architecture review.

I think you should move forward with this and fix the bug following your suggestion.

Given the scope of this work (upgrade procedures, several engines), a Worklog entry has been created, WL#5051.

not working on it at the moment (rather on Optimizer bugs), and waiting on arch review of WL#5051

BUG#37146 is also dependent on this fix.

I am having difficulties because of this bug, what is the fix schedule? I couldn't find any information.