Bug #90398 Duplicate entry for key '<group_key>' error
Submitted: 12 Apr 4:16 Modified: 3 May 16:45
Reporter: Yoseph Phillips Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:5.7.17 EE, 5.7.22 OS:Any
Assigned to: CPU Architecture:Any

[12 Apr 4:16] Yoseph Phillips
Description:
The error is:
com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '2018-04-12 01:23:45-12345678' for key '<group_key>' 

This appeared to be the same as bug #58081 however that was fixed in 5.7.5 and we are still seeing this error in versions of MySQL until 5.7.17 EE.

We have seen it happen it many versions of MySQL from 5.6.25 CE to 5.7.17 EE.

If required we can provide the exact query where the issue happens (I have changed some of the names here for better understanding), it is of the form:
SELECT ...
FROM ...
INNER JOIN ...
INNER JOIN ...
INNER JOIN ...
INNER JOIN ...
LEFT JOIN ...
LEFT JOIN ...
WHERE ...
GROUP BY b.createdTimestamp, b.id
ORDER BY b.createdTimestamp DESC, b.id DESC;

We use InnoDB for all of our tables.
The query itself does not seem to be the problem.

The batch table has 8 indexes.

When we do a SELECT * FROM batch b WHERE b.id = 12345678; it returns the row with the issue. The explain shows that this is using the PRIMARY KEY index.

When we do a SELECT * FROM batch b WHERE b.createdTimestamp = '2018-04-12 01:23:45'; then it returns an empty set. The explain shows that this is using an index which is just an index on createdTimestamp.

This seems to imply that just this one index is getting corrupted somehow.

Every time clients report this error to us we need to get into their systems and run: UPDATE batch b SET b.createdTimestamp = b.createdTimestamp + INTERVAL 0 DAY; to fix up this index.

After running that UPDATE statement then SELECT * FROM batch b WHERE b.createdTimestamp = '2018-04-12 01:23:45'; now returns the row that had the issue.

The createdTimestamp column is of type: TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP

The batch table is having many rows inserted into it all the time.
Following some of the inserts there is an UPDATE to the row that was just inserted, neither of the id or createdTimestamp columns are updated. No other UPDATEs are done to this table.
Sometimes there are DELETEs from this table. We have checked one of these corrupted records and we can see that a record just 80 records away had been deleted which almost certainly would have been on the same index page.

Regrettably we cannot leave these records in this state as clients need to be able to run the query with the GROUP BY all the time. 

For security we also cannot keep a dump of the client's data.
I cannot confirm this first hand and this might confuse the issue, but I have been informed that for one client the above UPDATE did not solve the issue. Apparently a MySQL dump and restore into MySQL 5.6.27 still had this issue. A dump and restore to MySQL 5.5 did not have the issue. They then dropped and recreated all of the indexes which apparently solved the problem. This one client doesn't make much sense to me as the indexes would have been recreated when the restore was done as well. For all of our other clients doing the UPDATE statement above has worked so I would prefer to focus on that issue first.

How to repeat:
We do not have a reproducible test case for this as it seems to happen randomly.
[20 Apr 13:40] Sinisa Milivojevic
Hi,

First of all, we truly need a reproducible test case in order to verify the bug.

You are claiming that this issue occurs randomly. Hence, the next question ... How do you know that the DML that is UPDATE which increments the timestamp column by 0 days fixes the problem ???

One of the possible factors could be that your disk or RAM modules have problems. Problems that are, quite possibly, not reported by hardware diagnostic software, but that are glitches that occur randomly ???

Hence, in order to alleviate the problem you can try the following options, one by one:

* check the  available disk space on the datadir and partitions mounted there upon

* check the  available disk space on the tmpdir and partitions mounted there upon

* the error message that is reported looks like a possible problem with a temporary table. Hence, try to change the default temporary engine. If it is InnoDB, try MyISAM or the other way around

* when nothing is running on the server, run the diagnostic software for the hardware. I do not see what OS is in question, so I can not comment further

* check whether you are using ECC DRAM, two bits checking, one bit correcting
[21 Apr 12:36] Yoseph Phillips
I realise that this one is going to be really challenging without a reproducible test case. It really is a strange one. Our clients tend to have lots of concurrent users using the system all of the time. I believe that some clients can go years without this happening and then it just seemingly randomly happens. This makes it near impossible for us to come up with a reproducible test case.

Many of our clients have had this exact same issue on their systems. Many of them are on their own servers. Every time (with the possible exception of one time - which I can not personally verify) we have used this UPDATE the timestamp column by 0 days then both the SELECT * FROM batch b query and the other query SELECT statement with the GROUP BY start to work, so we know for a fact that this solves the issue.

The issue I have with going down the hardware track is that i) the rest of the database is fine ii) it is always the exact same index on this one table (we have lots of tables iii) it happens on so many client's servers which do not share the same hardware, in fact they all use very different hardware from different manufacturers.

We have checked the available disk space in both the tempdir and datadir partitions for many of these clients and they have ample free space.

All of our tables are InnoDB. We have not been able to reproduce this issue on our own test servers. On client servers we cannot switch them to MyISAM as MyISAM is not reliable enough, causes too many performance issues (these are huge tables that we are talking about), and has other limitations that do not work for us. Also I am 100% certain that this problem is not with a temporary table as the batch table with the corrupt index is not a temporary table. I do believe that it could be specific to InnoBD.    

These servers have users using them 24/7 so we will not have a time when nothing is running on the server. I do recall that the first time I saw this problem back in 2015 that the client had run a scan for any hardware issues and replied to us that there weren't any. I know for certain that it is happening on many Windows Servers. We have lots of clients using Linux as well. Please let me know if it would be helpful to know if this has also happened on clients running Linux and if there is any other information you need me to check about regarding the Operating System.   

Also regarding the ECC Dram - see my answers above - lots of different hardware and I believe that many of them would be using ECC Dram - but I do not believe that this issue is hardware related.

My guess is it is something to do with when rows are deleted from table when there might be concurrent inserts, updates or other deletions going on at the same time. There might also be timing issues with the redo logs. Many of the transactions would also be using read uncommitted. This will not be an easy one to solve.
[23 Apr 0:07] Yoseph Phillips
I have confirmed that this also happens on Linux clients - so clearly it is not related to the OS.
[23 Apr 12:47] Sinisa Milivojevic
Hi,

I hope that you understand that we have to find a way to reproduce the issue as otherwise we could not be able to help you.

Also, I did not recommend you to switch from InnoDB to MyISAM. I only recommended that you use the setting to switch only temporary tables to MyISAM. These are the intrinsic temporary tables that MySQL server creates in order to resolve queries. Those are not tables created and required by your application.

The reason why I am recommending to you to try to switch to different temporary table SE or to change tmpdir setting is because the error that is reported can be created only when creating an index for the intrinsic temporary tables.

The above error usually occurs when there is not sufficient space in the tmpdir or if you run out of the file descriptors.

Last, but not least, since you are using Enterprise Edition, why don't you create a SR for this problem.
[23 Apr 13:10] Sinisa Milivojevic
Hi,

I did some searching and discovered that this bug is a duplicate of the bug:

https://bugs.mysql.com/bug.php?id=58081

The only exception from this would be another bug that occurs only when you use ORDER BY rand();

The above bug contains all the workaround for solving this problem. And, yes, it is caused by tmpdir.
[23 Apr 13:31] Yoseph Phillips
Hi,

I was thinking of creating lots of threads all randomly inserting, updating and deleting from the batch table at random times, however before I go down that path I will need to work out what other updates end up fixing the issue - I will need to wait for it to happen on another client's system before I can do that - and we cannot change their data in anyway so it will be a little tricky. If we can reproduce it like that would that count as a way of reproducing the issue?  

My point is that while switching from InnoDB to MyISAM might stop the Duplicate entry for key error it will not solve the underlying cause of this index getting corrupted. The queries for a single row of the batch table do not use any temporary tables, and so it will not make a difference to them if we use InnoDB or MyISAM as you have described.

The Duplicate entry for key '<group_key>' error seems to be a symptom of the issue and not the cause of it.

I believe that you are referring to the Duplicate Entry error when you are talking about tmpdir and file descriptors. Here though the issue is caused by a corrupted index on the batch table. I had previously checked the tmpdir and file descriptors on one of the client systems at the time when they were getting the Duplicate entry for key error. At that point those were not an issue. 

Are there known things that could cause InnoDB indexes to become corrupted? I could check for those things. It occurs on so many different client systems, and this is the only index we have noticed this issue on. We probably don't have that many indexes on just a TIMESTAMP column so that might be related.

Some of our clients are using Enterprise Edition. One of those clients had this issue once and so we asked them to raise the SR for this issue (we gave them the issue number). They were using an old version of MySQL 5.5 for many years and I don't believe they has this issue. Then within a year of them upgrading to 5.7.17 EE this issue happens (it might have been a coincidence). They were happy as we resolved the issue for them using the UPDATE by adding 0 DAYS. We have not had confirmation from them that they have raised the SR.
[23 Apr 13:35] Sinisa Milivojevic
Hi,

Please read my comments regarding this duplicate bug.
[23 Apr 13:45] Yoseph Phillips
Regarding bug # 58081. I wrote about that bug in the initial description. I had commented on that bug back in 2015 as I also thought it might have been the same issue. However as I noted in the description of this issue 'that one was fixed in 5.7.5 and we are still seeing this error in versions of MySQL until 5.7.17 EE.'

Also we had tried the workarounds listed on there back in 2015 (when we first saw this issue) and they did not work for us. We did the UPDATE column by adding 0 DAYS trick. Updating using UPDATE batch b SET b.createdTimestamp = b.createdTimestamp did not do the trick, but by adding INTERVAL 0 DAY it did. When we do that it shows that it matches all rows, but only changes a few of them. Failing that we were going to try CHECK index and REPAIR index however we have never had an opportunity to test those.
[23 Apr 13:47] Yoseph Phillips
Please see my previous comment - this is not a Duplicate as the other bug was fixed in an earlier version of MySQL, however this issue still exists in MySQL 5.7.17 EE.
[23 Apr 15:50] Sinisa Milivojevic
Hi,

Bug #58081 is not fixed in 5.7.5, but only documented there, in our manual. As you can see from other duplicate bugs of that bug, this is a problem that are experienced with other GROUP BY queries.

This bug is not fixed and will not be fixed in the code, since MySQL server needs not check for the available space for large temporary tables, nor for the health of the I/O subsystem.

Hence, apply the workaround from that bug report.
[25 Apr 3:14] Yoseph Phillips
Regarding bug #58081:

Something was fixed there as we can easily reproduce that bug in 5.5.37 but not in 5.7.21.

Assuming you are referring to the workaround of:
SET SESSION max_heap_table_size = 536870912; 
SET SESSION tmp_table_size = 536870912;

That workaround does not even work for the original test in bug #58081.
In 5.5.37 the issue occurs irrespective of those table_size settings.

In 5.7.21 the issue in bug #58081 does not occur with either the default table_size settings of 16,777,216 or with those ones.

Likewise replacing their SELECT statement with:
SELECT insert('', t2.a, t1.a, (@@global.max_binlog_size)) FROM t1, t1 t2;

shows very buggy results in 5.5.37, however in 5.7.21 the correct results are shown, so something has definitely been fixed regarding bug #58081.

Regarding this bug:
This bug (#90398) still happens on 5.7.17 EE.
I am almost 100% certain that we tried that workaround back in July 2015 and it did not workaround this bug either. When I am next informed of a client having this issue again, I can reconfirm that with you.
It is obvious that once the index on the batch table is corrupted lots of queries are giving the wrong results or causing errors, so it is not an issue with the GROUP BY - that is just a symptom of the corrupted index.
[25 Apr 12:56] Sinisa Milivojevic
Hi,

I have already written that duplicate bugs of the #58081, do not have INSERT.

Corrupted index can only occur due to disk problems. Beside hardware problems, there is insufficient space problems, that does not have to be caused only by out-of-disk-space problems, but from various OS settings.

Until you come up with a repeatable test case, we can not verify the bug.
[27 Apr 4:56] Yoseph Phillips
Data creation script

Attachment: MySQL-Bug-90398.sql (application/octet-stream, text), 772.15 KiB.

[27 Apr 5:04] Yoseph Phillips
I have made a test case that I can reproduce on my own computer every time.
It is not letting me change the status back to open.

I have attached a file for creating the batch table with data.

Whenever I do this on my computer using MySQL 5.7.21 then the index is corrupted in the same way as on our client systems:

SELECT * FROM batch b WHERE b.id = 17702;
correctly returns the record.

SELECT * FROM batch b WHERE b.createdTimestamp = '2015-04-05 02:05:26';
returns an empty set.

Using the UPDATE by adding 0 DAYs trick fixes the issue.

I also noticed that many other records are corrupted.

If this does not reproduce it on your test systems then we can look at what the difference is between our environments.
[27 Apr 12:15] Sinisa Milivojevic
Hi Yoseph,

Don't worry about the status. 

Since you have a test case, I will try to reproduce it on two different servers and if the error occurs, I will happily verify this bug report.
[30 Apr 15:43] Sinisa Milivojevic
Hi,

Actually, I do not see what is wrong with your test case. For me, it works just fine as both servers return the same result.

" select id, createdTimestamp from batch where id=17702"

+-------+---------------------+
| id    | createdTimestamp    |
+-------+---------------------+
| 17702 | 2015-04-04 18:05:26 |
+-------+---------------------+

" select id, createdTimestamp from batch where createdTimestamp='2015-04-04 18:05:26.000'"

+-------+---------------------+
| id    | createdTimestamp    |
+-------+---------------------+
| 17702 | 2015-04-04 18:05:26 |
+-------+---------------------+

Just to check, I also ran:

update batch set createdTimestamp=addtime(createdTimestamp,'0 0:0:0.000') where id=17702";

and the above two SELECTs returned the exactly same results on both servers.

A record like you show in your last comment does not exist in the dump that you uploaded.
[1 May 2:06] Yoseph Phillips
I just tried this on 5.7.22 and it is still easy to reproduce in my environment:

C:\Program Files\MySQL\MySQL Server 5.7\bin>mysql -uyoseph -p test -e "source d:\temp\MySQL-Bug-90398.sql"

Server version: 5.7.22-log MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SELECT * FROM batch b WHERE b.id = 17702;
+-------+---------------------+
| id    | createdTimestamp    |
+-------+---------------------+
| 17702 | 2015-04-05 02:05:26 |
+-------+---------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM batch b WHERE b.createdTimestamp = '2015-04-05 02:05:26';
Empty set (0.00 sec)

* Please let me know if you are also testing on 5.7.22
* Please note that the difference in the displayed timestamp is simply due to our different TimeZones - doing the same thing for earlier rows works just fine
* Please do an explain on your "select id, createdTimestamp from batch where createdTimestamp='2015-04-04 18:05:26.000'" to make sure that it is using the createdTimestampIndex index
* Please don't do any UPDATE statements until you can reproduce the above issue as UPDATE statements can stop the issue 
* Please let me know if you are using my method above of restoring the table (-e and source) or if you are using some other method - in which case I can try that but I would expect the same results
* Please let me know if you have have something in your my.ini that could affect this, I can reproduce it even using an almost empty my.ini:

[client]
port = 3306
 
[mysql]
no-beep
 
[mysqld]
port = 3306
basedir = "C:/Program Files/MySQL/MySQL Server 5.7"
datadir = "D:/MySQL/MySQL Server 5.7/Data" 
log-error = "D:/Logs/MySQL Server 5.7 Error.log"
[1 May 11:08] Yoseph Phillips
I believe that I have worked out what the actual issue is.

In 2015 there were two 2015-04-05 02:05:26 in VIC Australia due daylight savings changes.

When MySQL is displaying the TIMESTAMP in the Sydney/Melbourne Australia Time Zone it must be using one of those two, and when we are using it in the where clause it must be converting it back to a TIMESTAMP an hour out from where it was.

Likewise when MySQL is doing the GROUP BY it must be converting the TIMESTAMP to a String, and then when it converts it back again it can no longer find it and so gives a Duplicate entry  for key '<group_key>' error.

Hopefully tomorrow I will have some time to double check that this is the issue for all of our clients.

I have not tested but this would mean it is very easy to reproduce this issue just by having one row in the data when using any TimeZone that has daylight savings changes.

As a workaround we might need to change our TimeZone to GMT before doing queries with a GROUP BY and change it back afterwards, but that will be messy.

So a simple way to reproduce my test case above is to use the Sydney/Melbourne Australia Time Zone.
[2 May 12:24] Sinisa Milivojevic
Hi,

While you are looking for it, check out whether you have properly upgraded your tables containing temporal values, between 5.5 and newer versions. This is described in 5.6 Revision History and in our Reference Manual  in chapter 11.3 for both 5.6 and 5.7.

Both are available on our dev.mysql.com public site for documentation.
[2 May 12:52] Yoseph Phillips
All of our temporal values are correct, when we upgrade between major versions we do logical upgrades, not in place upgrades and we even do mysql_upgrade as well.

I had a quick look at many clients and all of them were the same issue with the repeated hour during daylight savings time. I can officially confirm that the issue is how MySQL does GROUP BY on TIMESTAMP columns. As I wrote I my previous post you can easily reproduce what I am seeing by setting your Time Zone to Sydney/Melbourne Australia. 

If I get some free time tomorrow I can also write a small test case that will cause the Duplicate entry for key '<group_key>' error for any Time Zone that uses Daylight savings.
[2 May 12:54] Sinisa Milivojevic
We would be truly grateful on the test case.
[3 May 8:00] Yoseph Phillips
Here is a test case that should cause the Duplicate entry for key '<group_key>' error for any Time Zone that has Daylight Savings.

DROP TABLE IF EXISTS batch;

CREATE TABLE batch (
  id INT NOT NULL AUTO_INCREMENT,
  createdTimestamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  KEY createdTimestampIndex(createdTimestamp)
);

DROP PROCEDURE IF EXISTS populate_batches;

DELIMITER $$

CREATE PROCEDURE populate_batches()
BEGIN
  DECLARE startTime TIMESTAMP DEFAULT CURRENT_TIMESTAMP; 
  DECLARE counter INT DEFAULT 0;
    
  WHILE counter < 18000 DO
    INSERT INTO batch(createdTimestamp) VALUES(startTime + INTERVAL 30 * counter MINUTE);
    
    SET counter = counter + 1;
  END WHILE;
END $$

DELIMITER ;

SET TIME_ZONE = '+00:00';

CALL populate_batches();

SET TIME_ZONE = 'SYSTEM';

SELECT b.createdTimestamp, min(b.id)
FROM (SELECT 1 UNION SELECT 2) t 
CROSS JOIN batch b 
GROUP BY b.createdTimestamp;
[3 May 16:45] Sinisa Milivojevic
HI,

Thank you for your test case.

I managed to repeat your result:

ERROR 1062 (23000) at line 29: Duplicate entry '2018-10-28 03:12:40' for key '<group_key>'

I am happy to verify your bug. Also, I find that this is a severe bug, because wrong error code is returned.

Thanks again.

Verified as reported.