Description:
'GROUP BY concat_ws(...)' raises an unexpected error in 'SELECT ...' statement when query a comparatively large table. (e.g. ERROR 1062 (23000): Duplicate entry '28524-32769' for key 'group_key'.)
How to repeat:
1. Download GO term DB from "http://archive.geneontology.org/termdb/2011-01-01/go_daily-termdb-tables.tar.gz".
Import it into local MySQL database 'go', as described in "http://archive.geneontology.org/termdb/2011-01-01/README".
2. Successful query when table is small:
mysql>use test;
Database changed
mysql>CREATE TABLE d (id INT(11) PRIMARY KEY) SELECT DISTINCT term2_id AS id FROM go.graph_path WHERE term1_id=33275 AND distance>=1;
Query OK, 10 rows affected (0.30 sec)
Records: 10 Duplicates: 0 Warnings: 0
mysql>CREATE TABLE a (anc INT(11), term INT(11), dist INT(11)) SELECT DISTINCT g.term1_id AS anc, g.term2_id AS term, g.distance AS dist FROM d JOIN go.graph_path AS g ON (g.term1_id=d.id);
Query OK, 21 rows affected (0.09 sec)
Records: 21 Duplicates: 0 Warnings: 0
mysql>CREATE TABLE ss (term1 INT(11), term2 INT(11), LCA INT(11), between_dist INT(11)) SELECT a1.term AS term1, a2.term AS term2, a1.anc AS LCA, MIN(a1.dist +
a2.dist) AS between_dist FROM a AS a1 JOIN a AS a2 ON (a1.anc=a2.anc AND a1
.term<a2.term) GROUP BY CONCAT_WS('-', CAST(a1.term AS CHAR), CAST(a2.term AS
CHAR));
Query OK, 22 rows affected (0.11 sec)
Records: 22 Duplicates: 0 Warnings: 0
(Change 'term1_id=33275' to 'term1_id=4498', we will success again on a moderate large table.)
3. Failure occurs when table is comparatively large(with same statements as above):
mysql>RENAME TABLE d TO d_33275, a TO a_33275, ss TO ss_33275;
Query OK, 0 rows affected (0.03 sec)
mysql>CREATE TABLE d (id INT(11) PRIMARY KEY) SELECT DISTINCT term2_id AS id FROM go.graph_path WHERE term1_id=6534 AND distance>=1;
Query OK, 20091 rows affected (0.88 sec)
Records: 20091 Duplicates: 0 Warnings: 0
mysql>CREATE TABLE a (anc INT(11), term INT(11), dist INT(11)) SELECT DISTINCT g.term1_id AS anc, g.term2_id AS term, g.distance AS dist FROM d JOIN go.graph_path AS g ON (g.term1_id=d.id);
Query OK, 467970 rows affected (3.58 sec)
Records: 467970 Duplicates: 0 Warnings: 0
(Here, creating some indexes on this table 'a' will increase running speed of the last SQL statement. It takes more than several hours' WAITING before error reported, on my Notebook with T2050@1.60GHz CPU and 2.0G memory).
mysql>CREATE TABLE ss (term1 INT(11), term2 INT(11), LCA INT(11), between_dist INT(11)) SELECT a1.term AS term1, a2.term AS term2, a1.anc AS LCA, MIN(a1.dist +
a2.dist) AS between_dist FROM a AS a1 JOIN a AS a2 ON (a1.anc=a2.anc AND a1
.term<a2.term) GROUP BY CONCAT_WS('-', CAST(a1.term AS CHAR), CAST(a2.term AS
CHAR));
ERROR 1062 (23000): Duplicate entry '28524-32769' for key 'group_key'
(The 'Duplicate entry' may not be '28524-32769' in your test cases, even not the same in two repeats.)
Suggested fix:
N/A