Bug #39498 Doing INSERT INTO .. SELECT brings down all data nodes.
Submitted: 17 Sep 2008 13:49
Reporter: Geert Vanderkelen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:mysql-5.1-telco-6.3 OS:Any
Assigned to: CPU Architecture:Any
Tags: mysql-5.1.28-telco-6.3.18

[17 Sep 2008 13:49] Geert Vanderkelen
Description:
I was testing some things, and needed to duplicated a table. I used INSERT INTO .. SELECT, why not. But after it completing 'successfully', all data nodes crashed in what appears to be a domino effect. (Ah, bit romance doesn't hurt)

[root@MASTER:/test]
> SELECT NOW(); INSERT INTO t2 SELECT * FROM t1; SELECT NOW(); INSERT INTO t3 SELECT * FROM t1;
+---------------------+
| NOW()               |
+---------------------+
| 2008-09-17 15:37:20 |
+---------------------+
1 row in set (0.00 sec)

Query OK, 57036 rows affected (1 min 39.20 sec)
Records: 57036  Duplicates: 0  Warnings: 0

+---------------------+
| NOW()               |
+---------------------+
| 2008-09-17 15:38:59 |
+---------------------+
1 row in set (0.00 sec)

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Funny is, I found out that the time on the machines was out of sync. So reconfigured NTP and made sure they were running on the same time. Same crash, but while time was not synced, I got this error instead of the 'Lock wait timeout':

ERROR 1297 (HY000): Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER

Time: Wednesday 17 September 2008 - 15:39:03
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 249) 0x0000000a
Program: /data1/mysql/ndb-6.3bzr/libexec/ndbd
Pid: 17259
Trace: /data2/users/geert/cluster/master/ndb_3_trace.log.5
Version: mysql-5.1.28 ndb

Time: Wednesday 17 September 2008 - 15:39:14
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 4 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 249) 0x0000000a
Program: /data1/mysql/ndb-6.3bzr/libexec/ndbd
Pid: 30691
Trace: /data2/users/geert/cluster/master/ndb_4_trace.log.5
Version: mysql-5.1.28 ndb

I'll attach logs and traces.

How to repeat:
Need 3 tables, filling up the first one, doing the INSERTS:

CREATE TABLE t1 (
id INT NOT NULL AUTO_INCREMENT,
vc1 VARCHAR(200),
vc2 VARCHAR(30),
i1 INT NOT NULL DEFAULT '1',
PRIMARY KEY (id)
) ENGINE=NDB;

CREATE TABLE t2 (
id INT NOT NULL AUTO_INCREMENT,
vc1 VARCHAR(200),
vc2 VARCHAR(30),
i1 INT NOT NULL DEFAULT '1',
PRIMARY KEY (id)
) ENGINE=NDB;

CREATE TABLE t3 (
id INT NOT NULL AUTO_INCREMENT,
vc1 VARCHAR(200),
vc2 VARCHAR(30),
i1 INT NOT NULL DEFAULT '1',
PRIMARY KEY (id)
) ENGINE=NDB;

for((i=1;i<100000;i++)); do /data1/mysql/ndb-6.3bzr/bin/mysql -e "INSERT INTO t1 (vc1,vc2,i1) VALUES ('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3),('asdkfjdkfjdkfjkdjfkdjfksjfksjdfkjdfdf','kjkdjfkdj',3)" test; done

Stopped filling it up after it had 57036 rows or so.

mysql> INSERT INTO t2 SELECT * FROM t1; INSERT INTO t3 SELECT * FROM t1;

Configuration used with 2 data nodes:

[NDBD DEFAULT]
Datadir=/data2/users/geert/cluster/master
NoOfReplicas=2
DataMemory=2000M
IndexMemory=400M
TimeBetweenGlobalCheckpoints=2000

MaxNoOfConcurrentOperations=500000
MaxNoOfConcurrentTransactions=20000
SharedGlobalMemory=600M

[NDB_MGMD DEFAULT]
Datadir=/data2/users/geert/cluster/master
portnumber=1406

[TCP DEFAULT]
portnumber=2292
SendBufferMemory=4M

etc..
[17 Sep 2008 14:03] Geert Vanderkelen
Actually, before setting the time again OK on the machines (on data nodes there was like a 40 second differences), one of the data node crashed like this:

Time: Wednesday 17 September 2008 - 14:50:50
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtc/DbtcMain.cpp
Error object: DBTC (Line: 4419) 0x0000000a
Program: /data1/mysql/ndb-6.3bzr/libexec/ndbd
Pid: 16987
Trace: /data2/users/geert/cluster/master/ndb_3_trace.log.4
Version: mysql-5.1.28 ndb-6.3.18-RC

Attaching that trace too
[17 Sep 2008 14:05] Geert Vanderkelen
Logs and traces for mentioned errors

Attachment: bug39498.zip (application/zip, text), 186.51 KiB.

[17 Sep 2008 15:59] Geert Vanderkelen
MySQL Cluster 6.2.15 has similar crash, but error is bit different here:

Time: Wednesday 17 September 2008 - 17:53:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbassert) (Internal error, programming error or missing error message, please report a b
ug)
Error: 2343
Error data: suma/Suma.cpp
Error object: SUMA (Line: 3473) 0x0000000e
Program: /data1/mysql/ndb-6.2.15/libexec/ndbd
Pid: 30995
Trace: /data2/users/geert/cluster/master/ndb_4_trace.log.8
Version: mysql-5.1.23 ndb-6.2.15

The other node goes down just second before with:
Error data: Node 3 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 235) 0x0000000e

The client gets ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER

Uploading traces
[17 Sep 2008 16:00] Geert Vanderkelen
Using MySQL Cluster 6.2.15

Attachment: ndb_4_and_4_trace_log_8.zip (application/zip, text), 81.02 KiB.

[7 Apr 2009 8:11] Daniel Herlitz
Same thing just happened to me. I wanted to duplicated a table and executed a insert into X select * from Y when all ndb nodes crashed.

Version: 6.3.17 

Node 3 said: 

2009-04-07 08:51:26 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Node 2 said:

2009-04-07 08:51:18 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

ndb_2_error.log:
------------------

Time: Tuesday 7 April 2009 - 08:51:17
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 249) 0x0000000a
Program: ndbd
Pid: 3408
Trace: /var/mysql-cluster/ndb_2_trace.log.6
Version: mysql-5.1.27 ndb-6.3.17-RC
[7 Apr 2009 8:13] Daniel Herlitz
Attaching log files

Attachment: logs.tar.gz (application/x-gzip, text), 105.56 KiB.

[16 Apr 2009 13:27] Jonathan Miller
http://bugs.mysql.com/bug.php?id=37227
http://bugs.mysql.com/bug.php?id=43882
[20 Aug 2009 8:30] Jonas Oreland
decreasing Prio after having updated risk/effort.
we're aware of the problem...and have an idea on how to fix it
1) for MM, by committing transactions rather than operations in LQH
2) for DD, by changing from NO-STEAL to STEAL

however, each of those tasks is a huge undertaking