Bug #25119 Data nodes died during inserting 1M records through INSERT INTO ... SELECT FROM
Submitted: 17 Dec 2006 16:36
Modified: 12 Mar 2007 12:10
Reporter: Serge Kozlov
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1.15-bk
OS: Linux (Linux FC4)
Assigned to:
CPU Architecture: Any

[17 Dec 2006 16:36] Serge Kozlov
Description:
The attached script (sqe.pl with aa.txt) creates one ndb_dd table with 1M records and then copies these records into a second table via INSERT INTO ... SELECT FROM in a loop. Although all the relevant options have high values (DataMemory, MaxNoOfConcurrentOperations, MaxNoOfLocalOperations, undofiles, datafiles), the data nodes crashed:

Current byte-offset of file-pointer is: 568

Time: Sunday 17 December 2006 - 16:59:36
Status: Permanent error, external action needed
Message: Signal lost, out of send buffer memory, please increase SendBufferMemory or lower the load (Resource configuration error)
Error: 6052
Error data: Remote note id 2.
Error object: TransporterCallback.cpp
Program: ./builds/libexec/ndbd
Pid: 13239
Trace: /space/run/ndb_3_trace.log.1
Version: Version 5.1.15 (beta)
***EOM***

Current byte-offset of file-pointer is: 568

Time: Sunday 17 December 2006 - 17:10:17
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ./builds/libexec/ndbd
Pid: 13232
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.15 (beta)
***EOM***

How to repeat:
1. Use the configuration from the attached file. Its main options are:
DataMemory: 1G
IndexMemory: 500M
MaxNoOfConcurrentOperations: 2M
MaxNoOfLocalOperations: 2M
2. Start cluster.
3. Run the script:
 ./sqe.pl -q aa.txt -p 127.0.0.1:3306:root::test
4. Wait until the script shows error '4009', then check the error log files.
[17 Dec 2006 16:41] Serge Kozlov
trace, log files, config.ini, perl script

Attachment: bug25119.tar.gz (application/gzip, text), 161.25 KiB.

[17 Dec 2006 19:32] Jonas Oreland
The trace files from node 2 are missing.
[17 Dec 2006 22:01] Serge Kozlov
trace files for node 2

Attachment: bug25119-trace-node-2.tar.gz (application/gzip, text), 141.87 KiB.

[18 Dec 2006 23:19] Jonas Oreland
Hi,

Could you test whether the problem is related to a relatively small undo_buffer_size,
  by increasing it to, say, 8M?

/Jonas
[19 Dec 2006 14:34] Serge Kozlov
I used undo_buffer_size=8M, 15M, and 20M and got the same result (crash) each time.
[19 Dec 2006 18:51] Jonas Oreland
Hi,

I tried this today, but I already failed when trying to use such a big configuration.

My machine only has 2G of RAM; how big is the machine you are using?
Can you run this with LockPagesInMemory (note: you need to be root, or have ulimit set correctly)?

Also, doing a 1M-row transaction is very much not recommended.
Some algorithms are not adapted to big transactions.
Does this work for MM?

Anyway, I added LIMIT clauses here and there, and then it works like a charm.
(also using a configuration that I could run without swapping on my machine)

---

So, conclusion: there probably is a bug somewhere.
But without a more realistic test case, it's very hard to estimate how likely
  this is to occur "in real life".

My guess would be that you could maybe recreate a similar bug with reasonably
  sized transactions and a small value for SharedGlobalMemory.

(I added LIMIT 50000; you can ask Johan what I suggest as the maximum for customers to use)

/Jonas
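
For reference, the LIMIT workaround described above can be sketched as follows. This is only an illustration, not the exact change that was tested: the table names t1 and t2, the ordering column id, and the helper batched_insert_select are hypothetical, and the sketch assumes each generated statement runs under autocommit so that every batch is its own transaction of at most 50000 rows.

```python
def batched_insert_select(src, dst, key, total_rows, batch_size=50000):
    """Yield one INSERT ... SELECT statement per batch of at most batch_size rows.

    src, dst and key are hypothetical table/column names; ordering by a
    unique key keeps the LIMIT/OFFSET windows from overlapping.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total_rows, batch_size):
        yield (
            f"INSERT INTO {dst} SELECT * FROM {src} "
            f"ORDER BY {key} LIMIT {batch_size} OFFSET {offset}"
        )


# Copying 1M rows in batches of 50000 gives 20 separate statements.
statements = list(batched_insert_select("t1", "t2", "id", 1_000_000))
print(len(statements))
print(statements[0])
```

Each statement could then be sent to the server one at a time. Whether 50000 is a safe batch size for a given configuration is not established here.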
[19 Dec 2006 20:52] Serge Kozlov
I also got the same error for the query CREATE TABLE ... SELECT FROM ... when the source table has 1M rows. I expected that, because in fact the same kind of transaction is used for it as for INSERT INTO ... SELECT FROM ...