MySQL Bugs: #66104: MySQL-Cluster Online backup error

Bug #66104	MySQL-Cluster Online backup error
Submitted:	30 Jul 2012 20:48	Modified:	7 Sep 2012 6:31
Reporter:	jose ferrero
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.2.7	OS:	Linux (Debian 6)
Assigned to:	Ole John Aske	CPU Architecture:	Any
Tags:	3001: Could not start backup, Backup, mysql-cluster

Description:
Hello, 

When I try to perform an online backup, I get the following error. Even after increasing the log level, I can't find any relevant log entries on the system.

ndb_mgm> 3 clusterlog BACKUP=15 
Executing CLUSTERLOG BACKUP=15 on node 3 OK! 

ndb_mgm> 3 clusterlog ERROR=15 
Executing CLUSTERLOG ERROR=15 on node 3 OK! 

ndb_mgm> start backup 11 
Waiting for completed, this may take several minutes 
Node 3: Backup 11 started from node 1 
Backup failed 
* 3001: Could not start backup 
* Backup abortet due to node failure: Permanent error: Internal error 
ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11. 
Node 2: Backup 11 started from 1 has been aborted. Error: 1326 
Node 2: Forced node shutdown completed. Occured during startphase 0. Initiated by signal 11. 
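
When scanning longer console or cluster logs for the same symptom, the forced-shutdown messages above can be picked out mechanically. The following is a minimal Python sketch; the function name and the regular expression are illustrative assumptions modeled only on the message text quoted in this report, not on any documented log grammar.

```python
import re

# Pattern modeled on the messages quoted in this report (assumption: the
# wording is stable). "Occured" is spelled exactly as the tool prints it.
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occured during startphase (\d+)\. Initiated by signal (\d+)\."
)


def forced_shutdowns(text):
    """Return a list of (node_id, start_phase, signal) tuples found in text."""
    return [tuple(int(g) for g in m.groups()) for m in SHUTDOWN_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "ndb_mgm> Node 3: Forced node shutdown completed. "
        "Occured during startphase 0. Initiated by signal 11.\n"
        "Node 2: Backup 11 started from 1 has been aborted. Error: 1326\n"
        "Node 2: Forced node shutdown completed. "
        "Occured during startphase 0. Initiated by signal 11.\n"
    )
    for node, phase, sig in forced_shutdowns(sample):
        print(f"node {node}: start phase {phase}, signal {sig}")
```

On Linux, signal 11 is SIGSEGV, so both data nodes in the session above went down with a segmentation fault rather than a controlled shutdown.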

The backup directory contains the following files:

BACKUP-11# ls -lrt
total 976
-rw------- 1 root root      0 Jul 28 14:46 BACKUP-11.3.log
-rw------- 1 root root      0 Jul 28 14:46 BACKUP-11-0.3.Data
-rw------- 1 root root 999424 Jul 28 14:48 BACKUP-11.3.ctl

The .ctl file has started to be written (so the backup did start), but for some reason the write stops.

A tail of this file shows the following:

/opt/mysql/server-5.5/bin/ndb_print_backup_file BACKUP-11.3.ctl  | tail -n 50
Key: 19 value(40) : "alienvault/def/vuln_nessus_report_stats"
Key: 20 value(4) : 519
Key: 21 value(4) : 4
Key: 25 value(4) : 253
Key: 26 value(4) : 0
Unknown type for key: 27 type: 2
Key: 133 value(4) : 0
Key: 129 value(4) : 0
Unknown type for key: 130 type: 2
Key: 135 value(4) : 0
Unknown type for key: 136 type: 2
Key: 1000 value(14) : "dtLastScanned"
Key: 1001 value(4) : 0
Key: 1006 value(4) : 0
Key: 1003 value(4) : 3
Key: 1005 value(4) : 8
Key: 1019 value(4) : 0
Key: 1008 value(4) : 0
Key: 1009 value(4) : 0
Key: 1010 value(4) : 0
Key: 1013 value(4) : 18
Key: 1014 value(4) : 0
Key: 1015 value(4) : 0
Key: 1016 value(4) : 1
Key: 1017 value(4) : 0
Key: 1007 value(4) : 0
Key: 1020 value(4) : 0
Unknown type for key: 1021 type: 2
Key: 1999 value(4) : 1
Key: 1000 value(10) : "NDB$TNODE"
Key: 1001 value(4) : 1
Key: 1006 value(4) : 1
Key: 1003 value(4) : 5
Key: 1005 value(4) : 64
Key: 1019 value(4) : 0
Key: 1008 value(4) : 0
Key: 1009 value(4) : 0
Key: 1010 value(4) : 0
Key: 1013 value(4) : 8
Key: 1014 value(4) : 0
Key: 1015 value(4) : 0
Key: 1016 value(4) : 64
Key: 1017 value(4) : 0
Key: 1007 value(4) : 0
Key: 1020 value(4) : 0
Unknown type for key: 1021 type: 2
Key: 1999 value(4) : 1
Key: 999 value(4) : 1
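
To see how far the .ctl file got, it can help to list only the string values (table and column names) from the output above. The following is a minimal Python sketch that parses the `Key: N value(L) : V` lines as they appear in this report. The helper names are hypothetical, the line format is assumed from the sample only, and no meaning is attached to the individual key numbers.

```python
import re

# Matches lines such as:
#   Key: 19 value(40) : "alienvault/def/vuln_nessus_report_stats"
#   Key: 20 value(4) : 519
KEY_RE = re.compile(r'^Key: (\d+) value\((\d+)\) : (.*)$')


def parse_key_lines(lines):
    """Yield (key, length, value) for each 'Key:' line; other lines are skipped."""
    for line in lines:
        m = KEY_RE.match(line.strip())
        if not m:
            continue  # e.g. "Unknown type for key: 27 type: 2"
        key, length, raw = int(m.group(1)), int(m.group(2)), m.group(3)
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            value = raw[1:-1]  # quoted string value
        else:
            try:
                value = int(raw)  # numeric value
            except ValueError:
                continue  # unrecognized payload, skipped in this sketch
        yield key, length, value


def string_values(lines):
    """Return only the quoted string values, in order of appearance."""
    return [v for _, _, v in parse_key_lines(lines) if isinstance(v, str)]


if __name__ == "__main__":
    sample = [
        'Key: 19 value(40) : "alienvault/def/vuln_nessus_report_stats"',
        "Key: 20 value(4) : 519",
        "Unknown type for key: 27 type: 2",
        'Key: 1000 value(14) : "dtLastScanned"',
    ]
    names = string_values(sample)
    print("last string value seen:", names[-1] if names else None)
```

Feeding the full output of ndb_print_backup_file through `string_values` and looking at the last few names shows which table definition was being written when the file stopped growing.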

############

ndb_1_cluster.log: 

2012-07-28 14:48:19 [MgmtSrvr] INFO -- Node 3: Backup 11 started from node 1 
2012-07-28 14:48:25 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected 

######################### 

#show 

[ndbd(NDB)]	2 node(s) 
id=2	@172.16.13.31 (mysql-5.5.25 ndb-7.2.7, Nodegroup: 0) 
id=3	@172.16.13.32 (mysql-5.5.25 ndb-7.2.7, Nodegroup: 0, Master) 

[ndb_mgmd(MGM)]	1 node(s) 
id=1	@172.16.13.30 (mysql-5.5.25 ndb-7.2.7) 

##### 

The default values of the parameters BackupDataBufferSize, BackupLogBufferSize, BackupMemory, BackupWriteSize and BackupMaxWriteSize are not changed. 

Thanks,

How to repeat:
Happens every time I try to start an online backup.

Jay Ward did a wonderful job debugging the same problem ( http://forums.mysql.com/read.php?25,563119,563457#msg-563457 ). Please find the traces attached.

Debugging traces made by Jay Ward

Attachment: debug.txt (text/plain), 8.58 KiB.

ndb_error_report for second cluster built with this same problem

Attachment: ndb_error_report_20120812114527.tar.bz2 (application/octet-stream, text), 265.14 KiB.

I also have this problem on Solaris amd64 mc-7.2.7. The crash occurs if I use ndbmtd but will not occur if I use ndbd. 

Using 8 threads.

This happens whenever I start the cluster with more than one LDM thread (which makes sense: with only one LDM thread, that thread can trivially talk to all LDM threads).
I was able to predictably recreate this using 'Recommended Starting Configuration for MySQL Cluster' (http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-config-starting.html) and adding these lines:

# The default value causes out of job buffer memory
SharedGlobalMemory=2G
# Just in case
DiskPageBufferMemory=1G

# To use multicores efficiently. 12 core machine:
# 11 - Main/IO Thread
# 10 - Rep
# 9  - TC
# 8  - Left to OS (shown to receive most interrupts)
# 7  - Left to OS (shown to receive second most interrupts)
# 6  - TC
# 5  - Recv
# 4  - Send
# 3  - LDM
# 2  - LDM
# 1  - LDM
# 0  - LDM
ThreadConfig=main={count=1,cpubind=11},io={count=1,cpubind=11},rep={count=1,cpubind=10},tc={count=2,cpubind=6,9},recv={count=1,cpubind=5},send={count=1,cpubind=4},ldm={count=4,cpubind=0-3}
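
Since the crash depends on the number of LDM threads, a quick way to check an existing configuration is to count the threads per type in the ThreadConfig value. The following is a minimal Python sketch, not part of any MySQL Cluster tool: the function names are hypothetical, the parsing covers only the `type={count=N,cpubind=...}` form used above, and treating a missing `count` as 1 is an assumption.

```python
import re

# One entry looks like: ldm={count=4,cpubind=0-3}
ENTRY_RE = re.compile(r"(\w+)=\{([^}]*)\}")
COUNT_RE = re.compile(r"\bcount=(\d+)")


def thread_counts(thread_config):
    """Return {thread_type: total count} for a ThreadConfig value string."""
    counts = {}
    for thread_type, body in ENTRY_RE.findall(thread_config):
        m = COUNT_RE.search(body)
        n = int(m.group(1)) if m else 1  # assumption: a missing count means 1
        counts[thread_type] = counts.get(thread_type, 0) + n
    return counts


def has_multiple_ldm(thread_config):
    """True if the ThreadConfig string explicitly asks for more than one LDM thread."""
    return thread_counts(thread_config).get("ldm", 0) > 1


if __name__ == "__main__":
    cfg = (
        "main={count=1,cpubind=11},io={count=1,cpubind=11},"
        "rep={count=1,cpubind=10},tc={count=2,cpubind=6,9},"
        "recv={count=1,cpubind=5},send={count=1,cpubind=4},"
        "ldm={count=4,cpubind=0-3}"
    )
    print(thread_counts(cfg))
    print("more than one LDM thread:", has_multiple_ldm(cfg))
```

For the configuration above this reports four LDM threads, which is the condition under which the backup crash was reproduced.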

Taking out the ThreadConfig line makes the problem go away. Automatic thread assignment then looks like this:

ThreadConfig: input:  LockExecuteThreadToCPU:  => parsed: main,ldm,recv,rep
NDBMT: MaxNoOfExecutionThreads=4
NDBMT: workers=1 threads=1 tc=0 send=0 receive=1
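
The `workers` value in the startup output above can also be read programmatically, for example to flag data nodes that started with more than one worker. The following is a minimal Python sketch; the function name is hypothetical and the line format is assumed from the two NDBMT lines quoted here.

```python
import re


def parse_ndbmt_line(line):
    """Parse a line like 'NDBMT: workers=1 threads=1 tc=0 send=0 receive=1' into a dict of ints."""
    if not line.startswith("NDBMT:"):
        raise ValueError("not an NDBMT line")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


if __name__ == "__main__":
    info = parse_ndbmt_line("NDBMT: workers=1 threads=1 tc=0 send=0 receive=1")
    print("workers:", info["workers"])
```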

And with only one worker, all workers can talk to all workers, and so the backup succeeds. I can supply a core dump if needed.

Jay

Jose posted his config.ini in his original forum thread: http://forums.mysql.com/read.php?25,563119,564783#msg-564783

He is just using the line MaxNoOfExecutionThreads=8 to generate this error.

Jay:

Thank you for a very detailed bug description. Thanks to it, the bug was very easy to identify. It seems to be a regression introduced in 7.2.7.

As you have already indicated, a possible (though poor) workaround is to avoid using ndbmtd, or at least to restrict it to a single LDM thread.

Ole,

Thank you very much. I did post this as a question on Mikael Ronstrom's blog and he said, "No, you missed nothing, you hit a bug. We also discovered this very recently and actually discussed the fix today :) A fix is in the works."

So... Thanks guys for being on top of this! I'm reducing our LDM thread count to 1 temporarily until the fix is produced.

Again, thank you!
Jay

I hate to be a bother, but is there any update on this? Do we have a time frame for it getting resolved?

This bug has been fixed in MySQL Cluster 7.2.8, which is now available at http://dev.mysql.com/downloads/cluster/