Bug #47651 Got temporary error 4025 'Node failure caused abort of transaction'
Submitted: 25 Sep 2009 17:15 Modified: 28 Oct 2009 10:08
Reporter: Matthew Bilek Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:mysql-5.1-telco-7.0 OS:Linux
Assigned to: CPU Architecture:Any
Tags: 7.0.6, alter, cluster, Failure, gcp_stop, MySQL, node

[25 Sep 2009 17:15] Matthew Bilek
Description:
When performing an ALTER command on the cluster is receive the following message from the mysql command prompt:

mysql> ALTER TABLE `xxx`.`table2` ADD INDEX `i_table2_timestamp`(`timestamp`);
ERROR 1297 (HY000): Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER
mysql>

Message from the ndb_mgm console:

Node 22: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
/etc/mysql/config.ini

[TCP DEFAULT]
SendBufferMemory=16M
ReceiveBufferMemory=16M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/usr/local/mysql/data/

[NDB_MGMD]
Id=1
Hostname=192.168.0.80
ArbitrationRank=1

[NDB_MGMD]
Id=2
Hostname=192.168.0.81
ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/usr/local/mysql/data/
FileSystemPathDD=/usr/local/mysql/data/
#FileSystemPathUndoFiles=/usr/local/mysql/data/
#FileSystemPathDataFiles=/usr/local/mysql/data/
DataMemory=9G
IndexMemory=2G
LockPagesInMainMemory=0

MaxNoOfConcurrentOperations=100000

StringMemory=25
MaxNoOfTables=20000
MaxNoOfOrderedIndexes=10000
MaxNoOfUniqueHashIndexes=2500
MaxNoOfAttributes=120000
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=53
RedoBuffer=256M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

#MemReportFrequency=30
#BackupReportFrequency=10

### Params for setting logging 
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput 
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M
#Reports indicates that odirect=1 can cause io errors (os err code 5) on some systems. You must test.
#ODirect=1

### Watchdog 
TimeBetweenWatchdogCheckInitial=60000

### TransactionInactiveTimeout  - should be enabled in Production 
TransactionInactiveTimeout=30000
TransactionDeadlockDetectionTimeout=60000
### CGE 6.3 - REALTIME EXTENSIONS 
#RealTimeScheduler=1
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=40

### DISK DATA 
#SharedGlobalMemory=384M
#read my blog how to set this:
#DiskPageBufferMemory=3072M

### Multithreading 
MaxNoOfExecutionThreads=8

### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=32M

BatchSizePerLocalScan=512
[NDBD]
Id=21
Hostname=192.168.0.84

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=22
Hostname=192.168.0.85

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=23
Hostname=192.168.0.86

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=24
Hostname=192.168.0.87

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

##	BELOW ARE TWO (INACTIVE) SLOTS FOR DATA NODES TO ALLOW FOR GROWTH
#[NDBD]
#Id=7
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

#[NDBD]
#Id=8
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[MYSQLD DEFAULT]
BatchSize=512
#BatchByteSize=2048K
#MaxScanBatchSize=2048K

[MYSQLD]
Id=11
Hostname=192.168.0.81

[MYSQLD]
Id=12
Hostname=192.168.0.81

[MYSQLD]
Id=13
Hostname=192.168.0.81

[MYSQLD]
Id=14
Hostname=192.168.0.82

[MYSQLD]
Id=15
Hostname=192.168.0.82

[MYSQLD]
Id=16
Hostname=192.168.0.82

### SLOTS (one for each ndb_mgmd) FOR HELPER APPLICATIONS SUCH AS ndb_show_tables etc
[MYSQLD]
Hostname=192.168.0.80
[MYSQLD]
Hostname=192.168.0.81
[25 Sep 2009 17:17] Matthew Bilek
ndb_22_error.log

Current byte-offset of file-pointer is: 568

Time: Friday 25 September 2009 - 10:57:05
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 22 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 260) 0x0000000a
Program: /usr/local/mysql/libexec/ndbmtd
Pid: 27359
Trace: /data/mysqlcluster/ndb_22_trace.log.1 /data/mysqlcluster/ndb_22_trace.log.1_t1 /data
[25 Sep 2009 17:25] Matthew Bilek
ndb_22_out.log:

2009-09-25 10:57:05 [ndbd] INFO     -- Node 22 killed this node because GCP stop was detected
2009-09-25 10:57:05 [ndbd] INFO     -- NDBCNTR (Line: 260) 0x0000000a
2009-09-25 10:57:05 [ndbd] INFO     -- Error handler shutting down system
2009-09-25 10:57:05 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-09-25 10:57:05 [ndbd] ALERT    -- Node 22: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[25 Sep 2009 18:12] Matthew Bilek
Version is 7.0.6 not 7.0.7
[28 Sep 2009 10:08] Bernd Ocklin
Hi Matthew,

please run ndb_error_reporter tool and attach the resulting file. In addition please make sure to read the manual:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-nd...
[28 Sep 2009 10:30] Hartmut Holzgraefe
The key information from the logs is:

  2009-09-25 10:57:05 [ndbd] INFO     -- Node 22 killed this node because GCP stop was detected

This indicates that disk I/O bandwidth was too low and usually happens with operations on disk based tables. Can you confirm that the table you are trying to alter is using disk based columns and provide us with the table schema and the current number of rows stored in the table?
[28 Sep 2009 11:06] Bernd Ocklin
One comment: you are running lock pages in memory == 0 and a lot of data memory configured. You are likely swapping which is bad for cluster (and actually all data bases). Maybe you want to reduce data memory and use some disk based tables instead.
[29 Oct 2009 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".