Bug #41202 Error: 2341 during cluster backup causes multiple data node crash
Submitted: 3 Dec 2008 14:15    Modified: 25 Apr 2009 11:15
Reporter: Jonathan Carter
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: 5.0.45    OS: Linux (RHEL4)
Assigned to:    CPU Architecture: Any
Tags: Backup, cluster, data node, ndb

[3 Dec 2008 14:15] Jonathan Carter
Description:
Hi

In 12 months this is the first time I have had a problem performing a cluster BACKUP. The cluster starts again and runs fine, but at the moment I am without a backup.

Any ideas?

Time: Sunday 30 November 2008 - 01:25:07
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Backup.cpp
Error object: BACKUP (Line: 3014) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 11786
Trace: /opt/mysql-cluster/ndb_2_trace.log.21
Version: Version 5.0.45
***EOM***

2008-11-30 00:44:56 [MgmSrvr] INFO -- Node 2: Local checkpoint 91138 started. Keep GCI = 15389746 oldest restorable GCI = 15389521
2008-11-30 00:54:04 [MgmSrvr] INFO -- Node 2: Local checkpoint 91139 started. Keep GCI = 15390007 oldest restorable GCI = 15389929
2008-11-30 01:03:26 [MgmSrvr] INFO -- Node 2: Local checkpoint 91140 started. Keep GCI = 15390268 oldest restorable GCI = 15390333
2008-11-30 01:12:31 [MgmSrvr] INFO -- Node 2: Local checkpoint 91141 started. Keep GCI = 15390528 oldest restorable GCI = 15390504
2008-11-30 01:21:40 [MgmSrvr] INFO -- Node 2: Local checkpoint 91142 started. Keep GCI = 15390788 oldest restorable GCI = 15390686
2008-11-30 01:25:08 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-11-30 01:25:08 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-11-30 01:25:08 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-11-30 01:25:09 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
I can repeat it at will simply by requesting a backup (see the example command below).
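For reference, the backup is requested from the management client in the usual way (a minimal sketch; the connect string below is a placeholder):

  # non-interactively, from any host that can reach the management node
  ndb_mgm -c mgmt_host:1186 -e "START BACKUP"

  # or at the interactive ndb_mgm prompt
  ndb_mgm> START BACKUP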
[11 Dec 2008 13:08] Martin Skold
Has this bug been verified on newer versions?
Try ndb-6.2 for example.
[11 Dec 2008 13:16] Jonathan Carter
No, this is a production environment and I do not have enough hardware to set up a parallel environment to test on.

Also, even if I could set up such a test rig, I still cannot back up the cluster, so there is no way for me to get the data over to a 6.x test rig at present.

jc
[11 Dec 2008 13:34] Jonathan Carter
Via the mailing list I have received a suggestion that this is caused by bug #30172; however, I cannot see that bug. Could you check whether this is correct?

Jonathan
[16 Dec 2008 13:43] Nicolas Diguet
I have the same bug with the 5.1.24 ndb-6.3.16-telco version after a large number of insertions.
Nicolas
[16 Dec 2008 16:34] Jonathan Carter
Do you also have the problem that it brings down both data nodes?

I was talking to people on the message board and the current recommendation was to embark on an upgrade to 6.x; however, if it is still possible for a single bug to bring down both data nodes, undermining the idea of no single point of failure, then I will delay the upgrade until we have a solution.
[24 Dec 2008 10:02] Nicolas Diguet
Yes, it brings down both data nodes.
I can't populate my database and my data is lost (the problem occurs in Backup.cpp during the backup, and then in restore.cpp if I try to restart the nodes).
[24 Dec 2008 10:24] Nicolas Diguet
My ndb error log trace (6.3.20-innodb):
Time: Wednesday 23 December 2008 - 10:13:37
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: backup/Backup.cpp
Error object: BACKUP (Line: 4652) 0x0000000e
Program: /usr/local/mysql-5.1.30-ndb-6.3.20-innodb/libexec/ndbd
Pid: 22346
Trace: /var/lib/mysql-cluster/logs/ndb_1_trace.log.5
Version: mysql-5.1.30 ndb-6.3.20-GA
***EOM***

If I try to restart (without --initial), I get the following (see the command sketch after this log):

Time: Tuesday 24 December 2008 - 17:46:32
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 487) 0x0000000a
Program: /usr/local/mysql-5.1.30-ndb-6.3.20-innodb/libexec/ndbd
Pid: 22195
Trace: /var/lib/mysql-cluster/logs/ndb_1_trace.log.6
Version: mysql-5.1.30 ndb-6.3.20-GA
***EOM***
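
For clarity, the restart attempts above were normal node restarts rather than initial ones; roughly (a sketch, with the connect string as a placeholder):

  # normal restart, reusing the node's existing file system
  ndbd -c mgmt_host:1186

  # initial restart, wiping the node's file system so data is rebuilt from
  # the other node or restored from a backup
  ndbd -c mgmt_host:1186 --initial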
[25 Mar 2009 11:15] Jonathan Miller
Please provide your configuration.
[2 Apr 2009 9:04] Nicolas Diguet
Here is my config.ini.

6 servers:
2 data nodes (16 GB RAM each)
3 API nodes
1 controller (management node)

#
# This config.ini file was generated by ndbinstaller.
#

[NDBD DEFAULT]
#The number of copies of the data stored on different nodes in the cluster
NoOfReplicas=1

#The amount of main memory (RAM) used to store columns and ordered indexes in tables, plus some overhead
DataMemory=12150M

#The amount of main memory (RAM) used for hash indexes in tables, plus some overhead
#Usually between 1/6 and 1/8 of the DataMemory is enough, but this depends on the
#number of unique hash indexes (UNIQUE in table def)
IndexMemory=1500M
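#Worked check with the values above (illustrative): 1/8 of DataMemory is
#12150M/8 ≈ 1519M, so IndexMemory=1500M is close to the 1/8 figure.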

#If you have lots of tables (>1000), then you may get 'Error 773' and need to increase this (25 would be reasonable)
StringMemory=25

#Set NoOfFragmentLogFiles to 6*DataMemory [in MB] / (4 * FragmentLogFileSize [in MB]).
#For example, with DataMemory=2048M and FragmentLogFileSize=256M: NoOfFragmentLogFiles = 6*2048/1024 = 12.
#The 6x (or 4x) DataMemory rule is a good heuristic and is STRONGLY recommended.
#The amount of disk space used to store the Redo Log (used for node recovery) is NoOfFragmentLogFiles * 4 * FragmentLogFileSize.
NoOfFragmentLogFiles=77

#The size of a Redo Log file on disk.
FragmentLogFileSize=256M
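#Worked check with the values above (illustrative): 6*12150 / (4*256) = 72900/1024 ≈ 71.2,
#so at least ~72 redo log file sets are needed; NoOfFragmentLogFiles=77 satisfies this and
#allocates roughly 77 * 4 * 256M ≈ 77 GB of redo log on disk.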

#The speed at which LocalCheckpoints (LCP) are written to disk
DiskCheckpointSpeed=10M

#The speed at which LocalCheckpoints (LCP) are written to disk, as part of a node recovery 
DiskCheckpointSpeedInRestart=100M

#The maximum amount of data written to a file before it is flushed (synced) to disk
DiskSyncSize=10M

#The size of the RedoBuffer used to buffer writes to the disk subsystem. Increase for high write-rate or slow disk.
RedoBuffer=386M

#The maximum time in ms that is permitted to elapse between operations in the same transaction before the transaction is aborted.
TransactionInactiveTimeout=8000

#Time in ms between flushes of global checkpoint groups from memory to disk
TimeBetweenGlobalCheckpoints=1000

#Controls how much write activity accumulates before a new local checkpoint is started, given as the base-2 logarithm of the number of 4-byte words (21 ≈ 8MB); despite the name this is not a time in ms.
TimeBetweenLocalCheckpoints=21

#Time in ms between replication events sent up from cluster
TimeBetweenEpochs=100

#Represents the number of seconds between memory usage reports written to the cluster log
MemReportFrequency=10

#This prevents ndbd processes and DB memory from being swapped out to disk
LockPagesInMainMemory=1

#
# TRANSACTION PARAMETERS
#

#The number of transaction records available at each data node. Increase this if you have a higher number of concurrent transactions
MaxNoOfConcurrentTransactions=4096

#The number of operation records available at each data node. Increase this if you have a higher number of concurrent operations
MaxNoOfConcurrentOperations=32768

#The number of operation records available at each data node for queries that use a hash index. Increase this if you have a higher number of concurrent operations
MaxNoOfConcurrentIndexOperations=8192

#
# Scans and Buffering
#

# Max number of parallel scans. Max value is 500.
MaxNoOfConcurrentScans=256

#Note: 'alter table' requires 3 times the number of attributes that are in the original table. 
MaxNoOfAttributes=24576

#A table object is required not only for each table, but also for every unique hash index and every ordered index. Maximum value is 20320, minimum is 8.
MaxNoOfTables=4096

#Each ordered index requires 10KB of memory.
MaxNoOfOrderedIndexes=2048

#Each unique hash index requires 15KB of memory.
MaxNoOfUniqueHashIndexes=4096

#Replication, unique index operations, backups and ordered index operations require trigger objects.
MaxNoOfTriggers=768

# 
#Backup Parameters
#The following must hold when updating backup params:
# 1. BackupDataBufferSize >= BackupWriteSize + 188KB
# 2. BackupLogBufferSize >= BackupWriteSize + 16KB
# 3. BackupMaxWriteSize >= BackupWriteSize
# 4. BackupMemory = BackupDataBufferSize + BackupLogBufferSize

BackupMemory=1856M
#Increase if slow disk subsystem when making a backup. 
BackupDataBufferSize=928M

#Increase if slow disk subsystem when making a backup. 
BackupLogBufferSize=928M

#Default size of msgs written to disk. This value must be less than or equal to BackupMaxWriteSize.
BackupWriteSize=32000

#Maximum size of messages written to disk. This value must be greater than or equal to BackupWriteSize.
BackupMaxWriteSize=256000
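
#Worked check of rules 1-4 above against these values (illustrative):
# 1. 928M   >= 32000 bytes + 188KB  -> holds
# 2. 928M   >= 32000 bytes + 16KB   -> holds
# 3. 256000 >= 32000                -> holds
# 4. 1856M  =  928M + 928M          -> holds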

#How often (in seconds) backup status reports are written to the cluster log during a backup
BackupReportFrequency=100

#
# Real-time params
#

TransactionDeadlockDetectionTimeout=8000

#Setting this to '1' enables real-time scheduling of ndb threads.
#RealtimeScheduler=0

#Time in microseconds for threads to be executed in the scheduler before being sent.
#SchedulerExecutionTimer=50

#Time in microseconds for threads to be executed in the scheduler before sleeping.
#SchedulerSpinTimer=0

#Whether an ndbd process should halt (rather than restart automatically) when it encounters an error.
StopOnError=true

#Whether the cluster should run in diskless mode (no checkpoints or redo logging written to disk)
Diskless=false

#Data nodes attempt to use O_DIRECT when writing LCPs, backups and redo logs.
#Should be disabled for 2.4 or older kernels, enabled for 2.6 or higher kernels
ODirect=true

#[SHM]
#NodeId1=1
#NodeId2=2
#ShmKey=324
#ShmSize=4M

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]

[TCP DEFAULT]
# I found this setting buggy with 5.1.17-beta when starting the ndb_mgmd process.
# SendBufferMemory=2M

# Management Server
[NDB_MGMD]
HostName=10.1.9.137
PortNumber=1186
DataDir=/var/lib/mysql-cluster/logs
LogDestination=FILE:filename=/var/lib/mysql-cluster/logs/cluster.log,maxsize=1000000,maxfiles=6
NodeID=63

[NDBD]
HostName=10.1.9.56
DataDir=/var/lib/mysql-cluster/logs
FileSystemPath=/space/ndbdata/1
NodeID=1
DiskPageBufferMemory= 754M
SharedGlobalMemory= 396M

#Assign this data node to a CPU-ID
#LockExecuteThreadToCPU=
#LockMaintThreadToCPU=
#DiskPageBufferMemory= 250M

[NDBD]
HostName=10.1.9.57
DataDir=/var/lib/mysql-cluster/logs
FileSystemPath=/space/ndbdata/2
NodeID=2
DiskPageBufferMemory= 724M
SharedGlobalMemory= 396M

#Assign this data node to a CPU-ID
#LockExecuteThreadToCPU=
#LockMaintThreadToCPU=
#DiskPageBufferMemory= 250M

# Set up node IDs for MySQL API servers (clients of the cluster)
[MYSQLD]
HostName=10.1.9.137
[MYSQLD]
HostName=10.1.9.135
#wait_timeout=100000
#interactive_timeout=100000
[MYSQLD]
HostName=10.1.9.136
#wait_timeout=100000
#interactive_timeout=100000
[MYSQLD]

[MYSQLD]

[MYSQLD]

[MYSQLD]

[MYSQLD]

[MYSQLD]
[25 Apr 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".