Bug #41202 | Error: 2341 during cluster backup causes multiple data node crash | ||
---|---|---|---|
Submitted: | 3 Dec 2008 14:15 | Modified: | 25 Apr 2009 11:15 |
Reporter: | Jonathan Carter | Email Updates: | |
Status: | No Feedback | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.0.45 | OS: | Linux (RHEL4) |
Assigned to: | CPU Architecture: | Any | |
Tags: | Backup, cluster, data node, ndb |
[3 Dec 2008 14:15]
Jonathan Carter
[11 Dec 2008 13:08]
Martin Skold
Has this bug been verified on newer versions? Try ndb-6.2 for example.
[11 Dec 2008 13:16]
Jonathan Carter
no, this is a production environment and I do not have enough hardware to set up a parallel environment to test on. Also even if I could set up such a test rig I actually still cannot backup the cluster so there is no way for me to get the data over to a 6.x test rig at present. jc
[11 Dec 2008 13:34]
Jonathan Carter
via the mailing list I have suggestion that it is caused by bug30172, however I cannot see this bug - could you check if this is correct. Jonathan
[16 Dec 2008 13:43]
Nicolas Diguet
I have the same Bug with 5.1.24 ndb-6.3.16-telco version after a lot of insertion. Nicolas
[16 Dec 2008 16:34]
Jonathan Carter
Do you also have the problem that it brings down both data nodes? I was talking to people on the messageboard and my current reccommendation was to embark on an upgrade to 6.x; if it is still possible for a bug to bring down both data nodes, therefore poluting the idea of no single point of failure then I will delay this until we have a solution.
[24 Dec 2008 10:02]
Nicolas Diguet
Yes , it brings down both data nodes. I can't populate my database and my data is lost (problem with backup.cpp process and then with restore.cpp process if i want to restart nodes)
[24 Dec 2008 10:24]
Nicolas Diguet
My ndb error log trace (6.3.20-innodb) Time: Wednesday 23 December 2008 - 10:13:37 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: backup/Backup.cpp Error object: BACKUP (Line: 4652) 0x0000000e Program: /usr/local/mysql-5.1.30-ndb-6.3.20-innodb/libexec/ndbd Pid: 22346 Trace: /var/lib/mysql-cluster/logs/ndb_1_trace.log.5 Version: mysql-5.1.30 ndb-6.3.20-GA ***EOM*** if i want to restart (without init) Time: Tuesday 24 December 2008 - 17:46:32 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: restore.cpp Error object: RESTORE (Line: 487) 0x0000000a Program: /usr/local/mysql-5.1.30-ndb-6.3.20-innodb/libexec/ndbd Pid: 22195 Trace: /var/lib/mysql-cluster/logs/ndb_1_trace.log.6 Version: mysql-5.1.30 ndb-6.3.20-GA ***EOM***
[25 Mar 2009 11:15]
Jonathan Miller
Provide configuration
[2 Apr 2009 9:04]
Nicolas Diguet
Here my config.ini 6 Servers : 2 Data nodes (16Go Ram) 3 Api 1 Controler # # This config.ini file was generated by ndbinstaller. # [NDBD DEFAULT] #The number of copies of the data stored on different nodes in the cluster NoOfReplicas=1 #The amount of main memory (RAM) used to store columns and ordered indexes in tables, plus some overhead DataMemory=12150M #The amount of main memory (RAM) used to hash indexes in tables, plus some overhead #Usually between 1/6 or 1/8 of the DataMemory is enough, but depends on the #number of unique hash indexes (UNIQUE in table def) IndexMemory=1500M #If you have lots of tables (>1000), then you may get 'Error 773' and need to increase this (25 would be reasonable) StringMemory=25 #Set NoOfFragmentLogFiles to 6xDataMemory [in MB]/(4 *FragmentLogFileSize [in MB] #Thus, NoOfFragmentLogFiles=6*2048/1024=12 #The "6 (or 4) xDataMemory" is a good heuristic and is STRONGLY recommended. #The amount of disk space (NoOfFragmentLogFiles * 64MB) used to store the Redo Log (used for node recovery) NoOfFragmentLogFiles=77 #The size of a Redo Log file on disk. FragmentLogFileSize=256M #The speed at which LocalCheckpoints (LCP) are written to disk DiskCheckpointSpeed=10M #The speed at which LocalCheckpoints (LCP) are written to disk, as part of a node recovery DiskCheckpointSpeedInRestart=100M #The size of the RedoLog Buffer in memory; Reduce for improved disk throughput (but slower recovery time) DiskSyncSize=10M #The size of the RedoBuffer used to buffer writes to the disk subsystem. Increase for high write-rate or slow disk. RedoBuffer=386M #The maximum time in ms that is permitted to lapse between operations in the same transaction before the transaction is aborted. TransactionInactiveTimeout=8000 #Time in ms between global checkpoint groups are flushed from memory to disk TimeBetweenGlobalCheckpoints=1000 #Time in ms between local checkpoints of memory to local disk. Increase this if system is highly loaded to improve node restart times. TimeBetweenLocalCheckpoints=21 #Time in ms between replication events sent up from cluster TimeBetweenEpochs=100 #Represents the number of seconds between memory usage reports written to the cluster log MemReportFrequency=10 #This prevents ndbd processes and DB memory from being swapped out to disk LockPagesInMainMemory=1 # # TRANSACTION PARAMETERS # #The number of transaction records available at each data node. Increase this if you have a higher number of concurrent transactions MaxNoOfConcurrentTransactions=4096 #The number of operation records available at each data node. Increase this if you have a higher number of concurrent operations MaxNoOfConcurrentOperations=32768 #The number of operation records available at each data node for queries that use a hash index. Increase this if you have a higher number of concurrent operations MaxNoOfConcurrentIndexOperations=8192 # # Scans and Buffering # # Max number of parallel scans. Max value is 500. MaxNoOfConcurrentScans=256 #Note: 'alter table' requires 3 times the number of attributes that are in the original table. MaxNoOfAttributes=24576 #A table is required not just for a table, but also for every unique hash index and every ordered index. Maximum value is 20320, minimum is 8. MaxNoOfTables=4096 #Each ordered index requires 10KB of memory. MaxNoOfOrderedIndexes=2048 #Each unique hash index requires 15KB of memory. MaxNoOfUniqueHashIndexes=4096 #Replication, unique index operations, backups and order index operations require trigger objects. MaxNoOfTriggers=768 # #Backup Parameters #The following must hold when updating backup params: # 1. BackupDataBufferSize >= BackupWriteSize + 188KB # 2. BackupLogBufferSize >= BackupWriteSize + 16KB # 3. BackupMaxWriteSize >= BackupWriteSize # 4. BackupMemory = BackupDataBufferSize + BackupLogBufferSize BackupMemory=1856M #Increase if slow disk subsystem when making a backup. BackupDataBufferSize=928M #Increase if slow disk subsystem when making a backup. BackupLogBufferSize=928M #Default size of msgs written to disk. This value must be less than or equal to BackupMaxWriteSize. BackupWriteSize=32000 #Max size of msgs written to disk. This value must be greater than or equal to than BackupMaxWriteSize. BackupMaxWriteSize=256000 #Frequency of Backups BackupReportFrequency=100 # # Real-time params # TransactionDeadlockDetectionTimeout=8000 #Setting this to '1' enables real-time scheduling of ndb threads. #RealtimeScheduler=0 #Time in microseconds for threads to be executed in the scheduler before being sent. #SchedulerExecutionTimer=50 #Time in microseconds for threads to be executed in the scheduler before sleeping. #SchedulerSpinTimer=0 #Whether a ndbd process should halt or be restarted if there is an error in it. StopOnError=true #where disk-based user data will be used for the cluster Diskless=false #Data nodes attempt to use O_DIRECT when writing LCPs, backups and redo logs. #Should be disabled for 2.4 or older kernels, enabled for 2.6 or higher kernels ODirect=true #[SHM] #NodeId1=1 #NodeId2=2 #ShmKey=324 #ShmSize=4M [MYSQLD DEFAULT] [NDB_MGMD DEFAULT] [TCP DEFAULT] # I found this settting buggy with 5.1.17-beta, when starting ndb_mgmd process. # SendBufferMemory=2M # Management Server [NDB_MGMD] HostName=10.1.9.137 PortNumber=1186 DataDir=/var/lib/mysql-cluster/logs LogDestination=FILE:filename=/var/lib/mysql-cluster/logs/cluster.log,maxsize=1000000,maxfiles=6 NodeID=63 [NDBD] HostName=10.1.9.56 DataDir=/var/lib/mysql-cluster/logs FileSystemPath=/space/ndbdata/1 NodeID=1 DiskPageBufferMemory= 754M SharedGlobalMemory= 396M #Assign this data node to a CPU-ID #LockExecuteThreadToCPU= #LockMaintThreadToCPU= #DiskPageBufferMemory= 250M [NDBD] HostName=10.1.9.57 DataDir=/var/lib/mysql-cluster/logs FileSystemPath=/space/ndbdata/2 NodeID=2 DiskPageBufferMemory= 724M SharedGlobalMemory= 396M #Assign this data node to a CPU-ID #LockExecuteThreadToCPU= #LockMaintThreadToCPU= #DiskPageBufferMemory= 250M # Setup node IDs for mySQL API-servers (clients of the cluster) [MYSQLD] HostName=10.1.9.137 [MYSQLD] HostName=10.1.9.135 #wait_timeout=100000 #interactive_timeout=100000 [MYSQLD] HostName=10.1.9.136 #wait_timeout=100000 #interactive_timeout=100000 [MYSQLD] [MYSQLD] [MYSQLD] [MYSQLD] [MYSQLD] [MYSQLD]
[25 Apr 2009 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".