Description:
ERROR 1297 (HY000) at line 1883: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
and
ERROR 1297 (HY000) at line 1879: Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER
We have a data node pair (replica 2).
restoring the database from a mysqldump causes a node failure.
The master node is generally uneffected.
Sometimes both nodes crash.
It generally occurs in the same section of the dump file but not always.
If one data node is down the restore is successful.
This leads me to believe it is a problem with the disk table code or
the replication code.
cluster log
2007-04-18 18:19:31 [MgmSrvr] INFO -- Node 3: Local checkpoint 7 started. Keep GCI = 743 oldest restorable GCI = 747
2007-04-18 18:19:56 [MgmSrvr] INFO -- Node 5: Event buffer status: used=56MB(100%) alloc=56MB(0%) max=0B apply_gci=783 latest_gci=783
2007-04-18 18:20:28 [MgmSrvr] INFO -- Node 4: Data usage is 7%(2300 32K pages of total 32768)
2007-04-18 18:20:28 [MgmSrvr] INFO -- Node 4: Index usage is 7%(795 8K pages of total 10272)
2007-04-18 18:20:29 [MgmSrvr] INFO -- Node 3: Index usage is 7%(795 8K pages of total 10272)
2007-04-18 18:20:49 [MgmSrvr] INFO -- Node 3: Local checkpoint 8 started. Keep GCI = 772 oldest restorable GCI = 787
2007-04-18 18:22:28 [MgmSrvr] INFO -- Node 4: Data usage is 9%(3213 32K pages of total 32768)
2007-04-18 18:22:28 [MgmSrvr] INFO -- Node 4: Index usage is 9%(941 8K pages of total 10272)
2007-04-18 18:22:29 [MgmSrvr] INFO -- Node 3: Index usage is 9%(941 8K pages of total 10272)
2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed
2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Network partitioning - arbitration required
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]
2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 2
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: GCP Take over started
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: LCP Take over started
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: GCP Take over completed
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: LCP Take over completed (state = 6)
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2007-04-18 18:22:31 [MgmSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=601500020881ba92]
2007-04-18 18:22:46 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened
2007-04-18 18:24:29 [MgmSrvr] INFO -- Node 4: Data usage is 9%(3273 32K pages of total 32768)
2007-04-18 18:24:29 [MgmSrvr] INFO -- Node 4: Index usage is 9%(941 8K pages of total 10272)
node error log:
Time: Wednesday 18 April 2007 - 18:22:30
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: main.cpp
Program: ndbd
Pid: 5183
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.17 (beta)
***EOM***
last portion of node output log:
allocated 32 pages: [ m_page_no: 3714 m_file_no: 3 m_page_idx: 1480]
allocated 32 pages: [ m_page_no: 3746 m_file_no: 3 m_page_idx: 1481]
allocated 32 pages: [ m_page_no: 3778 m_file_no: 3 m_page_idx: 1482]
allocated 32 pages: [ m_page_no: 3810 m_file_no: 3 m_page_idx: 1483]
allocated 32 pages: [ m_page_no: 3842 m_file_no: 3 m_page_idx: 1484]
allocated 32 pages: [ m_page_no: 3874 m_file_no: 3 m_page_idx: 1485]
2007-04-18 18:22:30 [ndbd] INFO -- Received signal 6. Running error handler.
2007-04-18 18:22:30 [ndbd] INFO -- Signal 6 received; Aborted
2007-04-18 18:22:30 [ndbd] INFO -- main.cpp
2007-04-18 18:22:30 [ndbd] INFO -- Error handler signal shutting down system
How to repeat:
config.ini
[TCP DEFAULT]
SendBufferMemory= 8M
[NDBD DEFAULT]
NoOfReplicas= 2
DataDir= /var/lib/mysql-cluster
DataMemory= 1024M
IndexMemory= 80M
MaxNoofTables=400
MaxNoofAttributes= 4000
MaxNoOfConcurrentOperations= 166000
MaxNoOfOrderedIndexes=1000
MaxNoOfUniqueHashIndexes=1000
MemReportFrequency= 120
SharedGlobalMemory= 96M
NoOfFragmentLogFiles=32
RedoBuffer= 32M
MaxNoOfOpenFiles=60
[NDB_MGMD]
Id=1
Hostname= 64.72.127.150
DataDir= /var/lib/mysql-cluster
ArbitrationRank= 1
[NDB_MGMD]
id=2
Hostname= 64.72.127.183
DataDir= /var/lib/mysql-cluster
ArbitrationRank= 2
[NDBD]
Id=3
HostName= 64.72.127.183
[NDBD]
Id=4
HostName= 64.72.127.152
[MYSQLD]
Id=5
HostName= 64.72.127.183
ArbitrationRank= 0
[MYSQLD]
id=6
HostName= 64.72.127.184
ArbitrationRank= 0
[MYSQLD]
id=7
HostName= 64.72.127.152
ArbitrationRank= 0
[MYSQLD]
Id=8
HostName= 64.72.127.150
ArbitrationRank= 0
disktable setup
CREATE LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_1.dat'
INITIAL_SIZE 256M
UNDO_BUFFER_SIZE 64M
ENGINE NDB;
ALTER LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_2.dat'
INITIAL_SIZE 256M
ENGINE NDB;
CREATE TABLESPACE ts_1
ADD DATAFILE 'data_1.dat'
USE LOGFILE GROUP lg_1
INITIAL_SIZE 512M
ENGINE NDB;
ALTER TABLESPACE ts_1
ADD DATAFILE 'data_2.dat'
INITIAL_SIZE 512M
ENGINE NDB;
ALTER TABLESPACE ts_1
ADD DATAFILE 'data_3.dat'
INITIAL_SIZE 512M
ENGINE NDB;
ALTER TABLESPACE ts_1
ADD DATAFILE 'data_4.dat'
INITIAL_SIZE 512M
ENGINE NDB;
create database [database];
then from command line execute a data restore of the old database.
mysql -u root [database] < test2.sql
Description: ERROR 1297 (HY000) at line 1883: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER and ERROR 1297 (HY000) at line 1879: Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER We have a data node pair (replica 2). restoring the database from a mysqldump causes a node failure. The master node is generally uneffected. Sometimes both nodes crash. It generally occurs in the same section of the dump file but not always. If one data node is down the restore is successful. This leads me to believe it is a problem with the disk table code or the replication code. cluster log 2007-04-18 18:19:31 [MgmSrvr] INFO -- Node 3: Local checkpoint 7 started. Keep GCI = 743 oldest restorable GCI = 747 2007-04-18 18:19:56 [MgmSrvr] INFO -- Node 5: Event buffer status: used=56MB(100%) alloc=56MB(0%) max=0B apply_gci=783 latest_gci=783 2007-04-18 18:20:28 [MgmSrvr] INFO -- Node 4: Data usage is 7%(2300 32K pages of total 32768) 2007-04-18 18:20:28 [MgmSrvr] INFO -- Node 4: Index usage is 7%(795 8K pages of total 10272) 2007-04-18 18:20:29 [MgmSrvr] INFO -- Node 3: Index usage is 7%(795 8K pages of total 10272) 2007-04-18 18:20:49 [MgmSrvr] INFO -- Node 3: Local checkpoint 8 started. Keep GCI = 772 oldest restorable GCI = 787 2007-04-18 18:22:28 [MgmSrvr] INFO -- Node 4: Data usage is 9%(3213 32K pages of total 32768) 2007-04-18 18:22:28 [MgmSrvr] INFO -- Node 4: Index usage is 9%(941 8K pages of total 10272) 2007-04-18 18:22:29 [MgmSrvr] INFO -- Node 3: Index usage is 9%(941 8K pages of total 10272) 2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected 2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed 2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Network partitioning - arbitration required 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=7] 2007-04-18 18:22:30 [MgmSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 2 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: GCP Take over started 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: LCP Take over started 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: GCP Take over completed 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: LCP Take over completed (state = 6) 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000] 2007-04-18 18:22:30 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0 2007-04-18 18:22:31 [MgmSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=601500020881ba92] 2007-04-18 18:22:46 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened 2007-04-18 18:24:29 [MgmSrvr] INFO -- Node 4: Data usage is 9%(3273 32K pages of total 32768) 2007-04-18 18:24:29 [MgmSrvr] INFO -- Node 4: Index usage is 9%(941 8K pages of total 10272) node error log: Time: Wednesday 18 April 2007 - 18:22:30 Status: Temporary error, restart node Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug) Error: 6000 Error data: Signal 6 received; Aborted Error object: main.cpp Program: ndbd Pid: 5183 Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1 Version: Version 5.1.17 (beta) ***EOM*** last portion of node output log: allocated 32 pages: [ m_page_no: 3714 m_file_no: 3 m_page_idx: 1480] allocated 32 pages: [ m_page_no: 3746 m_file_no: 3 m_page_idx: 1481] allocated 32 pages: [ m_page_no: 3778 m_file_no: 3 m_page_idx: 1482] allocated 32 pages: [ m_page_no: 3810 m_file_no: 3 m_page_idx: 1483] allocated 32 pages: [ m_page_no: 3842 m_file_no: 3 m_page_idx: 1484] allocated 32 pages: [ m_page_no: 3874 m_file_no: 3 m_page_idx: 1485] 2007-04-18 18:22:30 [ndbd] INFO -- Received signal 6. Running error handler. 2007-04-18 18:22:30 [ndbd] INFO -- Signal 6 received; Aborted 2007-04-18 18:22:30 [ndbd] INFO -- main.cpp 2007-04-18 18:22:30 [ndbd] INFO -- Error handler signal shutting down system How to repeat: config.ini [TCP DEFAULT] SendBufferMemory= 8M [NDBD DEFAULT] NoOfReplicas= 2 DataDir= /var/lib/mysql-cluster DataMemory= 1024M IndexMemory= 80M MaxNoofTables=400 MaxNoofAttributes= 4000 MaxNoOfConcurrentOperations= 166000 MaxNoOfOrderedIndexes=1000 MaxNoOfUniqueHashIndexes=1000 MemReportFrequency= 120 SharedGlobalMemory= 96M NoOfFragmentLogFiles=32 RedoBuffer= 32M MaxNoOfOpenFiles=60 [NDB_MGMD] Id=1 Hostname= 64.72.127.150 DataDir= /var/lib/mysql-cluster ArbitrationRank= 1 [NDB_MGMD] id=2 Hostname= 64.72.127.183 DataDir= /var/lib/mysql-cluster ArbitrationRank= 2 [NDBD] Id=3 HostName= 64.72.127.183 [NDBD] Id=4 HostName= 64.72.127.152 [MYSQLD] Id=5 HostName= 64.72.127.183 ArbitrationRank= 0 [MYSQLD] id=6 HostName= 64.72.127.184 ArbitrationRank= 0 [MYSQLD] id=7 HostName= 64.72.127.152 ArbitrationRank= 0 [MYSQLD] Id=8 HostName= 64.72.127.150 ArbitrationRank= 0 disktable setup CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_1.dat' INITIAL_SIZE 256M UNDO_BUFFER_SIZE 64M ENGINE NDB; ALTER LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE 256M ENGINE NDB; CREATE TABLESPACE ts_1 ADD DATAFILE 'data_1.dat' USE LOGFILE GROUP lg_1 INITIAL_SIZE 512M ENGINE NDB; ALTER TABLESPACE ts_1 ADD DATAFILE 'data_2.dat' INITIAL_SIZE 512M ENGINE NDB; ALTER TABLESPACE ts_1 ADD DATAFILE 'data_3.dat' INITIAL_SIZE 512M ENGINE NDB; ALTER TABLESPACE ts_1 ADD DATAFILE 'data_4.dat' INITIAL_SIZE 512M ENGINE NDB; create database [database]; then from command line execute a data restore of the old database. mysql -u root [database] < test2.sql