Bug #27942 node failure during massive insert
Submitted: 18 Apr 2007 23:59 Modified: 15 May 2007 6:48
Reporter: LR Mack McBride Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S3 (Non-critical)
Version:5.1.18 OS:Linux (CentOS 4.4)
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: disk tables, error, inserting, node failure

[18 Apr 2007 23:59] LR Mack McBride
Description:
ERROR 1297 (HY000) at line 1883: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

and

ERROR 1297 (HY000) at line 1879: Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER

We have a data node pair (replica 2).
restoring the database from a mysqldump causes a node failure.
The master node is generally uneffected.
Sometimes both nodes crash.
It generally occurs in the same section of the dump file but not always.

If one data node is down the restore is successful.
This leads me to believe it is a problem with the disk table code or
the replication code.

cluster log

2007-04-18 18:19:31 [MgmSrvr] INFO     -- Node 3: Local checkpoint 7 started. Keep GCI = 743 oldest restorable GCI = 747
2007-04-18 18:19:56 [MgmSrvr] INFO     -- Node 5: Event buffer status: used=56MB(100%) alloc=56MB(0%) max=0B apply_gci=783 latest_gci=783
2007-04-18 18:20:28 [MgmSrvr] INFO     -- Node 4: Data usage is 7%(2300 32K pages of total 32768)
2007-04-18 18:20:28 [MgmSrvr] INFO     -- Node 4: Index usage is 7%(795 8K pages of total 10272)
2007-04-18 18:20:29 [MgmSrvr] INFO     -- Node 3: Index usage is 7%(795 8K pages of total 10272)
2007-04-18 18:20:49 [MgmSrvr] INFO     -- Node 3: Local checkpoint 8 started. Keep GCI = 772 oldest restorable GCI = 787
2007-04-18 18:22:28 [MgmSrvr] INFO     -- Node 4: Data usage is 9%(3213 32K pages of total 32768)
2007-04-18 18:22:28 [MgmSrvr] INFO     -- Node 4: Index usage is 9%(941 8K pages of total 10272)
2007-04-18 18:22:29 [MgmSrvr] INFO     -- Node 3: Index usage is 9%(941 8K pages of total 10272)
2007-04-18 18:22:30 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2007-04-18 18:22:30 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2007-04-18 18:22:30 [MgmSrvr] ALERT    -- Node 4: Network partitioning - arbitration required
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: President restarts arbitration thread [state=7]
2007-04-18 18:22:30 [MgmSrvr] ALERT    -- Node 4: Arbitration won - positive reply from node 2
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: GCP Take over started
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: LCP Take over started
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: GCP Take over completed
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: LCP Take over completed (state = 6)
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2007-04-18 18:22:30 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2007-04-18 18:22:31 [MgmSrvr] INFO     -- Node 4: Started arbitrator node 1 [ticket=601500020881ba92]
2007-04-18 18:22:46 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 opened
2007-04-18 18:24:29 [MgmSrvr] INFO     -- Node 4: Data usage is 9%(3273 32K pages of total 32768)
2007-04-18 18:24:29 [MgmSrvr] INFO     -- Node 4: Index usage is 9%(941 8K pages of total 10272)

node error log:

Time: Wednesday 18 April 2007 - 18:22:30
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: main.cpp
Program: ndbd
Pid: 5183
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.17 (beta)
***EOM***

last portion of node output log:
allocated 32 pages: [ m_page_no: 3714 m_file_no: 3 m_page_idx: 1480]
allocated 32 pages: [ m_page_no: 3746 m_file_no: 3 m_page_idx: 1481]
allocated 32 pages: [ m_page_no: 3778 m_file_no: 3 m_page_idx: 1482]
allocated 32 pages: [ m_page_no: 3810 m_file_no: 3 m_page_idx: 1483]
allocated 32 pages: [ m_page_no: 3842 m_file_no: 3 m_page_idx: 1484]
allocated 32 pages: [ m_page_no: 3874 m_file_no: 3 m_page_idx: 1485]
2007-04-18 18:22:30 [ndbd] INFO     -- Received signal 6. Running error handler.
2007-04-18 18:22:30 [ndbd] INFO     -- Signal 6 received; Aborted
2007-04-18 18:22:30 [ndbd] INFO     -- main.cpp
2007-04-18 18:22:30 [ndbd] INFO     -- Error handler signal shutting down system

How to repeat:
config.ini

[TCP DEFAULT]
SendBufferMemory= 8M

[NDBD DEFAULT]
NoOfReplicas= 2
DataDir= /var/lib/mysql-cluster
DataMemory= 1024M
IndexMemory= 80M
MaxNoofTables=400
MaxNoofAttributes= 4000
MaxNoOfConcurrentOperations= 166000
MaxNoOfOrderedIndexes=1000
MaxNoOfUniqueHashIndexes=1000
MemReportFrequency= 120
SharedGlobalMemory= 96M
NoOfFragmentLogFiles=32
RedoBuffer= 32M
MaxNoOfOpenFiles=60

[NDB_MGMD]
Id=1
Hostname= 64.72.127.150
DataDir= /var/lib/mysql-cluster
ArbitrationRank= 1

[NDB_MGMD]
id=2
Hostname= 64.72.127.183
DataDir= /var/lib/mysql-cluster
ArbitrationRank= 2

[NDBD]
Id=3
HostName= 64.72.127.183
[NDBD]
Id=4
HostName= 64.72.127.152

[MYSQLD]
Id=5
HostName= 64.72.127.183
ArbitrationRank= 0
[MYSQLD]
id=6
HostName= 64.72.127.184
ArbitrationRank= 0

[MYSQLD]
id=7
HostName= 64.72.127.152
ArbitrationRank= 0

[MYSQLD]
Id=8
HostName= 64.72.127.150
ArbitrationRank= 0

disktable setup
CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 256M
    UNDO_BUFFER_SIZE 64M
    ENGINE NDB;

ALTER LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_2.dat'
    INITIAL_SIZE 256M
    ENGINE NDB;

CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 512M
    ENGINE NDB;

ALTER TABLESPACE ts_1
    ADD DATAFILE 'data_2.dat'
    INITIAL_SIZE 512M
    ENGINE NDB;

ALTER TABLESPACE ts_1
    ADD DATAFILE 'data_3.dat'
    INITIAL_SIZE 512M
    ENGINE NDB;

ALTER TABLESPACE ts_1
    ADD DATAFILE 'data_4.dat'
    INITIAL_SIZE 512M
    ENGINE NDB;

create database [database];

then from command line execute a data restore of the old database.

mysql -u root [database] < test2.sql
[19 Apr 2007 5:34] LR Mack McBride
This problem appears only when disk tables are used.
If the tables that are disk are converted to ram it does not occur.
If the tablespaces are removed and regular ndb inserts are used the problem does not occur.

This seems to indicate that the disk table are somehow part of the problem.
[19 Apr 2007 7:50] Adam Dixon
Hello,
Your sample data file does not specify to use disk data for non-indexed columns.
They all seem to be ENGINE=NDBCLUSTER but they should be TABLESPACE ts_1 STORAGE DISK ENGINE NDB;

Did you provide the right dump file?
[19 Apr 2007 8:19] LR Mack McBride
Only 20 of the 159 tables use disk, the remainder use RAM.
These tables make up more than half of the total space used.
The table where the crashes seem to occur most often is _vb_post.
This is a disk table.

There is also one myisam table as it contains a larger number of text fields than the cluster will support (8kb row limit).
[23 Apr 2007 4:01] Adam Dixon
Verified.
[2 May 2007 10:38] Jonas Oreland
Adam, can you test with the patch that I attached?

/Jonas
[2 May 2007 10:38] Jonas Oreland
Possible fix

Attachment: patch (application/octet-stream, text), 436 bytes.

[2 May 2007 23:38] Adam Dixon
Jonas, your patch works. I could not reproduce as I did before.
[7 May 2007 14:03] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26225

ChangeSet@1.2473, 2007-05-07 16:02:57+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#27942
    Increase size of memory channel
[7 May 2007 14:07] Jonas Oreland
pushed into 51-ndb, 51-telco & telco-6.1
[10 May 2007 12:57] Bugs System
Pushed into 5.1.19-beta
[15 May 2007 6:48] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.19 and telco-6.1.17 changelogs.
[15 May 2007 6:50] Jon Stephens
Changed category to Cluster:Disk Data.