Bug #41169 ndb_restore of Disk Data fails with "Unknown: 4009: Cluster Failure"
Submitted: 2 Dec 2008 10:58 Modified: 11 Dec 2008 13:18
Reporter: John Hicks
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 5.1.29-ndb-6.3.19    OS: Linux (CentOS 5.0)
Assigned to:    CPU Architecture: Any
Tags: disk data, ndb_restore

[2 Dec 2008 10:58] John Hicks
Description:
We have four Dell servers: two running as dedicated combined data and SQL nodes, and two as management nodes (which double as mail servers, etc.). All machines are 1.86 GHz dual-core Xeons with 4 GB RAM and 2x 300 GB 10K rpm SAS drives in a mirrored RAID.

The cluster had been running as a development system on 5.1.28-ndb-6.3.18 until yesterday, when we were about to go live with it as a production system. In preparation for this, I started deleting the test data from the database so as to leave one month's data for reporting purposes. The schema has a few tables that use Disk Data storage, one in particular holding the contents of files uploaded to the system as Base64-encoded strings of around 300,000 characters.
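
For context, a Disk Data table of this kind is declared roughly as follows; the names, sizes and column definitions here are only an illustration, not our actual schema:

CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.log'
    INITIAL_SIZE 128M
    ENGINE NDBCLUSTER;

CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 256M
    ENGINE NDBCLUSTER;

-- Non-indexed columns are stored on disk in ts_1; the TEXT content itself is
-- kept in hidden blob tables like the NDB$BLOB_163_4 seen in the restore output below.
CREATE TABLE uploaded_files (
    id INT NOT NULL PRIMARY KEY,
    content LONGTEXT
) TABLESPACE ts_1 STORAGE DISK ENGINE NDBCLUSTER;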

While trying to delete about 30,000 rows from this table, both ndbd processes crashed (sorry, no trace files). I tried to restart the cluster, but both ndbd processes stopped with an internal error. I shut down the cluster, cleared the ndb_N_fs directories, restarted the ndbd processes with --initial, and then tried to use ndb_restore to load the last successful backup, taken at 06:30 that morning before the data deletion began.
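
For reference, the restore itself was run along these lines (a sketch only, with placeholder paths, though backup ID 72 and the data node IDs 3 and 4 match the management log below):

ndbd --initial                 # on each data node: wipe its file system and start fresh
ndb_restore -n 3 -b 72 -m --backup_path=/path/to/BACKUP/BACKUP-72   # restore metadata (once)
ndb_restore -n 3 -b 72 -r --backup_path=/path/to/BACKUP/BACKUP-72   # restore node 3's data
ndb_restore -n 4 -b 72 -r --backup_path=/path/to/BACKUP/BACKUP-72   # restore node 4's data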

The restore failed several times, asking me to change NoOfFragmentLogFiles, TimeBetweenLocalCheckpoints, MaxNoOfConcurrentTransactions, and RedoBuffer in various combinations. The current values for these can be seen in the config.ini collected by ndb_error_reporter, ready for submission with this bug.

This morning, I updated to 5.1.29-ndb-6.3.19 and tried to re-import the data. It was much faster, but it still failed, this time asking me to increase FragmentLogFileSize, which I did.
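
For illustration, these are the kinds of [ndbd default] settings in question in config.ini; the values below are placeholders rather than the ones in the attached file:

[ndbd default]
NoOfFragmentLogFiles = 32           # total redo log = NoOfFragmentLogFiles x 4 files x FragmentLogFileSize
FragmentLogFileSize = 64M           # size of each individual redo log file
RedoBuffer = 32M                    # per-node in-memory buffer in front of the redo log
MaxNoOfConcurrentTransactions = 16384
TimeBetweenLocalCheckpoints = 20    # base-2 log of the write volume (in 4-byte words) between LCPs; 20 is about 4 MB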

Final attempt fails with:

Processing data in table: AMSLive/def/NDB$BLOB_163_4(164) fragment 0
Temporary error: 4010: Node failure caused abort of transaction
Unknown: 4009: Cluster Failure
Cannot start transaction

NDBT_ProgramExit: 1 - Failed

The end of the mgmd.log shows:

2008-12-02 09:41:14 [MgmSrvr] INFO     -- Node 3: Local checkpoint 71 started. Keep GCI = 196 oldest restorable GCI = 197
2008-12-02 09:41:30 [MgmSrvr] INFO     -- Node 3: Local checkpoint 72 started. Keep GCI = 200 oldest restorable GCI = 201
2008-12-02 09:42:03 [MgmSrvr] INFO     -- Node 3: Local checkpoint 73 started. Keep GCI = 208 oldest restorable GCI = 209
2008-12-02 09:42:45 [MgmSrvr] INFO     -- Node 5: Restore meta data: backup 72 from node 3 #Tables: 76
 #Tablespaces: 1 #Logfilegroups: 1 #datafiles: 10 #undofiles: 1
2008-12-02 09:42:53 [MgmSrvr] INFO     -- Node 3: Local checkpoint 74 started. Keep GCI = 224 oldest restorable GCI = 246
2008-12-02 09:43:57 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2008-12-02 09:43:57 [MgmSrvr] ALERT    -- Node 4: Node 3 Disconnected
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: Communication to Node 3 closed
2008-12-02 09:43:57 [MgmSrvr] ALERT    -- Node 4: Network partitioning - arbitration required
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: President restarts arbitration thread [state=7]
2008-12-02 09:43:57 [MgmSrvr] ALERT    -- Node 4: Arbitration won - positive reply from node 1
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: GCP Take over started
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: GCP Take over completed
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: kk: 265/1 2 0
2008-12-02 09:43:57 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-12-02 09:43:57 [MgmSrvr] INFO     -- Node 4: Started arbitrator node 1 [ticket=7d3500026c59415c]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: LCP Take over started
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: LCP Take over completed (state = 5)
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: ParticipatingDIH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2008-12-02 09:44:08 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-12-02 09:44:09 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Please let me know if you require more information.

How to repeat:
The problem is consistently repeatable on our system. It could probably be duplicated using the BACKUP files and a similarly configured system.
[2 Dec 2008 11:00] John Hicks
NDB error capture

Attachment: ndb_error_report_20081202100714.tar.bz2 (application/octet-stream, text), 172.78 KiB.

[4 Dec 2008 14:01] Geert Vanderkelen
Hi,

Please set DiskPageBufferMemory and try again:

 http://johanandersson.blogspot.com/2008/10/diskpagebuffermemory.html
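
A minimal config.ini sketch of that change (the value is only an example; it should be sized against available RAM and the Disk Data working set, as the post explains):

[ndbd default]
DiskPageBufferMemory = 512M    # page cache for Disk Data tables; the default is 64M
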
[4 Dec 2008 14:24] John Hicks
Thank you for the information. Looking at the figures on the linked page, we have probably not installed sufficient RAM to support the buffers and pages. 

Since logging the bug report, we have decided to replace the MySQL Cluster solution with an InnoDB/DRBD/Heartbeat one, which should prove more suitable for a company of our size and our application needs, and still let us afford a MySQL support contract if we ever need to take one out.

This bug can be closed.