Description:
We have four Dell servers running as 2x dedicated data/SQL nodes and 2x management nodes (which double as mail servers, etc.). Each machine has a 1.86GHz dual-core Xeon, 4GB RAM, and 2x 300GB 10K RPM SAS drives in a mirrored RAID.
The cluster had been running as a development system on 5.1.28-ndb-6.3.18 until yesterday, when we were about to go live with it as a production system. In preparation, I started deleting the test data from the database so as to leave one month's data for reporting purposes. The schema has a few tables that use Disk Data storage, one of which in particular holds the contents of files uploaded to the system as Base-64 encoded strings of 300000 characters.
While I was trying to delete about 30000 rows from this table, both ndbd processes crashed (sorry, no trace files). I tried to restart the cluster, but both ndbd processes stopped with an internal error. I shut down the cluster, cleared the ndb_N_fs directories, restarted the ndbd processes with --initial, and then tried to use ndb_restore to load the last successful backup, taken at 06:30 that morning before the data deletion began.
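For scale, a back-of-envelope estimate of the data volume involved in that delete (assuming every deleted row carries a full 300000-character value; the exact counts are approximations from the figures above):

```python
# Rough estimate of the volume churned through the redo log by the delete.
# Row count and per-row length are the approximate figures from this report.
rows = 30_000                 # rows deleted from the file-contents table
chars_per_row = 300_000       # Base-64 encoded length per row

base64_bytes = rows * chars_per_row   # bytes of encoded data touched
raw_bytes = base64_bytes * 3 // 4     # Base-64 stores 3 raw bytes per 4 chars

print(f"encoded: {base64_bytes / 1e9:.1f} GB, raw: {raw_bytes / 1e9:.2f} GB")
```

That is on the order of 9 GB of encoded BLOB data in one delete pass, which may explain why the redo log and fragment log parameters were exhausted.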
The restore failed several times, each time asking me to increase NoOfFragmentLogFiles, TimeBetweenLocalCheckpoints, MaxNoOfConcurrentTransactions, or RedoBuffer in various combinations. The current values for these parameters can be seen in the config.ini collected by ndb_error_reporter, ready for submission with this bug.
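For reference, a sketch of the shutdown, wipe, and restore sequence described above. The node IDs (3 and 4), backup ID 72 (which appears in the mgmd.log below), and filesystem paths are assumptions for illustration; adjust them to the actual cluster layout:

```shell
# Stop all data nodes via the management server.
ndb_mgm -e "shutdown"

# On each data node: clear its filesystem directory (path is illustrative).
rm -rf /var/lib/mysql-cluster/ndb_3_fs/*   # on node 3
rm -rf /var/lib/mysql-cluster/ndb_4_fs/*   # on node 4

# On each data node: start with a clean (initial) filesystem.
ndbd --initial

# Restore metadata once (-m) from one node's backup, then data (-r)
# from each node's backup files.
ndb_restore -b 72 -n 3 -m -r --backup_path=/var/lib/mysql-cluster/BACKUP/BACKUP-72
ndb_restore -b 72 -n 4 -r    --backup_path=/var/lib/mysql-cluster/BACKUP/BACKUP-72
```

This requires a running management server and data nodes, so it is a procedure outline rather than something runnable standalone.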
This morning, I updated to 5.1.29-ndb-6.3.19 and tried to re-import the data. The restore was much faster but still failed, this time asking me to increase FragmentLogFileSize, which I did.
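For context, all of the parameters mentioned above live in the [ndbd default] section of config.ini. The values below are illustrative placeholders only; the actual values from this system are in the attached config.ini collected by ndb_error_reporter:

```ini
[ndbd default]
# Illustrative values only -- see the attached config.ini for the real settings.
NoOfFragmentLogFiles = 300
FragmentLogFileSize = 256M
RedoBuffer = 64M
TimeBetweenLocalCheckpoints = 20
MaxNoOfConcurrentTransactions = 16384
```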
Final attempt fails with:
Processing data in table: AMSLive/def/NDB$BLOB_163_4(164) fragment 0
Temporary error: 4010: Node failure caused abort of transaction
Unknown: 4009: Cluster Failure
Cannot start transaction
NDBT_ProgramExit: 1 - Failed
The end of the mgmd.log shows:
2008-12-02 09:41:14 [MgmSrvr] INFO -- Node 3: Local checkpoint 71 started. Keep GCI = 196 oldest restorable GCI = 197
2008-12-02 09:41:30 [MgmSrvr] INFO -- Node 3: Local checkpoint 72 started. Keep GCI = 200 oldest restorable GCI = 201
2008-12-02 09:42:03 [MgmSrvr] INFO -- Node 3: Local checkpoint 73 started. Keep GCI = 208 oldest restorable GCI = 209
2008-12-02 09:42:45 [MgmSrvr] INFO -- Node 5: Restore meta data: backup 72 from node 3 #Tables: 76
#Tablespaces: 1 #Logfilegroups: 1 #datafiles: 10 #undofiles: 1
2008-12-02 09:42:53 [MgmSrvr] INFO -- Node 3: Local checkpoint 74 started. Keep GCI = 224 oldest restorable GCI = 246
2008-12-02 09:43:57 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2008-12-02 09:43:57 [MgmSrvr] ALERT -- Node 4: Node 3 Disconnected
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: Communication to Node 3 closed
2008-12-02 09:43:57 [MgmSrvr] ALERT -- Node 4: Network partitioning - arbitration required
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: President restarts arbitration thread [state=7]
2008-12-02 09:43:57 [MgmSrvr] ALERT -- Node 4: Arbitration won - positive reply from node 1
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: GCP Take over started
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: GCP Take over completed
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: kk: 265/1 2 0
2008-12-02 09:43:57 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-12-02 09:43:57 [MgmSrvr] INFO -- Node 4: Started arbitrator node 1 [ticket=7d3500026c59415c]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: LCP Take over started
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: LCP Take over completed (state = 5)
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: ParticipatingDIH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000010
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=1 0000000000000010]
2008-12-02 09:44:04 [MgmSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2008-12-02 09:44:08 [MgmSrvr] ALERT -- Node 2: Node 4 Disconnected
2008-12-02 09:44:09 [MgmSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Please let me know if you require more information.
How to repeat:
The problem is consistently repeatable on our system. It could probably be duplicated using the BACKUP files and a similarly configured system.