Description:
there was no human readable message in multiple log and trace files, that could help to solve an issue when ndbd does not start, actually because of lack of undo space.
Message like this:
------
Error: 2303
Error data: Killed by node 2 as copyfrag failed, error: 1501
Error object: NDBCNTR (Line: 235) 0x0000000e
------
does not explain the real meaning of error 1501, giving some cryptic "Killed by node 2 as copyfrag failed" statement, which could be helpful for support or product developers people, but it is quite misleading for DBAs.
-------------
I've compiled mysql with ndb from entire source tree 5.1.22
I've started first node, it works fine. However when I start a second one it loads data to memory for ~40 min then when it reaches ~700-850Mb the ndbd process fails.
ndb_3_error.log sais:
----------
Current byte-offset of file-pointer is: 568
Time: Monday 20 August 2007 - 17:01:12
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal e
rror, programming error or missing error message, please report a bug)
Error: 2303
Error data: Killed by node 2 as copyfrag failed, error: 1501
Error object: NDBCNTR (Line: 235) 0x0000000e
Program: /usr/local/mysql/libexec/ndbd
Pid: 7498
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.22 (beta)
***EOM***
------------
ndb_3_out.log sais:
------------
....
allocated 32 pages: [ m_page_no: 30723 m_file_no: 0 m_page_idx: 2324]
allocated 32 pages: [ m_page_no: 30755 m_file_no: 0 m_page_idx: 2325]
2007-08-20 17:01:12 [ndbd] INFO -- Killed by node 2 as copyfrag failed, erro
r: 1501
2007-08-20 17:01:12 [ndbd] INFO -- NDBCNTR (Line: 235) 0x0000000e
2007-08-20 17:01:12 [ndbd] INFO -- Error handler startup shutting down syste
m
2007-08-20 17:01:13 [ndbd] INFO -- Error handler shutdown completed - aborti
ng
2007-08-20 17:01:13 [ndbd] INFO -- Angel received ndbd startup failure count
1.
2007-08-20 17:01:13 [ndbd] ALERT -- Node 3: Forced node shutdown completed. O
ccured during startphase 5. Initiated by signal 6. Caused by error 2303: 'System
error, node killed during node restart by other node(Internal error, programmin
g error or missing error message, please report a bug)
------------
managemnt node log ndb_1_cluster.log sais:
------------
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Network partitioning - arbitra
tion required
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: President restarts arbitration
thread [state=7]
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Arbitration won - positive rep
ly from node 1
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: remove lock by failed no
de 3 for NodeRestart
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: lock bs: 0 ops: 0 poll:
0 cnt: 0 queue:
2007-08-20 17:01:13 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [tic
ket=51a30007851449e0]
2007-08-20 17:01:13 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed
. Occured during startphase 5. Initiated by signal 6. Caused by error 2303: 'Sys
tem error, node killed during node restart by other node(Internal error, program
ming error or missing error message, please report a bug)
2007-08-20 17:01:16 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened
2007-08-20 17:18:27 [MgmSrvr] INFO -- Node 2: Local checkpoint 731 started.
Keep GCI = 636889 oldest restorable GCI = 583856
------------
ndbd also generated a huge about 1Gb ndb_3_signal.log file and 2Mb ndb_3_trace.log.1.
How to repeat:
Significantly decrease size of undo space for cluster tablespace i.e. by removing file(s) from logfile group. Or alternatively increase size of data. Restart ndbd. Look into log/trace files.
Suggested fix:
ndbd and ndb_mgmd must also print "Out of undo space" (sub)message whenever error 1501 happens.