Bug #30655 No error message that accompanies error 1501
Submitted: 27 Aug 2007 23:49 Modified: 28 Aug 2007 12:08
Reporter: Mark Malakanov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1 OS:Linux
Assigned to: CPU Architecture:Any
Tags: 5.1.22
Triage: Triaged: D4 (Minor)

[27 Aug 2007 23:49] Mark Malakanov
Description:
there was no human readable message in multiple log and trace files, that could help to solve an issue when ndbd does not start, actually because of lack of undo space. 

Message like this: 
------ 
Error: 2303 
Error data: Killed by node 2 as copyfrag failed, error: 1501 
Error object: NDBCNTR (Line: 235) 0x0000000e 
------ 
does not explain the real meaning of error 1501, giving some cryptic "Killed by node 2 as copyfrag failed" statement, which could be helpful for support or product developers people, but it is quite misleading for DBAs. 

-------------
I've compiled mysql with ndb from entire source tree 5.1.22 
I've started first node, it works fine. However when I start a second one it loads data to memory for ~40 min then when it reaches ~700-850Mb the ndbd process fails. 

ndb_3_error.log sais: 
---------- 
Current byte-offset of file-pointer is: 568 
Time: Monday 20 August 2007 - 17:01:12 
Status: Temporary error, restart node 
Message: System error, node killed during node restart by other node (Internal e 
rror, programming error or missing error message, please report a bug) 
Error: 2303 
Error data: Killed by node 2 as copyfrag failed, error: 1501 
Error object: NDBCNTR (Line: 235) 0x0000000e 
Program: /usr/local/mysql/libexec/ndbd 
Pid: 7498 
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1 
Version: Version 5.1.22 (beta) 
***EOM*** 
------------ 

ndb_3_out.log sais: 
------------ 
.... 
allocated 32 pages: [ m_page_no: 30723 m_file_no: 0 m_page_idx: 2324] 
allocated 32 pages: [ m_page_no: 30755 m_file_no: 0 m_page_idx: 2325] 
2007-08-20 17:01:12 [ndbd] INFO -- Killed by node 2 as copyfrag failed, erro 
r: 1501 
2007-08-20 17:01:12 [ndbd] INFO -- NDBCNTR (Line: 235) 0x0000000e 
2007-08-20 17:01:12 [ndbd] INFO -- Error handler startup shutting down syste 
m 
2007-08-20 17:01:13 [ndbd] INFO -- Error handler shutdown completed - aborti 
ng 
2007-08-20 17:01:13 [ndbd] INFO -- Angel received ndbd startup failure count 
1. 
2007-08-20 17:01:13 [ndbd] ALERT -- Node 3: Forced node shutdown completed. O 
ccured during startphase 5. Initiated by signal 6. Caused by error 2303: 'System 
error, node killed during node restart by other node(Internal error, programmin 
g error or missing error message, please report a bug) 
------------ 

managemnt node log ndb_1_cluster.log sais: 
------------ 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed 

2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Network partitioning - arbitra 
tion required 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: President restarts arbitration 
thread [state=7] 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Arbitration won - positive rep 
ly from node 1 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: remove lock by failed no 
de 3 for NodeRestart 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: lock bs: 0 ops: 0 poll: 
0 cnt: 0 queue: 
2007-08-20 17:01:13 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [tic 
ket=51a30007851449e0] 
2007-08-20 17:01:13 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed 
. Occured during startphase 5. Initiated by signal 6. Caused by error 2303: 'Sys 
tem error, node killed during node restart by other node(Internal error, program 
ming error or missing error message, please report a bug) 
2007-08-20 17:01:16 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened 

2007-08-20 17:18:27 [MgmSrvr] INFO -- Node 2: Local checkpoint 731 started. 
Keep GCI = 636889 oldest restorable GCI = 583856 
------------ 

ndbd also generated a huge about 1Gb ndb_3_signal.log file and 2Mb ndb_3_trace.log.1. 

How to repeat:
Significantly decrease size of undo space for cluster tablespace i.e. by removing file(s) from logfile group. Or alternatively increase size of data. Restart ndbd. Look into log/trace files.

Suggested fix:
ndbd and ndb_mgmd must also print "Out of undo space" (sub)message whenever error 1501 happens.
[28 Aug 2007 12:00] Hartmut Holzgraefe
The error is registered with an understandable message just fine

  ./storage/ndb/src/ndbapi/ndberror.c:  { 1501, DMEC, TR, "Out of undo space" },

i wonder why that isn't used in the cluster log ...
[28 Aug 2007 12:08] Hartmut Holzgraefe
The error code can be resolved using

 $ perror --ndb 1501

  NDB error code 1501: Out of undo space: Temporary error: Temporary Resource error

i agree though that "out of undo space" should be included in the 
logged error message for clarity
[8 Dec 2008 13:47] Frazer Clement
See also bug#40312
[25 May 2012 11:16] Gustaf Thorslund
See also FR in bug 50979:
  http://bugs.mysql.com/bug.php?id=50979