MySQL Bugs: #30655: No error message that accompanies error 1501

Bug #30655	No error message that accompanies error 1501
Submitted:	27 Aug 2007 23:49	Modified:	28 Aug 2007 12:08
Reporter:	Mark Malakanov	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.1	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	5.1.22

Description:
there was no human readable message in multiple log and trace files, that could help to solve an issue when ndbd does not start, actually because of lack of undo space. 

Message like this: 
------ 
Error: 2303 
Error data: Killed by node 2 as copyfrag failed, error: 1501 
Error object: NDBCNTR (Line: 235) 0x0000000e 
------ 
does not explain the real meaning of error 1501, giving some cryptic "Killed by node 2 as copyfrag failed" statement, which could be helpful for support or product developers people, but it is quite misleading for DBAs. 

-------------
I've compiled mysql with ndb from entire source tree 5.1.22 
I've started first node, it works fine. However when I start a second one it loads data to memory for ~40 min then when it reaches ~700-850Mb the ndbd process fails. 

ndb_3_error.log sais: 
---------- 
Current byte-offset of file-pointer is: 568 
Time: Monday 20 August 2007 - 17:01:12 
Status: Temporary error, restart node 
Message: System error, node killed during node restart by other node (Internal e 
rror, programming error or missing error message, please report a bug) 
Error: 2303 
Error data: Killed by node 2 as copyfrag failed, error: 1501 
Error object: NDBCNTR (Line: 235) 0x0000000e 
Program: /usr/local/mysql/libexec/ndbd 
Pid: 7498 
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.1 
Version: Version 5.1.22 (beta) 
***EOM*** 
------------ 

ndb_3_out.log sais: 
------------ 
.... 
allocated 32 pages: [ m_page_no: 30723 m_file_no: 0 m_page_idx: 2324] 
allocated 32 pages: [ m_page_no: 30755 m_file_no: 0 m_page_idx: 2325] 
2007-08-20 17:01:12 [ndbd] INFO -- Killed by node 2 as copyfrag failed, erro 
r: 1501 
2007-08-20 17:01:12 [ndbd] INFO -- NDBCNTR (Line: 235) 0x0000000e 
2007-08-20 17:01:12 [ndbd] INFO -- Error handler startup shutting down syste 
m 
2007-08-20 17:01:13 [ndbd] INFO -- Error handler shutdown completed - aborti 
ng 
2007-08-20 17:01:13 [ndbd] INFO -- Angel received ndbd startup failure count 
1. 
2007-08-20 17:01:13 [ndbd] ALERT -- Node 3: Forced node shutdown completed. O 
ccured during startphase 5. Initiated by signal 6. Caused by error 2303: 'System 
error, node killed during node restart by other node(Internal error, programmin 
g error or missing error message, please report a bug) 
------------ 

managemnt node log ndb_1_cluster.log sais: 
------------ 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed 

2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Network partitioning - arbitra 
tion required 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: President restarts arbitration 
thread [state=7] 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected 
2007-08-20 17:01:12 [MgmSrvr] ALERT -- Node 2: Arbitration won - positive rep 
ly from node 1 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: remove lock by failed no 
de 3 for NodeRestart 
2007-08-20 17:01:12 [MgmSrvr] INFO -- Node 2: DICT: lock bs: 0 ops: 0 poll: 
0 cnt: 0 queue: 
2007-08-20 17:01:13 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [tic 
ket=51a30007851449e0] 
2007-08-20 17:01:13 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed 
. Occured during startphase 5. Initiated by signal 6. Caused by error 2303: 'Sys 
tem error, node killed during node restart by other node(Internal error, program 
ming error or missing error message, please report a bug) 
2007-08-20 17:01:16 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened 

2007-08-20 17:18:27 [MgmSrvr] INFO -- Node 2: Local checkpoint 731 started. 
Keep GCI = 636889 oldest restorable GCI = 583856 
------------ 

ndbd also generated a huge about 1Gb ndb_3_signal.log file and 2Mb ndb_3_trace.log.1. 

How to repeat:
Significantly decrease size of undo space for cluster tablespace i.e. by removing file(s) from logfile group. Or alternatively increase size of data. Restart ndbd. Look into log/trace files.

Suggested fix:
ndbd and ndb_mgmd must also print "Out of undo space" (sub)message whenever error 1501 happens.

The error is registered with an understandable message just fine

  ./storage/ndb/src/ndbapi/ndberror.c:  { 1501, DMEC, TR, "Out of undo space" },

i wonder why that isn't used in the cluster log ...

The error code can be resolved using

 $ perror --ndb 1501

  NDB error code 1501: Out of undo space: Temporary error: Temporary Resource error

i agree though that "out of undo space" should be included in the 
logged error message for clarity

See also bug#40312

See also FR in bug 50979:
  http://bugs.mysql.com/bug.php?id=50979