Bug #22299 mgmd crash due to unchecked TransporterFacade::ThreadData expand()
Submitted: 13 Sep 2006 7:41
Modified: 3 Jan 2007 3:32
Reporter: Stewart Smith
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.0, 5.1
Assigned to: Stewart Smith
CPU Architecture: Any

[13 Sep 2006 7:41] Stewart Smith
Description:
> Hello,
> management-server crashed again.

>> I've now modified the ndb_mgmd init script to enable core dumps
>> for the ndb_mgmd process by adding a "ulimit -c unlimited" to
>> the startup file and restarted the management server, so on the
>> next crash we should be able to get some more information on 
>> what happened from the core file in /var/lib/mysql-cluster

ok, we've got a core file now, the backtrace looks like this:

(gdb) Program terminated with signal 6, Aborted.

(gdb) bt
#0  0xb7e6683b in ??? from /lib/tls/libc.so.6
#1  0xb7e67fa2 in ??? from /lib/tls/libc.so.6
#2  0x080a022f in Vector<unsigned int>::operator[] ()
#3  0x08095553 in TransporterFacade::ThreadData::open ()
#4  0x08094872 in TransporterFacade::open ()
#5  0x080a7740 in SignalSender::SignalSender ()
#6  0x08080f4d in MgmtSrvr::sendVersionReq ()
#7  0x08080e4d in MgmtSrvr::versionNode ()
#8  0x08081e4d in MgmtSrvr::status ()
#9  0x08088172 in MgmApiSession::getStatus ()
#10 0x0808ac05 in Parser<MgmApiSession>::run ()
#11 0x08086465 in MgmApiSession::runSession ()
#12 0x080c8f0e in sessionThread_C ()
#13 0x080c189e in ndb_thread_wrapper ()
#14 0xb7fdbb63 in __nptl_setxid () from /lib/tls/libpthread.so.0
#15 0xb7f1618a in ruserpass () from /lib/tls/libc.so.6

the SIGABRT is probably thrown here:

mysql-4.1/ndb/include/util/Vector.hpp:70
   66: template<class T>
   67: T &
   68: Vector<T>::operator[](unsigned i){
   69:  if(i >= m_size)
* 70:    abort();
   71:  return m_items[i];
   72: }

and is caused by the first vector access in

mysql-4.1/ndb/src/ndbapi/TransporterFacade.cpp:1132
   1116: int
   1117: TransporterFacade::ThreadData::open(void* objRef,
   1118:                                     ExecuteFunction fun,
   1119:                                     NodeStatusFunction fun2)
   1120: {
   1121:   Uint32 nextFree = m_firstFree;
   1122:
   1123:   if(m_statusNext.size() >= MAX_NO_THREADS && nextFree == END_OF_LIST){
   1124:     return -1;
   1125:   }
   1126:
   1127:   if(nextFree == END_OF_LIST){
   1128:     expand(10);
   1129:     nextFree = m_firstFree;
   1130:   }
   1131:
* 1132:   m_firstFree = m_statusNext[nextFree];
   1133:
   1134:   Object_Execute oe = { objRef , fun };
   1135:
   1136:   m_statusNext[nextFree] = INACTIVE;
   1137:   m_objectExecute[nextFree] = oe;
   1138:   m_statusFunction[nextFree] = fun2;
   1139:
   1140:   return indexToNumber(nextFree);
   1141: }

looks to me as if the expand() silently fails?
(I didn't investigate any further at this point ...)
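
To make the suspected failure mode concrete, here is a minimal sketch, not the NDB source. It assumes expand() can fail without changing any state and that the caller drops its result, as open() appears to do above. SlotList, kEndOfList, allocation_fails and index_used_by_open are hypothetical names invented for this illustration, and the value chosen for kEndOfList is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sentinel; the real END_OF_LIST value is not shown in the report.
static const uint32_t kEndOfList = UINT32_MAX;

// Hypothetical free-list container, loosely shaped like the code quoted above.
struct SlotList {
  std::vector<uint32_t> next;        // next[i] = index of the free slot after i
  uint32_t first_free = kEndOfList;  // head of the free list
  bool allocation_fails = false;     // test hook that simulates a failed allocation

  // Appends n slots to the free list.
  // Returns 0 on success, -1 on failure (in which case nothing changes).
  int expand(uint32_t n) {
    if (allocation_fails || n == 0) return -1;
    const uint32_t base = static_cast<uint32_t>(next.size());
    for (uint32_t i = 0; i < n; ++i) next.push_back(base + i + 1);
    next.back() = first_free;  // last new slot links to the old head
    first_free = base;
    return 0;
  }

  // Same control flow as the quoted open(), with expand()'s result ignored.
  // Returns the index that would then be handed to operator[].
  uint32_t index_used_by_open() {
    uint32_t next_free = first_free;
    if (next_free == kEndOfList) {
      expand(10);  // return value dropped, as in the quoted code
      next_free = first_free;
    }
    return next_free;
  }

  bool in_range(uint32_t i) const { return i < next.size(); }
};
```

With allocation_fails set, index_used_by_open() returns kEndOfList, which is out of range for the empty vector. A bounds-checked operator[] like the one in Vector.hpp would abort on that index, which matches the shape of the backtrace.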

How to repeat:
look for a blue moon...

Suggested fix:
don't crash.
[13 Sep 2006 9:23] Jonas Oreland
Decreasing priority per discussion with Stewart.
Basic reason: this has never been observed to happen,
even if it's theoretically possible.
[3 Nov 2006 12:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14803

ChangeSet@1.2277, 2006-11-03 23:56:25+11:00, stewart@willster.(none) +1 -0
  BUG#22299 mgmd crash due to unchecked TransporterFacade::ThreadData expand()
  
  abort if we ever fail to expand a Vector
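
  As a rough illustration of what "abort if we ever fail to expand" can look like, here is a sketch under stated assumptions. It is not the committed patch. GrowableSlots and expand_or_abort are hypothetical names, and the failure is simulated with a fixed item limit rather than a real allocation error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical growable array that reports failure through its return value.
struct GrowableSlots {
  std::vector<uint32_t> items;
  std::size_t max_items;  // simulated limit standing in for allocation failure

  explicit GrowableSlots(std::size_t max) : max_items(max) {}

  // Returns 0 on success, -1 if the simulated limit has been reached.
  int push_back(uint32_t v) {
    if (items.size() >= max_items) return -1;
    items.push_back(v);
    return 0;
  }
};

// Appends n free-list links and aborts at the first failed push_back,
// so a failed expansion stops at its cause instead of at a later lookup.
void expand_or_abort(GrowableSlots& s, uint32_t n) {
  const uint32_t base = static_cast<uint32_t>(s.items.size());
  for (uint32_t i = 0; i < n; ++i) {
    if (s.push_back(base + i + 1) != 0) {
      std::fprintf(stderr, "expand failed after %u of %u entries\n", i, n);
      std::abort();
    }
  }
}
```

  The process still terminates on failure, but the core file then points at the expansion step rather than at an unrelated index operation.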
[8 Nov 2006 6:10] Stewart Smith
pushed to 5.1-ndb
[4 Dec 2006 8:31] Martin Skold
Pushed to 5.1.14
[29 Dec 2006 0:36] Stewart Smith
pushed to 5.0-ndb
[29 Dec 2006 8:18] Stewart Smith
pushed to 5.0.34
[3 Jan 2007 3:32] Jon Stephens
I don't see anything here that affects end users directly; closing w/o further action at this time.