Bug #68010 memcached crashes
Submitted: 2 Jan 2013 10:41
Reporter: Davide F
Status: Open
Category: MySQL Cluster: Memcached
Severity: S1 (Critical)
Version: 7.2.8
OS: Linux
Assigned to:
CPU Architecture: Any
Tags: cluster, crash, memcached, ndb

[2 Jan 2013 10:41] Davide F
Description:
I'm experiencing random crashes of the memcached daemon (the one shipped with the MySQL distribution) when using NDB as a backend. The cluster stays up the whole time and the SQL nodes keep working, but the memcached nodes crash. The only output I get is this:

NDB Temporary Error 266: Time-out in NDB, probably caused by deadlock 
tx: Cluster Failure 
memcached: /pb2/build/sb_0-6682536-1345655787.21/mysql-cluster-gpl-7.2.8/storage/ndb/memcache/src/ndb_worker.cc:407: op_status_t WorkerStep1::do_write(): Assertion `false' failed.

(and sometimes I don't even get this output)

How to repeat:
The bug is quite random. I'm writing about 130 bytes of data on average, with an expire time. I'm using the default "demo_table" ndbmemcache schema; I've just added the expire field to this table as well. Memcached receives about 10 new SETs per second.
[5 Aug 2013 18:15] Hartmut Holzgraefe
This only happens with binaries built with debug enabled, right?

The code is:

 [...]
 402   /* Start the transaction */
 403   tx = op.startTransaction(wqitem->ndb_instance->db);
 404   if(! tx) {
 405     logger->log(LOG_WARNING, 0, "tx: %s \n",
 406                 wqitem->ndb_instance->db->getNdbError().message);
>407     DEBUG_ASSERT(false);
 408   }
 409 
 410   if(wqitem->base.verb == OPERATION_REPLACE) {
 411     DEBUG_PRINT(" [REPLACE] \"%.*s\"", wqitem->base.nkey, wqitem->key);
 412     ndb_op = op.updateTuple(tx);
 [...]

So a transaction is started, which can fail, e.g. due to timeouts (as seen here) or because MaxNoOfConcurrentTransactions is exceeded, and the only error handling is to throw an assertion failure when in debug mode?

Otherwise the now-invalid tx handle is used in the follow-up instructions as if nothing had happened, and the resulting operation failures are only handled much later ...

Not really my idea of error handling. Why not simply return op_failed
early when a new cluster transaction cannot be started, instead of
throwing an assertion failure?
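
A minimal sketch of what such an early return could look like. This is not the actual ndb_worker.cc code: MockDb, Transaction, the op_prepared value and the free function do_write() below are hypothetical stand-ins so the example compiles on its own. Only the names op_status_t and op_failed are taken from the report above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the real status type; only op_failed is
// named in this report, op_prepared is an assumption for illustration.
enum op_status_t { op_prepared, op_failed };

// Hypothetical stand-in for a cluster transaction handle.
struct Transaction {
    int id;
};

// Hypothetical stand-in for the database object. startTransaction()
// returns nullptr on failure, mirroring the "if(! tx)" check above.
struct MockDb {
    bool healthy;
    Transaction slot;
    const char *last_error;

    explicit MockDb(bool h) : healthy(h), slot{1}, last_error("") {}

    Transaction *startTransaction() {
        if (!healthy) {
            last_error = "Cluster Failure";
            return nullptr;
        }
        return &slot;
    }
};

// Sketch of the write path: log and bail out as soon as the transaction
// cannot be started, so the null handle is never dereferenced later.
op_status_t do_write(MockDb &db) {
    Transaction *tx = db.startTransaction();
    if (!tx) {
        std::fprintf(stderr, "tx: %s\n", db.last_error);
        return op_failed;  // early return instead of asserting
    }

    // From here on tx is known to be valid.
    (void)tx->id;
    return op_prepared;
}
```

With this shape, the failure is reported to the caller in both debug and release builds, and none of the follow-up code ever sees an invalid handle.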