Bug #42474 Requests blocked in MySQL Cluster
Submitted: 30 Jan 2009 12:40  Modified: 17 Feb 2009 8:03
Reporter: Guido Ostkamp
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S1 (Critical)
Version: 6.4.2  OS: Solaris
Assigned to: Assigned Account  CPU Architecture: Any

[30 Jan 2009 12:40] Guido Ostkamp
Description:
Hello,

during tests with MySQL Cluster we found that our
test applications, which perform 'insert' and 'delete'
requests on a table on both nodes of a two-node cluster,
suddenly became stuck during a nightly run.

---
mysql_stmt_execute failed, errno 1296
Got error 4008 'Receive from NDB failed' from NDBCLUSTER
mysql_stmt_execute failed, errno 1296
Got error 4008 'Receive from NDB failed' from NDBCLUSTER
mysql_stmt_execute failed, errno 1296
---

Also, after stopping all test applications we have not been
able to access the table anymore through the 'mysql' command
interface:

mysql> delete from scscf_registrations where reg_id = 1;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

mysqld and ndbmtd were still running on both nodes.

After killing and restarting mysqld on the first node,
the above statement worked again.

So it seems that mysqld's internal locks became corrupted.

We have not found any hint of problems in the NDB and mysql logs.

We are using revid jonathan.miller@sun.com-20090128150430-4gcp8g81gky8bgkg dated
2009-01-28 from the mysql-5.1-telco-6.4 repo, compiled with Solaris Workshop 12 as follows:

CC=cc CXX=CC CFLAGS="-xO5 -fast -g -mt -m64" CXXFLAGS="-xO5 -fast -g -mt -m64" ./configure --prefix=<somepath> --with-plugins=all --without-docs --without-man --with-ndbmtd

For our DB configuration, please refer to the debug data of http://bugs.mysql.com/bug.php?id=42450.

Regards

Guido Ostkamp

How to repeat:
Run insert/delete traffic on both nodes for a long time (see above).
[30 Jan 2009 13:22] Bernd Ocklin
Hi,

The 4008 errors you are getting are timeouts. Can you send your log files? Did you run the system under high load?

Bernd
[30 Jan 2009 14:06] Bernd Ocklin
By the way, next time you get this it would be interesting to get the following information, to see where mysqld might be hanging (if it is hanging):

a) run "show full processlist" from the mysql shell.
b) you can also connect to mysqld with gdb and run "thread apply all bt"
c) run kill -6 on the mysqld process
[30 Jan 2009 16:09] Guido Ostkamp
Requested data has been uploaded to bug-data-42474.tar.bz2 on the FTP server.

Please have a look at the files processlist and debugger.

Regards

Guido Ostkamp
[30 Jan 2009 16:28] Guido Ostkamp
Add-on:

Regarding "high load": We are running several processes in parallel on both nodes.
[5 Feb 2009 11:26] Bernd Ocklin
Looking at the queries executed and their status, it seems they are waiting for the UPDATE statement to finish (even though we only lock 32 or 64 rows at a time). Given the DELETE and the other statements executed in parallel on the same table, I think it is easy to run into deadlocks or lock-release wait scenarios. The deadlock detection timeout defaults to 10 secs and could kick in. Does that sound reasonable? Are you getting locks with other storage engines?
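
If the timeout is what is kicking in, it can be raised on the data nodes. A minimal config.ini sketch, assuming the [ndbd default] section of the cluster configuration (the value is in milliseconds and purely illustrative):

[ndbd default]
# time a transaction may wait for a lock before NDB aborts it
# as a suspected deadlock (milliseconds; illustrative value)
TransactionDeadlockDetectionTimeout=10000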
[5 Feb 2009 12:57] Guido Ostkamp
Hello,

the statements we execute in our tests are simple, and we commit immediately after each statement (it could be that some triggers are fired). This should never lead to a deadlock situation that runs into a 10 secs timeout.

Furthermore, as already mentioned in the bug report, we have not been able to access the table anymore, even _after_ killing all test applications (killing them should release all locks held).

We are not using any storage engine other than NDBCLUSTER because of HA requirements, and thus cannot tell whether the problem exists with other storage engines.

Regards

Guido Ostkamp
[5 Feb 2009 13:23] Henrik Ingo
Hi Guido

While I'm not officially the one working on this bug, just a note on something that sounds familiar...

In mysql, if you connect to the same mysqld that the client was executing against, you can do "show processlist". I've sometimes noticed that a query might be left running even after the client has disconnected. MySQL also allows you to kill a running query with "KILL <nr>", where you take the nr from the processlist output. But in some cases even this doesn't immediately kill the query - I suppose in the case of the NDB engine, mysqld is waiting for it to return from the ndb layer.
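
For example (a minimal sketch; the id 14 is hypothetical and would come from your own processlist output):

mysql> SHOW FULL PROCESSLIST;
mysql> KILL QUERY 14;  -- aborts only the statement connection 14 is running
mysql> KILL 14;        -- aborts the whole connection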

In my case this happened when I had forgotten to define an index, making a query try to scan 10M rows. This of course takes some time. Also in my case, restarting the mysqld node in question was a good workaround. The deadlock timeout doesn't interrupt that either; the query will run for hundreds of seconds if not killed. (I didn't file a bug myself, because in my case scanning 10M rows was an error on my part, but what happened seems similar to what you describe here.)

If this behavior seems similar to what you see, maybe my additional info is helpful to the engineers participating in this thread.
[5 Feb 2009 13:53] Jonas Oreland
Hi,

I've now looked at the traces and logs.
Would it be possible to get the schema (which looks quite simple),
*including* the triggers (which are showing up in the backtrace)?

Test data would also be helpful.

/Jonas

Note: updating status to Need Feedback
[5 Feb 2009 14:27] Jonathan Miller
Hi,

I was hitting this as well in performance testing. If memory serves me right, we reduced the number of log files and increased their sizes. In addition we added:

InitFragmentLogFiles=FULL
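
A config.ini sketch of that kind of change, assuming the [ndbd default] section (the values here are illustrative, not the exact ones we used):

[ndbd default]
NoOfFragmentLogFiles=8     # fewer redo log files...
FragmentLogFileSize=256M   # ...but each one larger
InitFragmentLogFiles=FULL  # fully initialize the log files at initial start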

/Jeb
[6 Feb 2009 8:08] Jonas Oreland
Hi Guido,

I find no indexes on table1.
Hence the triggers will have to perform full-table scans.

Is this
1) deliberate, or
2) a mistake when sending the schema? (If so, see the sketch below.)
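
A minimal sketch, assuming the trigger shown in the processlist output further down (UPDATE ... WHERE column1 = OLD.column1); the index name is hypothetical:

mysql> ALTER TABLE table1 ADD INDEX idx_column1 (column1);

With such an index the trigger's UPDATE can do an index lookup instead of scanning the whole table.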

/Jonas
[6 Feb 2009 8:24] Guido Ostkamp
Hello Jonas,

table1 has only 38 rows in our test environment, so not having an index should have nearly no negative impact.

Regards

Guido Ostkamp
[6 Feb 2009 8:49] Jonas Oreland
Output from show processlist (slightly edited for easier reading):

| 13 | root | localhost | test  | Query   |  321 | Locked   | select count(*) from TABLE0 where column0 < 111111
| 14 | root | localhost | test  | Execute |  427 | Locked   | INSERT INTO TABLE0 (...)
| 15 | root | localhost | test  | Execute |  427 | Locked   | DELETE FROM TABLE0 where column0 = ? 
| 16 | root | localhost | test  | Execute |  427 | Updating | UPDATE TABLE1 SET column21  =  column21 - 1 WHERE column1  =  OLD.column1 
| 17 | root | localhost | test  | Execute |  427 | Locked   | INSERT INTO TABLE0 (...)
| 21 | root | localhost | test  | Query   |    0 | NULL     | show full processlist
[6 Feb 2009 8:55] Jonas Oreland
Hi Guido again

How many rows does table0 have?
(Note that it does not have a primary key either.)

I can't find anything "special" in any log,
and am therefore trying to reproduce the scenario.

So,
1) if you have data, that would be nice,
   but a "minimal" description is the data distribution
   (i.e. number of rows in the tables, and a rough idea of the relationships)

2) if you have more detailed information on what your app is doing
   (how many clients/threads do which things)

   From the processlist output, I guess that
- one client/thread does the select count(*)
- two (or more) clients/threads do inserts into table0, likely on disjoint data
- one client/thread does deletes from table0

   Is this fairly accurate?

/Jonas
[17 Feb 2009 8:03] Jonas Oreland
This seems to have been resolved by the "trigger" fix,
i.e. don't use table locks when executing triggers.

This is a duplicate of bug#42751.