Bug #44047 Cluster API node, All threads stuck in lock wait
Submitted: 2 Apr 2009 16:01
Modified: 27 Dec 2009 9:23
Reporter: Brown Casey
Status: No Feedback
Category: MySQL Server: Locking
Severity: S1 (Critical)
Version: mysql-5.1.30 ndb-6.3.20
OS: Linux (Ubuntu 8.04, EC2)
Assigned to:
CPU Architecture: Any
Tags: cluster, lock wait

[2 Apr 2009 16:01] Brown Casey
I have a MySQL Cluster install with 2 NDB nodes and 4 API/SQL nodes.

Sometimes an API node (not specific to one server, happens on all) will get stuck.

Connecting with the command-line mysql client hangs after the connection is negotiated.

No errors are logged.

I attached to the mysqld process with GDB and found that all threads were stuck waiting on a mutex lock (example below). The query each thread was hung on varied, from simple inserts to complex selects with joins/subqueries.

Thread 105 (Thread 0xa5b1fb90 (LWP 30918)):
#0  0x00ee0d94 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00edc837 in _L_lock_1038 () from /lib/libpthread.so.0
#2  0x00edc797 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x085a6450 in my_pthread_fastmutex_lock (mp=0x88702bc) at thr_mutex.c:464
#4  0x0836d41d in Query_cache::send_result_to_client (this=0x8870260, thd=0xa4eae5c8,
    sql=0x9366580 "<redacted>"..., query_length=1185) at sql_cache.cc:1230
#5  0x08256046 in mysql_parse (thd=0xa4eae5c8,
    inBuf=0x9366580 "<redacted>"..., length=1185, found_semicolon=0xa5b1f32c) at sql_parse.cc:5745
#6  0x082570d8 in dispatch_command (command=COM_QUERY, thd=0xa4eae5c8,
    packet=0xa5c8f931 "<redacted>"..., packet_length=1185) at sql_parse.cc:1200
#7  0x08257c60 in do_command (thd=0xa4eae5c8) at sql_parse.cc:857
#8  0x082488b3 in handle_one_connection (arg=0xa4eae5c8) at sql_connect.cc:1115
#9  0x00ed9fda in start_thread () from /lib/libpthread.so.0
#10 0xa5b1f480 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

All 100+ threads appeared as above, except for a few that appeared to be maintenance/control threads.

After a second look I noticed thread 2 has the following additional information:
#15 0x0886f73c in stderror_file ()
#16 0xa3eb3850 in ?? ()

Note: this only affects one API node, the rest of the cluster still functions.

Currently the only workaround is to kill -9 mysqld.

How to repeat:
Issue is intermittent.
Unable to reliably reproduce at this time. (Suggestions welcome!)
The issue does not require a system under load to happen.

Suggested fix:
Time out or otherwise report an error if the API node becomes deadlocked.
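
The suggested fix amounts to putting a bound on the lock wait. A minimal sketch of that idea in C, assuming plain POSIX threads, might look like the following. It is not MySQL code and does not reflect how the server's query cache mutex (taken through my_pthread_fastmutex_lock in the backtrace above) is implemented; lock_with_timeout is a hypothetical helper name, and logging to stderr is an assumption made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not part of MySQL): try to take `mutex`, giving up
 * after `timeout_ms` milliseconds.
 * Returns 0 on success, ETIMEDOUT if the deadline passed, or another
 * error number on failure. The mutex is held only when 0 is returned. */
int lock_with_timeout(pthread_mutex_t *mutex, long timeout_ms)
{
    struct timespec deadline;
    int rc;

    assert(mutex != NULL);
    assert(timeout_ms >= 0);

    /* pthread_mutex_timedlock expects an absolute CLOCK_REALTIME deadline,
     * so start from the current time and add the timeout. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;

    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        /* Carry overflowing nanoseconds into the seconds field. */
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    rc = pthread_mutex_timedlock(mutex, &deadline);
    if (rc == ETIMEDOUT)
        fprintf(stderr, "lock wait exceeded %ld ms; possible deadlock\n",
                timeout_ms);
    return rc;
}
```

A caller would treat ETIMEDOUT as a reason to return an error to the client instead of blocking indefinitely, which is the behaviour requested above.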
[7 Apr 2009 14:27] Valeriy Kravchuk
Thank you for the problem report. Please send the my.cnf from this node. Try disabling the query cache and check whether the problem is still repeatable after that.
[7 Apr 2009 15:17] Brown Casey
query_cache_size set to 0

After some research, I will probably keep that setting, as our app is very write-heavy and the cache is likely invalidated 99% of the time.

I will post if the issue shows up again.
[7 Apr 2009 15:20] Valeriy Kravchuk
Thank you. Please let us know the results of your testing without the query cache.
[7 Apr 2009 15:26] Brown Casey
I should also mention that I rebuilt from the latest source last week and am now running mysql-5.1.32 ndb-6.3.23.
[24 Apr 2009 20:30] Brown Casey
The issue appears to have been related to the query cache.

I have encountered no hung SQL nodes and no segfaults since disabling it.
[27 Nov 2009 9:23] Valeriy Kravchuk
So I think this problem was more a result of misconfiguration than of any bug in MySQL code. Do you agree?
[30 Nov 2009 15:30] Casey Brown
That depends on whether you expect the query cache to hang or segfault with high volumes of invalidations.

In retrospect, I don't mind.  It was a very efficient way of telling me that what I was doing was wrong.
[28 Dec 2009 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".