MySQL Bugs: #76821: WatchDog detects no progress in send thread and kills datanode

Bug #76821	WatchDog detects no progress in send thread and kills datanode
Submitted:	24 Apr 2015 10:36	Modified:	15 May 2015 9:04
Reporter:	Ole John Aske
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	7.2.21	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When configured with multiple send threads we can get into a situation where:

1) Send thread x locks the send buffer mutex for node n.
   It then calls perform_send and may spend considerable
   time inside that call (high send load, lack of available send
   buffers, slow communication, etc.).

2) Send thread y tries to send to the same node n.
   It first grabs the global send_thread_mutex
   and holds it while trying to lock the send buffer
   mutex already held by thread x.

3a) Any other send thread will be blocked when it
    tries to lock the global send_thread_mutex, and
    thus all further sends, even to non-contended
    destination nodes, are stalled.

3b) Any worker thread that tries to wake a send thread
    will be blocked when ::alert_send_thread() tries
    to lock the global send mutex, and all work stalls.
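
The blocking chain in steps 1 to 3b can be sketched as a small simulation. This is not NDB code: it only models the pattern of holding a global lock while waiting for a per-node lock. The names demo_stall, global_mutex and node_mutex are hypothetical stand-ins for the mutexes named above, and the worker uses a timed lock attempt so that the demo terminates instead of hanging.

```python
import threading

def demo_stall(worker_timeout_s=0.2):
    """Model steps 1-3b; returns (stalled, recovered) as seen by a worker."""
    global_mutex = threading.Lock()  # stand-in for the global send_thread_mutex
    node_mutex = threading.Lock()    # stand-in for node n's send buffer mutex
    x_holds_node = threading.Event()
    y_holds_global = threading.Event()
    x_may_finish = threading.Event()

    def send_thread_x():
        # Step 1: lock node n's send buffer and stay in a slow "perform_send".
        with node_mutex:
            x_holds_node.set()
            x_may_finish.wait()

    def send_thread_y():
        # Step 2: take the global mutex, then block on the node mutex.
        x_holds_node.wait()
        with global_mutex:
            y_holds_global.set()
            with node_mutex:
                pass

    def worker_can_alert():
        # Step 3b: a worker needs the global mutex to wake a send thread.
        got = global_mutex.acquire(timeout=worker_timeout_s)
        if got:
            global_mutex.release()
        return got

    tx = threading.Thread(target=send_thread_x)
    ty = threading.Thread(target=send_thread_y)
    tx.start()
    ty.start()
    y_holds_global.wait()
    stalled = not worker_can_alert()   # y still holds the global mutex
    x_may_finish.set()                 # x finishes its send; the chain unwinds
    tx.join()
    ty.join()
    recovered = worker_can_alert()
    return stalled, recovered
```

In this model the worker stays blocked for as long as thread x remains inside its send, even though the worker never touches node n.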

The WatchDog thread will detect this situation and warn
about it in the log if the blockage lasts for more than 100ms.
(However, shorter blockages may already degrade performance
before that point.)

2015-04-20 17:54:10 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=1004
2015-04-20 17:54:10 [ndbd] INFO     -- Watchdog: User time: 51916555  System time: 20693129
2015-04-20 17:54:10 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Performing Send elapsed=1104
2015-04-20 17:54:10 [ndbd] INFO     -- Watchdog: User time: 51916555  System time: 20693129
2015-04-20 17:54:10 [ndbd] WARNING  -- Ndb kernel thread 4 is stuck in: Performing Send elapsed=1104
2015-04-20 17:54:10 [ndbd] INFO     -- Watchdog: User time: 51916555  System time: 20693129
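
As a rough illustration of the kind of check the watchdog performs, the sketch below reports every thread whose current activity has lasted longer than a warning threshold. It is a simplified model, not the NDB watchdog: the function name find_stuck_threads and the layout of its input are assumptions made for this example. Only the 100ms threshold and the "Performing Send" activity label come from the report above.

```python
def find_stuck_threads(activity_by_thread, now_ms, warn_after_ms=100):
    """Return (thread_id, activity, elapsed_ms) for threads over the threshold.

    activity_by_thread maps a thread id to (activity, started_ms), where
    started_ms is when the thread last reported progress (hypothetical layout).
    """
    stuck = []
    for thread_id in sorted(activity_by_thread):
        activity, started_ms = activity_by_thread[thread_id]
        elapsed_ms = now_ms - started_ms
        if elapsed_ms > warn_after_ms:
            stuck.append((thread_id, activity, elapsed_ms))
    return stuck


# Example input: three threads blocked in a send, one making progress.
threads = {
    2: ("Performing Send", 1000),
    3: ("Performing Send", 900),
    4: ("Performing Send", 900),
    5: ("Executing", 1990),
}
for thread_id, activity, elapsed_ms in find_stuck_threads(threads, now_ms=2004):
    print(f"thread {thread_id} is stuck in: {activity} elapsed={elapsed_ms}")
```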

How to repeat:
Can likely be reproduced with a config with multiple send threads and a load
sufficient to saturate the available network capacity.

Suggested fix:
WL#7654 fixed this issue in the 'part 3 of 3' patch. That WL was intentionally only aimed at improving performance in 7.4.

We suggest backporting that part of the patch as a fix for this bug.

Posted by developer:
 
Note to doc:
This issue was already fixed in 7.4 and is *backported* to 7.2 and later by this fix.

Documented fix in the NDB 7.2.21, 7.3.10, 7.4.7, and 7.5.0 changelogs, as follows:

    Previously, multiple send threads could be invoked for handling
    sends to the same node; these threads then competed for the same
    send lock. While the send lock blocked the additional send
    threads, work threads could be passed to other nodes.

    This issue is fixed by ensuring that new send threads are not
    activated while there is already an active send thread assigned
    to the same node. In addition, a node already having an active
    send thread assigned to it is no longer visible to other,
    already active, send threads; that is, such a node is no longer
    added to the node list when a send thread is currently assigned
    to it.
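
The rule described in the changelog entry above can be sketched as a small bookkeeping model. This is an illustration under stated assumptions, not the WL#7654 patch: the class SendScheduler and its methods alert, take_node and done are hypothetical, and the model is single-threaded. In a real implementation each method would presumably run under a global mutex held only briefly, never across the send itself.

```python
from collections import deque

class SendScheduler:
    """Hypothetical model: at most one send thread per destination node."""

    def __init__(self):
        self.pending = deque()   # nodes visible to send threads looking for work
        self.active = set()      # nodes currently assigned to a send thread
        self.more_data = set()   # active nodes that received new data meanwhile

    def alert(self, node):
        """A worker has data for node; return True if a send thread should wake."""
        if node in self.active:
            # Do not activate another send thread for a node already being served.
            self.more_data.add(node)
            return False
        if node not in self.pending:
            self.pending.append(node)
        return True

    def take_node(self):
        """A send thread picks a node, which then becomes invisible to others."""
        if not self.pending:
            return None
        node = self.pending.popleft()
        self.active.add(node)
        return node

    def done(self, node):
        """The send thread has finished with node; requeue it if data arrived."""
        self.active.discard(node)
        if node in self.more_data:
            self.more_data.discard(node)
            self.pending.append(node)
```

With this bookkeeping a second send thread never waits on the send lock of a node that is already being served; it either picks a different node or finds no work.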

Closed.