| Bug #76821 | WatchDog detects no progress in send thread and kills datanode | ||
|---|---|---|---|
| Submitted: | 24 Apr 2015 10:36 | Modified: | 15 May 2015 9:04 |
| Reporter: | Ole John Aske | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
| Version: | 7.2.21 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[27 Apr 2015 7:30]
Ole John Aske
Posted by developer: Note to doc: This issue was already fixed in 7.4 and is *backported* to 7.2 and later by this fix.
[15 May 2015 9:04]
Jon Stephens
Documented fix in the NDB 7.2.21, 7.3.10, 7.4.7, and 7.5.0 changelogs, as follows:
Previously, multiple send threads could be invoked for handling
sends to the same node; these threads then competed for the same
send lock. While the send lock blocked the additional send
threads, work threads could be passed to other nodes.
This issue is fixed by ensuring that new send threads are not
activated while there is already an active send thread assigned
to the same node. In addition, a node already having an active
send thread assigned to it is no longer visible to other,
already active, send threads; that is, such a node is no longer
added to the node list when a send thread is currently assigned
to it.
Closed.
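
The scheduling rule described in the changelog entry above can be sketched as a small model. This is not the NDB code: the class `SendNodeQueue` and its methods `alert`, `take`, and `done` are hypothetical names chosen for illustration, and Python is used only for brevity. The sketch assumes a ready list from which send threads pick nodes, and shows a node being hidden from that list for as long as one send thread owns it.

```python
import threading
from collections import deque


class SendNodeQueue:
    """Hypothetical model: a node is owned by at most one send thread at a time."""

    def __init__(self):
        self._lock = threading.Lock()  # held only briefly, never during a send
        self._ready = deque()          # nodes with data and no assigned send thread
        self._queued = set()           # membership check for _ready
        self._active = set()           # nodes currently assigned to a send thread
        self._pending = set()          # active nodes that got more data meanwhile

    def alert(self, node):
        """Worker side. Returns True if a send thread should be woken."""
        with self._lock:
            if node in self._active:
                # Already being served: remember the new data, but keep the
                # node invisible to other send threads.
                self._pending.add(node)
                return False
            if node in self._queued:
                return False
            self._ready.append(node)
            self._queued.add(node)
            return True

    def take(self):
        """Send thread side. Returns a node to serve, or None if none is ready."""
        with self._lock:
            if not self._ready:
                return None
            node = self._ready.popleft()
            self._queued.discard(node)
            self._active.add(node)
            return node

    def done(self, node):
        """Send thread side. Returns True if the node was queued again."""
        with self._lock:
            self._active.discard(node)
            if node in self._pending:
                self._pending.discard(node)
                self._ready.append(node)
                self._queued.add(node)
                return True
            return False
```

In this model a second send thread calling `take()` never receives a node that is still being served, so it cannot end up waiting on that node's send lock.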

Description:
When configured with multiple send threads, we can get into a situation where:

1) Send thread x locks the send buffer mutex for node n. It then calls perform_send and may spend considerable time inside that call (high send load, lack of available send buffers, slow communication, and so on).
2) Send thread y tries to send to the same node n. It first grabs the global send_thread_mutex and holds it while trying to lock the send buffer mutex already held by thread x.
3a) Any other send threads are blocked when they try to lock the global send_thread_mutex, so all further sends stall, even those to uncontended destination nodes.
3b) Any worker thread that tries to wake a send thread is blocked when ::alert_send_thread() tries to lock the global send_thread_mutex, and all work stalls.

The WatchDog thread detects this situation and warns about it in the log if the blockage lasts for more than 100ms. (However, shorter blockages may still degrade performance before that point.)

2015-04-20 17:54:10 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=1004
2015-04-20 17:54:10 [ndbd] INFO -- Watchdog: User time: 51916555 System time: 20693129
2015-04-20 17:54:10 [ndbd] WARNING -- Ndb kernel thread 3 is stuck in: Performing Send elapsed=1104
2015-04-20 17:54:10 [ndbd] INFO -- Watchdog: User time: 51916555 System time: 20693129
2015-04-20 17:54:10 [ndbd] WARNING -- Ndb kernel thread 4 is stuck in: Performing Send elapsed=1104
2015-04-20 17:54:10 [ndbd] INFO -- Watchdog: User time: 51916555 System time: 20693129

How to repeat:
Can likely be reproduced with a configuration that has multiple send threads and a load sufficient to saturate the available network performance.

Suggested fix:
WL#7654 fixed this issue in the 'part 3 of 3' patch. That WL was intentionally aimed only at improving performance in 7.4. We suggest backporting that part of the patch as a fix for this bug.
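
The blocking pattern in steps 1) to 3b) of the description can be reproduced in a toy model. The sketch below is not NDB code: `simulate` and all of its locals are hypothetical, `time.sleep` stands in for a slow perform_send, and Python is used only for brevity. It measures how long a worker waits for the global mutex when send thread y either keeps that mutex while waiting for node 1's lock, or releases it first.

```python
import threading
import time


def simulate(hold_global_while_waiting, send_time=0.5):
    """Return the seconds a worker waits for the global mutex (toy model)."""
    global_mutex = threading.Lock()   # stands in for the global send thread mutex
    node1_mutex = threading.Lock()    # stands in for node 1's send buffer mutex
    x_has_lock = threading.Event()
    y_chose_node = threading.Event()

    def send_thread_x():
        with node1_mutex:
            x_has_lock.set()
            time.sleep(send_time)     # slow send to node 1

    def send_thread_y():
        x_has_lock.wait()
        if hold_global_while_waiting:
            with global_mutex:        # step 2: global mutex held...
                y_chose_node.set()
                with node1_mutex:     # ...while blocked on x's node mutex
                    pass
        else:
            with global_mutex:        # global mutex released before waiting
                y_chose_node.set()
            with node1_mutex:
                pass

    tx = threading.Thread(target=send_thread_x)
    ty = threading.Thread(target=send_thread_y)
    tx.start()
    ty.start()

    y_chose_node.wait()
    start = time.monotonic()
    with global_mutex:                # step 3b: worker tries to wake a send thread
        pass
    stall = time.monotonic() - start

    tx.join()
    ty.join()
    return stall


if __name__ == "__main__":
    print("global mutex held while waiting: %.3fs" % simulate(True))
    print("global mutex released first:     %.3fs" % simulate(False))
```

With the global mutex held across the wait, the worker's stall is roughly the duration of x's send, even though the worker has no interest in node 1. Releasing the global mutex first removes that stall in this model; the actual fix goes further and avoids assigning two send threads to the same node at all.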