Bug #77081 Excessive Ndb kernel thread is stuck in: Performing Send; WatchDog terminate
Submitted: 19 May 2015 3:48 Modified: 26 May 2015 10:54
Reporter: Jesper wisborg Krogh Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:7.2.16 OS:Any
Assigned to: CPU Architecture:Any

[19 May 2015 3:48] Jesper wisborg Krogh
Description:
The cluster sees an excessive number of warnings like:

2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=16921
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183

(up to 1G in less than 17 hours per data node).

One data node is eventually killed by the watchdog:

Time: Monday 18 May 2015 - 11:27:29
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Performing Send
Error object: .../storage/ndb/src/kernel/vm/WatchDog.cpp
Program: ndbmtd
Pid: 83955
Version: mysql-5.5.37 ndb-7.2.16
Trace: /var/lib/mysql-cl

How to repeat:
N/A
[19 May 2015 3:58] Jesper wisborg Krogh
Posted by developer:
 
Cluster log for the crash:

..
2015-05-18 11:26:30 [MgmtSrvr] INFO     -- Node 6: Data usage is 35%(4333579 32K pages of total 12288000)
2015-05-18 11:26:30 [MgmtSrvr] INFO     -- Node 6: Index usage is 16%(1064960 8K pages of total 6553728)
2015-05-18 11:26:40 [MgmtSrvr] INFO     -- Node 2: Data usage is 35%(4328281 32K pages of total 12288000)
2015-05-18 11:26:40 [MgmtSrvr] INFO     -- Node 2: Index usage is 16%(1064769 8K pages of total 6553728)
2015-05-18 11:27:19 [MgmtSrvr] WARNING  -- Node 1: GCP Monitor: GCP_COMMIT lag 10 seconds (no max lag)
2015-05-18 11:27:29 [MgmtSrvr] WARNING  -- Node 1: GCP Monitor: GCP_COMMIT lag 20 seconds (no max lag)
2015-05-18 11:27:31 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 5: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 5: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 4: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 4: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 1: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 3: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 1: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 6: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 6: Communication to Node 2 closed
2015-05-18 11:27:31 [MgmtSrvr] ALERT    -- Node 100: Node 2 Disconnected
2015-05-18 11:27:31 [MgmtSrvr] ALERT    -- Node 1: Arbitration check won - node group majority
2015-05-18 11:27:31 [MgmtSrvr] INFO     -- Node 1: President restarts arbitration thread [state=6]
2015-05-18 11:27:31 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
...
[19 May 2015 4:00] Jesper wisborg Krogh
Posted by developer:
 
The data node out log:

...
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=16921
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=15802
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11688
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17022
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=15902
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11788
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17122
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16003
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11889
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17222
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16103
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11989
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17322
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16203
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12089
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17423
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16303
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12189
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17523
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16404
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12289
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17623
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16504
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12390
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17723
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16604
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12490
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17824
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16704
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:28 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12590
2015-05-18 11:27:28 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:29 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17924
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:29 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16805
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:29 [ndbd] WARNING  -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12690
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:29 [ndbd] WARNING  -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=18024
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog: User time: 290853392  System time: 103798183
2015-05-18 11:27:29 [ndbd] INFO     -- Performing Send
2015-05-18 11:27:29 [ndbd] INFO     -- .../storage/ndb/src/kernel/vm/WatchDog.cpp
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog restarting system
2015-05-18 11:27:29 [ndbd] INFO     -- Watchdog shutdown completed - exiting
2015-05-18 11:27:31 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2015-05-18 11:27:31 [ndbd] INFO     -- Ndb has terminated (pid 83955) restarting
2015-05-18 11:27:31 [ndbd] INFO     -- Angel reconnected to 'mgm-node-1:1186'
2015-05-18 11:27:36 [ndbd] INFO     -- Angel reallocated nodeid: 2
2015-05-18 11:27:36 [ndbd] INFO     -- Angel pid: 83954 started child: 59462
2015-05-18 11:27:36 [ndbd] INFO     -- Configuration fetched from 'mgm-node-1:1186', generation: 1
ThreadConfig: input: main,ldm,ldm,ldm,ldm,recv,recv,rep,tc,tc,tc,tc,send,send LockExecuteThreadToCPU:  => parsed: main,ldm,ldm,ldm,ldm,recv,recv,rep,io={cpubind=0},watchdog={cpubind=0},tc,tc,tc,tc,send,send
NDBMT: MaxNoOfExecutionThreads=14
NDBMT: workers=4 threads=4 tc=4 send=2 receive=2
2015-05-18 11:27:36 [ndbd] INFO     -- NDB Cluster -- DB node 2
2015-05-18 11:27:36 [ndbd] INFO     -- mysql-5.5.37 ndb-7.2.16 --
...
[26 May 2015 10:54] Jon Stephens
Documented fix in the NDB 7.2.21, 7.3.10, 7.4.7, and 7.5.0 changelogs, as follows:
    It was possible to end up with a lock on the send buffer mutex
    when send buffers became a limiting resource, due iether to
    insufficient send buffer resource configuration, problems with
    slow or failing communications such that all send buffers became
    exhausted, or slow receivers failing to consume what was sent.
    In this situation worker threads failed to allocate send buffer
    memory for signals, and attempted to force a send in order to
    free up space, while at the same time the send thread was busy
    trying to send to the same node or nodes. All of these threads
    competed for taking the send_buffer mutex, which resulted in the
    lock already described, reported by the watchdog as Stuck in
    Send. This fix is in two parts:

    1. The send thread no longer holds the global send thread mutex
    while getting the send buffer mutex; it now releases the global
    mutex prior to locking the send buffer mutex. This keeps worker
    threads from getting stuck in send in such cases.

    2. Locking of the send buffer mutex done by the send treads now
    uses a try-lock. If the try-lock fails, the node to make the
    send to is reinserted at the end of the list of send nodes in
    order to be retried later. This removes the Stuck in Send
    condition for the send threads.

Closed.