Bug #77081 | Excessive Ndb kernel thread is stuck in: Performing Send; WatchDog terminate | ||
---|---|---|---|
Submitted: | 19 May 2015 3:48 | Modified: | 26 May 2015 10:54 |
Reporter: | Jesper wisborg Krogh | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 7.2.16 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[19 May 2015 3:48]
Jesper wisborg Krogh
[19 May 2015 3:58]
Jesper wisborg Krogh
Posted by developer: Cluster log for the crash: .. 2015-05-18 11:26:30 [MgmtSrvr] INFO -- Node 6: Data usage is 35%(4333579 32K pages of total 12288000) 2015-05-18 11:26:30 [MgmtSrvr] INFO -- Node 6: Index usage is 16%(1064960 8K pages of total 6553728) 2015-05-18 11:26:40 [MgmtSrvr] INFO -- Node 2: Data usage is 35%(4328281 32K pages of total 12288000) 2015-05-18 11:26:40 [MgmtSrvr] INFO -- Node 2: Index usage is 16%(1064769 8K pages of total 6553728) 2015-05-18 11:27:19 [MgmtSrvr] WARNING -- Node 1: GCP Monitor: GCP_COMMIT lag 10 seconds (no max lag) 2015-05-18 11:27:29 [MgmtSrvr] WARNING -- Node 1: GCP Monitor: GCP_COMMIT lag 20 seconds (no max lag) 2015-05-18 11:27:31 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 3: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 5: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 5: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 4: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 4: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 1: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 3: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 1: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 6: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 6: Communication to Node 2 closed 2015-05-18 11:27:31 [MgmtSrvr] ALERT -- Node 100: Node 2 Disconnected 2015-05-18 11:27:31 [MgmtSrvr] ALERT -- Node 1: Arbitration check won - node group majority 2015-05-18 11:27:31 [MgmtSrvr] INFO -- Node 1: President restarts arbitration thread [state=6] 2015-05-18 11:27:31 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. ...
[19 May 2015 4:00]
Jesper wisborg Krogh
Posted by developer: The data node out log: ... 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=16921 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=15802 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11688 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17022 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=15902 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11788 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17122 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16003 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11889 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17222 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16103 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=11989 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17322 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16203 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12089 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17423 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16303 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12189 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17523 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16404 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12289 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17623 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16504 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12390 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17723 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16604 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12490 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17824 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16704 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:28 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12590 2015-05-18 11:27:28 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:29 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=17924 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:29 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Performing Send elapsed=16805 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:29 [ndbd] WARNING -- Ndb kernel thread 6 is stuck in: Performing Send elapsed=12690 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:29 [ndbd] WARNING -- Ndb kernel thread 13 is stuck in: Performing Send elapsed=18024 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog: User time: 290853392 System time: 103798183 2015-05-18 11:27:29 [ndbd] INFO -- Performing Send 2015-05-18 11:27:29 [ndbd] INFO -- .../storage/ndb/src/kernel/vm/WatchDog.cpp 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog restarting system 2015-05-18 11:27:29 [ndbd] INFO -- Watchdog shutdown completed - exiting 2015-05-18 11:27:31 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 6050: 'WatchDog terminate, internal error or massive overload on the machine running this node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 2015-05-18 11:27:31 [ndbd] INFO -- Ndb has terminated (pid 83955) restarting 2015-05-18 11:27:31 [ndbd] INFO -- Angel reconnected to 'mgm-node-1:1186' 2015-05-18 11:27:36 [ndbd] INFO -- Angel reallocated nodeid: 2 2015-05-18 11:27:36 [ndbd] INFO -- Angel pid: 83954 started child: 59462 2015-05-18 11:27:36 [ndbd] INFO -- Configuration fetched from 'mgm-node-1:1186', generation: 1 ThreadConfig: input: main,ldm,ldm,ldm,ldm,recv,recv,rep,tc,tc,tc,tc,send,send LockExecuteThreadToCPU: => parsed: main,ldm,ldm,ldm,ldm,recv,recv,rep,io={cpubind=0},watchdog={cpubind=0},tc,tc,tc,tc,send,send NDBMT: MaxNoOfExecutionThreads=14 NDBMT: workers=4 threads=4 tc=4 send=2 receive=2 2015-05-18 11:27:36 [ndbd] INFO -- NDB Cluster -- DB node 2 2015-05-18 11:27:36 [ndbd] INFO -- mysql-5.5.37 ndb-7.2.16 -- ...
[26 May 2015 10:54]
Jon Stephens
Documented fix in the NDB 7.2.21, 7.3.10, 7.4.7, and 7.5.0 changelogs, as follows: It was possible to end up with a lock on the send buffer mutex when send buffers became a limiting resource, due iether to insufficient send buffer resource configuration, problems with slow or failing communications such that all send buffers became exhausted, or slow receivers failing to consume what was sent. In this situation worker threads failed to allocate send buffer memory for signals, and attempted to force a send in order to free up space, while at the same time the send thread was busy trying to send to the same node or nodes. All of these threads competed for taking the send_buffer mutex, which resulted in the lock already described, reported by the watchdog as Stuck in Send. This fix is in two parts: 1. The send thread no longer holds the global send thread mutex while getting the send buffer mutex; it now releases the global mutex prior to locking the send buffer mutex. This keeps worker threads from getting stuck in send in such cases. 2. Locking of the send buffer mutex done by the send treads now uses a try-lock. If the try-lock fails, the node to make the send to is reinserted at the end of the list of send nodes in order to be retried later. This removes the Stuck in Send condition for the send threads. Closed.