| Bug #83785 | Regress of bug #18229003 - Job buffer full crashes node with signal 6 | ||
|---|---|---|---|
| Submitted: | 11 Nov 2016 8:21 | Modified: | 11 Aug 2017 19:14 |
| Reporter: | Сергей Кукуев | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
| Version: | mysql-5.7.16 ndb-7.5.4; mysql-5.6.28 ndb-7.4.10 | OS: | Oracle Linux (2.6.32-300.3.1.el6uek.x86_64) |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | 18229003, error 6000, job buffer full, regress, signal 6 | ||
[11 Nov 2016 8:29]
Сергей Кукуев
Logs and traces collected by ndb_error_reporter uploaded to sftp.oracle.com filename: mysql-bug-83785-ndb_error_report.tar.bz2
[11 Nov 2016 8:52]
Сергей Кукуев
Reproduces on mysql-5.7.16 ndb-7.5.4 and mysql-5.6.28 ndb-7.4.10
[12 Nov 2016 20:50]
MySQL Verification Team
Hi, I can verify this on 7.4.11 too. Not very hard to reproduce in resource low environment (virtual), I could not reproduce on bare metal. thanks for the report Bogdan Kecman
[3 May 2017 10:07]
Johan Andersson
Hi, Any news about this bug? Repeatedly running into this: 2017-04-25 15:10:27 [ndbd] INFO -- NDB start phase 4 completed 2017-04-25 15:10:27 [ndbd] INFO -- Start NDB start phase 5 (only to DBDIH) 2017-04-25 15:10:27 [ndbd] INFO -- Restore Database Off-line Starting job buffer full Dumping non-empty job queues: job buffer 0 --> 1, used 31 FULL! 2017-04-25 15:10:27 [ndbd] INFO -- Received signal 6. Running error handler. 2017-04-25 15:10:27 [ndbd] INFO -- Signal 6 received; Aborted 2017-04-25 15:10:27 [ndbd] INFO -- /export/home/pb2/build/sb_0-21747926-1483613043.11/mysql-cluster-gpl-7.4.14/storage/ndb/src/kernel/ndbd.cpp 2017-04-25 15:10:27 [ndbd] INFO -- Error handler signal shutting down system 2017-04-25 15:10:27 [ndbd] INFO -- Error handler shutdown completed - exiting 2017-04-25 15:10:45 [ndbd] ALERT -- Node 6: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 6000\ : 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[7 Jul 2017 4:13]
Jesper wisborg Krogh
Posted by developer: The issue reported during the restart may be a duplicate of Bug 26263397 which is fixed as of 7.3.19, 7.4.17, 7.5.8, and 7.6.4. This is particularly the case if there is a large total number of fragments (tables * no of fragments per table) in the cluster.
[11 Aug 2017 19:14]
MySQL Verification Team
Fixed in 7.3.19, 7.4.17, 7.5.8, and 7.6.4

Description: Seemed that we faced with bug #18229003 regress, which was fixed in MySQL Cluster NDB 7.2.16 (5.5.37-ndb-7.2.16): 2016-11-09 23:56:06 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=400 2016-11-09 23:56:06 [ndbd] INFO -- Watchdog: User time: 42045 System time: 52484 thr_no:0 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads) thr_no:0 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads) thr_no:0 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads) thr_no:0 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads) thr_no:0 - sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads) job buffer full Dumping non-empty job queues: job buffer 0 --> 2, used 31 FULL! 2016-11-09 23:56:06 [ndbd] INFO -- Received signal 6. Running error handler. 2016-11-09 23:56:09 [ndbd] INFO -- Watchdog: User time: 42386 System time: 52484 2016-11-09 23:56:09 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=2805 2016-11-09 23:56:09 [ndbd] INFO -- Watchdog: User time: 42386 System time: 52484 2016-11-09 23:56:09 [ndbd] INFO -- Signal 6 received; Aborted 2016-11-09 23:56:09 [ndbd] INFO -- /export/home/pb2/build/sb_0-17731890-1453887410.3/rpm/BUILD/mysql-cluster-com-7.4.10/mysql-cluster-com-7.4. 10/storage/ndb/src/kernel/ndbd.cpp 2016-11-09 23:56:09 [ndbd] INFO -- Error handler signal shutting down system 2016-11-09 23:56:09 [ndbd] INFO -- Error handler shutdown completed - exiting 2016-11-09 23:56:09 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal recei ved(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. How to repeat: It reproduces repeatedly during high load. I'm not sure but seemed that load pattern looks like: for (int j = 0; j < chunks; ++j) { myTransaction = myNdb->startTransaction(); oper1 = myTransaction->getNdbOperation(table1); oper1->insertTuple(); oper1->equal("id", ids[j]); oper1->setValue("value", values[j]); oper1->setValue("dateFrom", dateFrom[j]); oper1->setValue("dateTo", dateTo[j]); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); oper2 = myTransaction->getNdbOperation(table2); oper2->equal("id", ids[j]); oper2->setValue("count", 1); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); for (int i = 0; i < chunkSize; ++i) { oper1 = myTransaction->getNdbOperation(table1); oper1->readTuple(NdbOperation::LM_Read); oper1->equal("id", ids[j]); myRecAttr = oper1->getValue("value", NULL); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); oper1 = myTransaction->getNdbOperation(table1); oper1->updateTuple(); oper1->equal("id", ids[j]); oper1->setValue("value", CalculateValue(myRecAttr->int64_value())); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); oper2 = myTransaction->getNdbOperation(table2); oper2->readTuple(NdbOperation::LM_Read); oper2->equal("id", ids[j]); myRecAttr = oper2->getValue("count", NULL); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); oper2 = myTransaction->getNdbOperation(table2); oper2->equal("id", ids[j]); oper2->setValue("count", myRecAttr->int64_value() + i + 1); myTransaction->execute(NdbTransaction::NoCommit, NdbOperation::AbortOnError); } myTransaction->execute(NdbTransaction::Commit, NdbOperation::AbortOnError); myNdb->closeTransaction(myTransaction); }