Bug #28804 | Unknown error returned on a query using joins. | ||
---|---|---|---|
Submitted: | 31 May 2007 10:20 | Modified: | 15 Sep 2007 11:05 |
Reporter: | Geert Vanderkelen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.0.40 | OS: | Any |
Assigned to: | Stewart Smith | CPU Architecture: | Any |
Tags: | bfsm_2007_06_21, join, ndb, uknown |
[31 May 2007 10:20]
Geert Vanderkelen
[31 May 2007 10:23]
Geert Vanderkelen
Verified using 5.0.34 (on which it was originally reported) and 5.0.40.
[31 May 2007 10:25]
Geert Vanderkelen
Configuration used with 2 data nodes: [NDBD DEFAULT] NoOfReplicas=2 DataMemory=4G IndexMemory=300M MaxNoOfOrderedIndexes = 256 MaxNoOfUniqueHashIndexes = 256 MaxNoOfConcurrentOperations = 300000 MaxNoOfConcurrentIndexOperations = 32000 NoOfFragmentLogFiles = 32 TimeBetweenLocalCheckpoints = 18
[20 Jun 2007 3:11]
Stewart Smith
have reproduced. spends maybe 5 minutes doing something, then does nothing for another 5 (or so), then unknown error. now working on finding what "something" is and why we stop doing "something"
[20 Jun 2007 7:00]
Stewart Smith
timeout during ha_ndbcluster::unique_index_read error being lost (not returned upwards), and next time there is an error detected, we translate 0 (no error) to 1 unknown error. So really two bugs: 1) loss of correct error 2) why we're timing out. The missing error is 4012, $ perror --ndb 4012 NDB error code 4012: Request ndbd time-out, maybe due to high load or communication problems: Unknown result: Unknown result error Which is certainly interesting... as I'm running on a single machine (so no communications problems) and not too high load... The timeout that is being hit is set by: ndb_mgm_configuration_iterator iter(* props, CFG_SECTION_NODE); ... Uint32 timeout = 120000; iter.first(); for (iter.first(); iter.valid(); iter.next()) { Uint32 tmp1 = 0, tmp2 = 0; iter.get(CFG_DB_TRANSACTION_CHECK_INTERVAL, &tmp1); iter.get(CFG_DB_TRANSACTION_DEADLOCK_TIMEOUT, &tmp2); tmp1 += tmp2; if (tmp1 > timeout) timeout = tmp1; } (where timeout is milliseconds). So it's either 120,000 (2 minutes) OR the maximum TimeBetweenInactiveTransactionAbortCheck+TransactionDeadlockDetectionTimeout across all nodes. The default TimeBetweenInactiveTransactionAbortCheck is 1000 The default TransactionDeadlockDetectionTimeout is 1200 So the default here would be 2200... which is LESS than 120,000, so 2minutes is selected. But where this timout is called, we're using 3 * timeout... so really 6 minutes. Which is consistent with what I've been seeing.
[21 Jun 2007 16:52]
Stewart Smith
So further investigation into this has lead to some interesting points... but it ended up having to solve 2) first to really get a grip on 1). But the good news here is, I've been making good progress. There is this resource in NDB known as "TransactionBufferMemory" and, by default, this is set to 1MB in the config file. This memory is used for a tiny amount of time during unique index lookups for storing some key information in. Only under intense parallel query load should this ever need increasing. However.. my current theory is that we have a bug where in the case of a unique index lookup does not find a match, we keep this memory until the end of the transaction, and not just until the end of that operation. Increasing TransactionBufferMemory does help alleviate the problem, but (in my test) even raising it to 10 or 100MB didn't allow the query to finish (although it ran for a while longer). So using this as a work around is not practical. I currently have a work-in-progress patch here that does seem to free the memory at the right place and the query has been running for at least 18+ times longer than it has before for me (it's still running). So I'm: a) rather hopeful that this is the correct fix b) 4 tracks into The Beatles white and it's still running (where previously it would have stopped well within the first track). c) tomorrow, see the result of the query and start shaping the patch into something sensible and discuss further with other members of the development team.
[29 Jun 2007 17:17]
Stewart Smith
patch 01 in series
Attachment: improve_timeout_waiting_for_reply_handling.patch (text/x-patch), 2.35 KiB.
[29 Jun 2007 17:17]
Stewart Smith
patch 02 in series
Attachment: fix_ha_ndbcluster_error_on_index_read.patch (text/x-patch), 1005 bytes.
[29 Jun 2007 17:17]
Stewart Smith
patch 03 in series
Attachment: release_indexop_on_op_complete_not_txn.patch (text/x-patch), 896 bytes.
[29 Jun 2007 17:18]
Stewart Smith
patch 04 (of 4) in series
Attachment: send_reply_on_failed_saveINDXATTRINFO.patch (text/x-patch), 1.21 KiB.
[16 Jul 2007 5:40]
Stewart Smith
New patches posted to commits: http://lists.mysql.com/commits/30943 fixes based on review
[13 Aug 2007 7:23]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/32429 ChangeSet@1.2473, 2007-08-13 09:22:42+02:00, jonas@perch.ndb.mysql.com +8 -0 ndb - bug#28804 Handle out of transaction buffer in TC for INDX lookups
[14 Aug 2007 3:47]
Stewart Smith
I'm okay with the extra patch that Jonas added. Autotest run seems good (better than the last time we ran a 5.0 tree)
[24 Aug 2007 5:45]
Jonas Oreland
pushed to 50-ndb
[24 Aug 2007 5:46]
Stewart Smith
jonas has pushed to 5.0-ndb
[7 Sep 2007 19:04]
Jon Stephens
Documented in mysql-5.1.22-ndb-6.2.5 changelog as A query using joins between several large tables and requiring unique index lookups failed to complete, eventually returning <errortext>Uknown Error</errortext> after a very long period of time. This occurred due to inadequate handling of instances where the Transaction Coordinator ran out of <literal>TransactionBufferMemory</literal>, when the cluster should have returned NDB error code 4012 (<errortext>Request ndbd time-out</errortext>). (Bug #28804) Left in PQ status pending pushes to mainline trees.
[14 Sep 2007 16:25]
Bugs System
Pushed into 5.0.50
[14 Sep 2007 16:26]
Bugs System
Pushed into 5.1.23-beta
[15 Sep 2007 11:05]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html Also documented fix in 5.0.50 and 5.1.23 changelogs.