Bug #20442 | NDB API: scans with exclusive lock and update of one record cause a deadlock | ||
---|---|---|---|
Submitted: | 13 Jun 2006 21:52 | Modified: | 2 Feb 2007 4:07 |
Reporter: | Anatoly Pidruchny (Candidate Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: NDB API | Severity: | S3 (Non-critical) |
Version: | 5.0.22-max | OS: | Any (all) |
Assigned to: | Jon Stephens | CPU Architecture: | Any |
[13 Jun 2006 21:52]
Anatoly Pidruchny
[13 Jun 2006 21:54]
Anatoly Pidruchny
Test program
Attachment: scan_update1_deadlock.cpp (application/octet-stream, text), 7.67 KiB.
[13 Jun 2006 21:54]
Anatoly Pidruchny
Cluster configuration file
Attachment: config.ini (application/octet-stream, text), 448 bytes.
[14 Jun 2006 1:05]
Anatoly Pidruchny
The test program with added NdbScanOperation::SF_OrderBy parameter in readTuples.
Attachment: scan_update1_deadlock.cpp (application/octet-stream, text), 7.69 KiB.
[14 Jun 2006 1:12]
Anatoly Pidruchny
Adding the NdbScanOperation::SF_OrderBy parameter into the NdbIndexScanOperation::readTuples made the problem even more evident. After this change the two threads in the test program deadlock even earlier in the NdbIndexScanOperation::nextResult(true). The nextResult operation fail and the NdbError object shows that the deadlock really happened. Now the output of the program looks like: Linux: Populating the table Thread: 1084578144 - starting scan Thread: 1095068000 - starting scan Thread: 1084578144 - updating record 0 Thread: 1084578144 - closing the scan Thread: 1084578144 - the scan is closed in 0 ms Thread: 1084578144 - starting scan Error in scan_update1_deadlock.cpp, line: 218, code: 274, msg: Time-out in NDB, probably caused by deadlock. HP-UX: Populating the table Thread: 7 - starting scan Thread: 8 - starting scan Thread: 7 - updating record 0 Thread: 7 - closing the scan Thread: 7 - the scan is closed in 1 ms Thread: 7 - starting scan Error in scan_update1_deadlock.cpp, line: 218, code: 274, msg: Time-out in NDB, probably caused by deadlock. Error in scan_update1_deadlock.cpp, line: 218, code: 274, msg: Time-out in NDB, probably caused by deadlock.
[27 Jun 2006 15:01]
Jonas Oreland
Hi, thx for excellent bug report and test program! 1) The lock-wait-timeout is expected when adding the SF_OrderBy flag * Each fragment is sorted (locally) * ndbapi will perform merge sort with rows from all fragments * two threads can easily deadlock if they obtain locks from different fragments conclusion: * (exclusive) lock scan with SF_OrderBy is not something I would recommend * I cant (atleast right now) think of any good way to avoid problem 2) scan close - lock-wait-timeout * this is certainly unnecessary the reason that it happens is that ndbapi currently wait for all outstanding fragment scan requests before closing the scan. /Jonas
[27 Jun 2006 17:46]
Anatoly Pidruchny
Hi, Jonas, 1) If the lock-wait-timeout problem with SF_OrderBy can not be avoided then it would be really nice to describe it briefly in the NDB API header files. A couple of sentences in the comments for the readTuples functions would be very helpful. 2) Do I understand you correctly, that you acknowledge the scan close - lock-wait-timeout problem and are going to work on the fix? Thank you very much, Anatoly.
[27 Jun 2006 18:14]
Jonas Oreland
> 1) If the lock-wait-timeout problem with SF_OrderBy can not be avoided then it > would be really nice to describe it briefly in the NDB API header files. A > couple of sentences in the comments for the readTuples functions would be very > helpful. * that's a good idea. * we've also briefly discussed adding an option which will sacrifice parallelism, but will reduce likelyhood of deadlock. This will however probably not be impl. in the immediate future... 2) Do I understand you correctly, that you acknowledge the scan close - lock-wait-timeout problem and are going to work on the fix? yes, i expect to have a patch for you to test tomorrow...
[28 Jun 2006 9:27]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8385
[28 Jun 2006 9:31]
Jonas Oreland
Hi, I have now made a patch that fixes the problem. The scan close time as reported by your program still varies a bit on my computer, but this is due to thread scheduling on my machine (2.6.12-gentoo-r10 #8 SMP AMD64 Dual core, libc-2.3.6) And, they are building 5.0.23 today, so I'm afraid that this (or the other patch) will not make that build...but next Please test patch /Jonas
[28 Jun 2006 10:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8389
[28 Jun 2006 13:17]
Anatoly Pidruchny
Hi, Jonas, I have just tested the fix. Yes, the deadlock does not happen any more. But what is strange is that the close time varies greatly: Thread: 7 - the scan is closed in 1 ms Thread: 7 - the scan is closed in 14 ms Thread: 8 - the scan is closed in 1 ms Thread: 8 - the scan is closed in 1 ms Thread: 7 - the scan is closed in 28 ms Thread: 8 - the scan is closed in 17 ms Thread: 7 - the scan is closed in 1 ms ... As you can see, the close time is either 1 ms or 14 ms or more and very rarely anything in between. I do not think it can be explained just by thread scheduling. It could be because of the adaptive send algorithm or something like that, even though in my test program the close() is called with forceSend set to true. Bottom line, the fix is OK and should be accepted for the next version. But it would be great if you could find out why the close takes 14 ms or more so often. If you want, I can report another bug for this issue. Thanks, Anatoly.
[29 Jun 2006 9:52]
Tomas Ulin
pushed to 5.1.12
[4 Jul 2006 11:40]
Jonas Oreland
pushed into 5.0.24
[4 Jul 2006 13:21]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://www.mysql.com/doc/en/Installing_source_tree.html
[4 Jul 2006 13:22]
Jon Stephens
Documented bugfix in 5.0.24/5.1.12 changelogs. Closed.
[4 Jul 2006 13:32]
Jon Stephens
Anatoly, Thanks for a really fine bug report.
[6 Jul 2006 7:37]
Jonas Oreland
Hi, Very sorry to inform you that I have to revert current patch, as it breaks some error cases... And I've come to unfortunate conclusion that to fix this, I need to make protocol changes... So I'll revert the patch for now...
[6 Jul 2006 7:38]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8811
[6 Jul 2006 10:07]
Jon Stephens
Commented out bugfix report in 5.0/5.1 changelogs pending further developments.
[1 Feb 2007 17:35]
Jonas Oreland
Comment to be added to NdbIndexScanOperation (and docs) When scanning with exclusive lock, extra care must be taken as if two threads performs this simultaniously over same range likelyhood of deadlock is high. If also scan is also ordered, likelyhood of deadlock is even higher. Currently also NdbIndexScan::close() is affected by this deadlock, as all outstanding requests are waited for before actual scan in closed.
[1 Feb 2007 17:36]
Jonas Oreland
Extra note: The implication on close() will hopefully be resolved in a future release.
[2 Feb 2007 4:07]
Jon Stephens
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products. I've added cautionary notes to the listings for the NdbScanOperation::readTuples() and ::close() methods as well as a pointer in the scan operations overview section. http://lists.mysql.com/commits/19222