Bug #36537 | Race condition can lead to node crash due to failure in epoll-handling | ||
---|---|---|---|
Submitted: | 6 May 2008 14:58 | Modified: | 20 May 2008 9:43 |
Reporter: | Cyril SCETBON | | |
Status: | Closed | | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1.24 ndb-6.3.13 | OS: | Linux (debian etch) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[6 May 2008 14:58]
Cyril SCETBON
[7 May 2008 6:10]
Jonas Oreland
More debug printouts when it happens
Attachment: pp0 (application/octet-stream, text), 782 bytes.
[7 May 2008 6:14]
Jonas Oreland
Hi, Thanks a lot for the bug report; the epoll code is fairly new. I analyzed the code and can't really find anything wrong with it (I also can't reproduce the crash myself). I have therefore created a patch that produces a more verbose printout before crashing; it's attached to this bug report. I would very much appreciate it if you could apply it, test, get the crash, and report what the new printout says. If you can also attach logs, that would be great (your last ndb_error_reporter tarball was empty for unknown reasons). /jonas
[7 May 2008 8:35]
Cyril SCETBON
We found that the error was triggered by an NDB client running version 5.1.20 that was trying to connect to the cluster (about 1 request per second). However, it is certainly still a bug. Below is the error message: Node 4: Connection attempt from api or mysqld id=19 with ndb-5.1.20 incompatible with mysql-5.1.24 ndb-6.3.13
[7 May 2008 8:47]
Jonas Oreland
Thank you very much. I'll immediately try to see if I can reproduce it using an incompatible API. If I can, I'll fix it today; otherwise I will ask you again to try my patch and send me the new printout, OK? /jonas
[7 May 2008 8:50]
Cyril SCETBON
Ok
[7 May 2008 8:54]
Jonas Oreland
Hi again, I tried connecting a mysql from 5.1.22 to telco-6.3.13 and did not get any crashes. So, if you can apply the patch and post the extended error message, that would be great. Or, if you could make a debug build of ndbd, start it with --core (having done "ulimit -Sc unlimited" first), and then pass on a backtrace, that would also be great (maybe even greater). /Jonas
[7 May 2008 8:58]
Cyril SCETBON
The NDB API client was the MySQL 5.1.20 Community version.
[7 May 2008 9:00]
Jonas Oreland
5.1.20 and 5.1.22 are equivalent with respect to connect/disconnect, so it shouldn't matter. Can you try my suggestions, please? /Jonas PS: I'll also test with 5.1.20, but I would be very surprised if it makes a difference.
[7 May 2008 9:04]
Cyril SCETBON
But the telco version uses NDB storage engine version 6.2 or 6.3, whereas the community version uses version 6.1.
[7 May 2008 9:07]
Cyril SCETBON
I'm sorry, but we've already packaged this version, so I cannot patch and recompile it.
[7 May 2008 9:16]
Jonas Oreland
Hmm, tested with 5.1.20, no difference. If you can't patch, I'm not sure how I can proceed on this, given that I can't reproduce it. Maybe someone else will hit the problem and can try the patch (or a debug build with a backtrace). /jonas
[7 May 2008 9:24]
Cyril SCETBON
Maybe I'll be able to patch and try it once the tests we are running are finished.
[7 May 2008 9:25]
Jonas Oreland
that would be much appreciated /jonas
[7 May 2008 14:05]
Cyril SCETBON
Can I just patch the node manager to test it?
[7 May 2008 14:35]
Cyril SCETBON
Forget it; the error appears on the data nodes, so I have to patch the binaries installed on them.
[8 May 2008 10:05]
Jonas Oreland
Hi, An update: I've managed to reproduce the problem in a micro test program (without ndb) and will produce a patch for it now. I've also found another user who is hitting the problem, and he has confirmed that my fix addresses it. So I'll apply the patch and make a telco-6.3.14 before Monday. /Jonas
[8 May 2008 10:07]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/46500 ChangeSet@1.2588, 2008-05-08 12:07:00+02:00, jonas@perch.ndb.mysql.com +3 -0 ndb - bug#36537 Dont remove socket from epoll-set at all (since there can be a race) given the linux-kernel automatically removes it when it's closed.
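The kernel behavior that the changeset relies on can be illustrated outside NDB. The following is a minimal Python sketch, not the NDB code: the function name `events_around_close` and the use of a Unix socket pair are assumptions made for illustration, and it runs only on Linux, where `select.epoll` is available. It registers a socket, closes it without any explicit unregister call, and checks that the closed socket no longer produces events.

```python
import select
import socket


def events_around_close():
    """Poll an epoll set before and after closing a registered socket.

    Hypothetical helper for illustration only; returns (before, after),
    the event lists seen by epoll around the close.
    """
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(a.fileno(), select.EPOLLIN)
        b.send(b"x")            # make `a` readable
        before = ep.poll(0)     # `a` is reported as readable
        a.close()               # note: no explicit ep.unregister() call
        after = ep.poll(0)      # the kernel has dropped `a` from the set
        return before, after
    finally:
        b.close()
        ep.close()


if __name__ == "__main__":
    before, after = events_around_close()
    print("events before close:", len(before))
    print("events after close:", len(after))
```

The automatic removal only applies once the last file descriptor referring to the underlying open file description has been closed; a descriptor duplicated with `dup()` or inherited across `fork()` would keep the registration alive. The sketch does not reproduce the race itself, only the cleanup that makes the explicit removal unnecessary.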
[8 May 2008 10:27]
Bugs System
Pushed into 5.1.24-ndb-6.3.13
[9 May 2008 7:18]
Bugs System
Pushed into 5.1.23-ndb-6.4.0
[9 May 2008 8:25]
Cyril SCETBON
good news. Thank you Jonas
[9 May 2008 10:12]
Cyril SCETBON
In which cases can the unpatched version crash?
[20 May 2008 9:43]
Jon Stephens
Documented in the 5.1.24-ndb-6.3.14 changelog as follows: A race condition caused by a failure in epoll handling could cause data nodes to fail. Closed, since this is in telco-6.3 only.