Bug #36537 Race condition can lead to node crash due to failure in epoll-handling
Submitted: 6 May 2008 14:58
Modified: 20 May 2008 9:43
Reporter: Cyril SCETBON
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1.24 ndb-6.3.13
OS: Linux (debian etch)
Assigned to: Jonas Oreland
CPU Architecture: Any

[6 May 2008 14:58] Cyril SCETBON
Description:
Nodes are crashing when inserting data.

Here is the error encountered:

Failed to add fd to epoll-set...giving up!: Bad file descriptor
2008-05-06 16:37:06 [ndbd] INFO     -- Received signal 6. Running error handler.
2008-05-06 16:37:06 [ndbd] INFO     -- Signal 6 received; Aborted
2008-05-06 16:37:06 [ndbd] INFO     -- main.cpp
2008-05-06 16:37:06 [ndbd] INFO     -- Error handler signal shutting down system
2008-05-06 16:37:06 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-05-06 16:37:07 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

We are on x86_64.

How to repeat:
restart node.

Suggested fix:
none
[7 May 2008 6:10] Jonas Oreland
More debug printouts when it happens

Attachment: pp0 (application/octet-stream, text), 782 bytes.

[7 May 2008 6:14] Jonas Oreland
Hi, 

Thanks a lot for the bug report; the epoll code is fairly new.
I analyzed the code and can't really find anything wrong with it.
(I also can't reproduce the crash myself.)

I have therefore created a patch that prints more verbose output before crashing;
it's attached to the bug report.

I would very much appreciate it if you could apply it, test, reproduce the crash, and report
what the new printout says. If you can also attach logs, that would be great (your last ndb_error_reporter tarball was empty for unknown reasons).

/jonas
[7 May 2008 8:35] Cyril SCETBON
We found that the error was caused by an NDB client running version 5.1.20 and trying to connect to the cluster (about 1 request per second).

However, it's certainly a bug. Below is the error message:

Node 4: Connection attempt from api or mysqld id=19 with ndb-5.1.20 incompatible with mysql-5.1.24 ndb-6.3.13
[7 May 2008 8:47] Jonas Oreland
Thank you very much.
I'll immediately try to see if I can reproduce this using an incompatible API.

If I can, I'll fix it today;
otherwise I will ask you again to try my patch and send me the new printout.

OK?

/jonas
[7 May 2008 8:50] Cyril SCETBON
Ok
[7 May 2008 8:54] Jonas Oreland
Hi again,

I tried connecting a mysql from 5.1.22 to telco-6.3.13
and did not get any crashes...

So...
if you can apply the patch and post the extended error message
- that would be great
or
if you could make a debug build of ndbd, start it with --core (having run "ulimit -Sc unlimited" first), and then send a backtrace
- that would also be great (maybe even greater)

/Jonas
[7 May 2008 8:58] Cyril SCETBON
The ndb api version was mysql 5.1.20 COMMUNITY version.
[7 May 2008 9:00] Jonas Oreland
5.1.20 and 5.1.22 are equivalent with respect to connect/disconnect,
so it shouldn't matter...

Can you try my suggestions, please?

/Jonas

ps.
i'll also test with 5.1.20, 
but i would be very very surprised if it makes a difference
ds.
[7 May 2008 9:04] Cyril SCETBON
But the telco version uses NDB storage engine version 6.2 or 6.3, whereas the community version uses version 6.1.
[7 May 2008 9:07] Cyril SCETBON
I'm sorry, but we've already packaged this version, so I cannot patch and recompile it.
[7 May 2008 9:16] Jonas Oreland
Hmm... tested with 5.1.20... no difference...

If you can't patch, I'm not sure how I can proceed on this, given that I can't reproduce it...

Maybe someone else will hit the problem and can try the patch (or a debug build with a backtrace from the core).

/jonas
[7 May 2008 9:24] Cyril SCETBON
Maybe I'll be able to patch and try it once the tests we are running have finished.
[7 May 2008 9:25] Jonas Oreland
that would be much appreciated

/jonas
[7 May 2008 14:05] Cyril SCETBON
Can I just patch the node manager to test it?
[7 May 2008 14:35] Cyril SCETBON
Forget it; the error appears on the data nodes, so I have to patch the binaries installed on them.
[8 May 2008 10:05] Jonas Oreland
Hi,

An update: I've managed to reproduce the problem in a micro test program
(without ndb) and will produce a patch for it now.

I've also found another user hitting the problem, and he has confirmed that my fix
addresses it.

So, I'll apply the patch and make a telco-6.3.14 before Monday...

/Jonas
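
The micro test program mentioned above is not included in this report. As a rough, hypothetical illustration only, the C sketch below shows one way the logged "Failed to add fd to epoll-set...giving up!: Bad file descriptor" condition can arise on Linux: if a descriptor has already been closed by the time it is registered, epoll_ctl(EPOLL_CTL_ADD) fails with EBADF. The function name add_closed_fd_errno and the use of a pipe in place of a socket are assumptions made for the example, and the close happens in the same thread rather than in a racing one.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the NDB source: try to register an
   fd that has already been closed and return the errno that epoll_ctl
   reports (0 if the call unexpectedly succeeds, -1 on setup failure). */
int add_closed_fd_errno(void)
{
    int fds[2];
    int err = 0;
    struct epoll_event ev;

    int epfd = epoll_create(16);
    assert(epfd >= 0);
    if (pipe(fds) != 0) {
        close(epfd);
        return -1;
    }

    /* Stands in for "another thread closed the socket first". */
    close(fds[0]);

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) != 0)
        err = errno;  /* EBADF: the fd number no longer refers to an open file */

    close(fds[1]);
    close(epfd);
    return err;
}
```

Calling add_closed_fd_errno() from a small main() and passing the result to strerror() gives the text of the system error; in a real race the close and the add would run in different threads, so the failure would only show up intermittently.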
[8 May 2008 10:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46500

ChangeSet@1.2588, 2008-05-08 12:07:00+02:00, jonas@perch.ndb.mysql.com +3 -0
  ndb - bug#36537
    Don't remove the socket from the epoll-set at all (since there can be a race),
    given that the Linux kernel automatically removes it when it's closed.
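
The kernel behaviour that the changeset comment relies on can be checked with a small stand-alone C sketch. It is not the committed patch; the helper names (register_pipe, events_after_close, del_after_close_errno) are hypothetical and a pipe stands in for a transporter socket. It shows that once a registered descriptor is closed, the epoll set stops reporting it, and that an explicit EPOLL_CTL_DEL issued afterwards fails with EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll set and register the read end of a fresh pipe in it.
   All names here are illustrative, not taken from the NDB source. */
static int register_pipe(int *epfd, int fds[2])
{
    struct epoll_event ev;

    *epfd = epoll_create(16);
    assert(*epfd >= 0);
    if (pipe(fds) != 0)
        return -1;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    return epoll_ctl(*epfd, EPOLL_CTL_ADD, fds[0], &ev);
}

/* Number of ready events reported after the registered fd is closed
   while unread data is still pending on it (-1 on setup failure). */
int events_after_close(void)
{
    int epfd, fds[2], after;
    struct epoll_event out;

    if (register_pipe(&epfd, fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1)
        return -1;
    if (epoll_wait(epfd, &out, 1, 0) != 1)  /* readable before the close */
        return -1;

    close(fds[0]);                          /* kernel drops the registration */
    after = epoll_wait(epfd, &out, 1, 0);

    close(fds[1]);
    close(epfd);
    return after;
}

/* errno from an explicit EPOLL_CTL_DEL issued after the fd was closed
   (0 if the call unexpectedly succeeds, -1 on setup failure). */
int del_after_close_errno(void)
{
    int epfd, fds[2], err = 0;
    struct epoll_event ev;

    if (register_pipe(&epfd, fds) != 0)
        return -1;
    close(fds[0]);

    memset(&ev, 0, sizeof ev);
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fds[0], &ev) != 0)
        err = errno;

    close(fds[1]);
    close(epfd);
    return err;
}
```

One caveat: the automatic removal only happens when the last descriptor referring to the underlying open file is closed, so a descriptor duplicated with dup() or inherited across fork() keeps the registration alive until every copy is closed.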
[8 May 2008 10:27] Bugs System
Pushed into 5.1.24-ndb-6.3.13
[9 May 2008 7:18] Bugs System
Pushed into 5.1.23-ndb-6.4.0
[9 May 2008 8:25] Cyril SCETBON
Good news.

Thank you Jonas
[9 May 2008 10:12] Cyril SCETBON
In which cases can the unpatched version crash?
[20 May 2008 9:43] Jon Stephens
Documented in the 5.1.24-ndb-6.3.14 changelog as follows:

        A race condition caused by a failure in epoll handling could cause data
        nodes to fail.

Closed, since this is in telco-6.3 only.