Bug #55906 Read errors in MGMAPI caused by interrupted 'poll'
Submitted: 11 Aug 2010 8:30
Modified: 12 Aug 2010 7:36
Reporter: Magnus Blåudd
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 6.3.35
OS: Any
Assigned to: Magnus Blåudd
CPU Architecture: Any

[11 Aug 2010 8:30] Magnus Blåudd
Description:
The 'poll' and 'select' calls made by the MGM API are not interrupt safe. That is, if the process catches a signal while waiting for an event on one or more sockets, the call fails with -1 and "errno" set to EINTR.

This mainly affects the MGM API: for example, the functions 'ndb_mgm_logevent_get_next' and 'ndb_mgm_get_status2' return read errors. Other MGM API functions are also affected, but since they spend less time waiting on socket events, the problems are less noticeable there.

The connections between nodes in the cluster are not affected since they are already handling interrupted waits in their send/receive loop.

How to repeat:
Add a signal handler that catches SIGUSR1 to, for example, tools/ndb_waiter or test/tools/eventlog. While the program is running, send it SIGUSR1 in a loop, for example "while killall -USR1 eventlog; do true; done"
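
For reference, a minimal sketch of such a handler and of how the interrupted wait shows up. The names on_usr1, install_usr1_handler and wait_was_interrupted are made up for illustration and do not exist in ndb_waiter or eventlog; the wait here uses no descriptors at all, whereas the real tools wait on the management server socket.

```c
/* Ask for POSIX.1-2008 declarations (sigaction, poll) on strict compilers. */
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>

/* Counts delivered SIGUSR1 signals; sig_atomic_t is safe to touch in a handler. */
static volatile sig_atomic_t usr1_count = 0;

/* Hypothetical handler: it only has to exist so the signal is caught
 * rather than terminating the process. */
static void on_usr1(int signo)
{
    (void)signo;
    usr1_count++;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on failure. */
int install_usr1_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Wait up to timeout_ms in a plain poll() with no descriptors.
 * Returns 1 if the wait was cut short by a signal (EINTR),
 * 0 if it timed out normally, -1 on any other error. */
int wait_was_interrupted(int timeout_ms)
{
    int res = poll(NULL, 0, timeout_ms);
    if (res == -1)
        return errno == EINTR ? 1 : -1;
    return 0;
}
```

With the handler installed, a caller that treats every -1 from poll as a read error will fail as soon as the killall loop delivers a signal during the wait, which is the behaviour described above.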

Suggested fix:
Make the 'ndb_socket_poller::poll' function EINTR safe and rename the old 'poll' function to 'poll_unsafe', allowing the parts of NDB that do not need an EINTR-safe function to use that directly.
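
A rough sketch of what an EINTR-safe wait can look like, written against plain POSIX poll() rather than the NDB classes. This is not the committed patch: poll_eintr_safe and now_ms are hypothetical names, and recomputing the remaining timeout after each interruption is an assumption about the desired behaviour, made so that a steady stream of signals cannot extend the wait indefinitely.

```c
/* Ask for POSIX.1-2008 declarations (poll, clock_gettime) on strict compilers. */
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current time in milliseconds from a monotonic clock, so that
 * wall-clock adjustments do not distort the remaining timeout. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical wrapper: like poll(), but retries when a signal interrupts
 * the wait. A negative timeout_ms means "wait forever", as with poll(). */
int poll_eintr_safe(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    const long long start = now_ms();
    int remaining = timeout_ms;

    for (;;) {
        int res = poll(fds, nfds, remaining);
        if (res >= 0 || errno != EINTR)
            return res;      /* ready count, timeout (0), or a real error */

        if (timeout_ms < 0)
            continue;        /* infinite wait: simply retry */

        /* Interrupted: wait again only for the time that is still left. */
        long long elapsed = now_ms() - start;
        if (elapsed >= timeout_ms)
            return 0;        /* timeout expired while signals were handled */
        remaining = timeout_ms - (int)elapsed;
    }
}
```

Callers that want to react to a signal themselves can keep calling poll() directly, which is the role the report proposes for 'poll_unsafe'.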
[11 Aug 2010 10:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115470
[11 Aug 2010 10:34] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.36 (revid:magnus.blaudd@sun.com-20100811102329-py9ckwovrs2s5ul9) (version source revid:magnus.blaudd@sun.com-20100811102329-py9ckwovrs2s5ul9) (merge vers: 5.1.47-ndb-6.3.36) (pib:20)
[11 Aug 2010 10:34] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.17 (revid:magnus.blaudd@sun.com-20100811102805-eg380653qn5t7wd7) (version source revid:magnus.blaudd@sun.com-20100811102615-u0mpv7hm9z81xseo) (merge vers: 5.1.47-ndb-7.0.17) (pib:20)
[11 Aug 2010 11:25] Magnus Blåudd
Pushed to 6.3.36, 7.0.17 and 7.1.6
[12 Aug 2010 7:36] Jon Stephens
Documented in the NDB-6.3.36, 7.0.17, and 7.1.6 changelogs, as follows:

        The poll and select calls made by the MGM API were not
        interrupt-safe; that is, a signal caught by the process while
        waiting for an event on one or more sockets returned error -1
        with errno set to EINTR. This caused problems with MGM API
        functions such as ndb_mgm_logevent_get_next() and
        ndb_mgm_get_status2().

        To fix this problem, the internal ndb_socket_poller::poll()
        function has been made EINTR safe.

        The old version of this function has been retained as
        poll_unsafe(), for use by those parts of NDB that do not need
        the EINTR-safe version of the function.

Also noted behaviour change in MGM API docs.

Closed.
[8 Feb 2011 2:19] Omer Barnir
Ignore last comment - put on wrong bug