Bug #75041 TransporterFacade::reset_send_buffer might reset a send_buffer in use by 'send'
Submitted: 28 Nov 2014 10:15
Modified: 12 Jan 2015 17:56
Reporter: Ole John Aske
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.3.8
OS: Any
Assigned to:
CPU Architecture: Any

[28 Nov 2014 10:15] Ole John Aske
TransporterFacade::reset_send_buffer() resets the two
m_send_buffers[node] buffers: 'm_buffer' and 'm_out_buffer'.

However, these are designed to be protected as follows:

1) 'm_buffer' should only be updated when holding the
   m_send_buffers[node].m_mutex lock.

2) 'm_out_buffer' is protected by 'm_send_buffers[node].m_sending'.
   When this flag is set, the buffer is 'owned' by a thread
   actively sending, and consuming the m_out_buffer contents.
   Thus this buffer should not be reset while this flag
   is set.

Currently ::reset_send_buffer breaks both of these rules.
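
The two rules above can be illustrated with a minimal, standalone sketch.
This is not the actual NDB code or the actual fix: only the member names
m_mutex, m_sending, m_buffer and m_out_buffer come from this report. The
struct, the *_sketch functions, the m_reset_pending flag and the choice to
guard m_sending with m_mutex are all assumptions made for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one m_send_buffers[node] entry.
struct SendBufferSketch {
  std::mutex m_mutex;              // guards m_buffer (and, by assumption, the flags)
  bool m_sending = false;          // true while a sender 'owns' m_out_buffer
  bool m_reset_pending = false;    // hypothetical: reset deferred to the sender
  std::vector<char> m_buffer;      // filled by producers under m_mutex
  std::vector<char> m_out_buffer;  // consumed by the thread actively sending
};

// Reset that respects both rules. Returns true if m_out_buffer was
// cleared immediately, false if that part was deferred to the sender.
bool reset_send_buffer_sketch(SendBufferSketch& b) {
  std::lock_guard<std::mutex> guard(b.m_mutex);
  b.m_buffer.clear();              // rule 1: only touched under m_mutex
  if (b.m_sending) {
    b.m_reset_pending = true;      // rule 2: leave m_out_buffer to its owner
    return false;
  }
  b.m_out_buffer.clear();
  return true;
}

// A sender claims m_out_buffer and moves pending data into it.
// Returns false if another thread is already sending.
bool begin_send_sketch(SendBufferSketch& b) {
  std::lock_guard<std::mutex> guard(b.m_mutex);
  if (b.m_sending) return false;
  b.m_sending = true;
  b.m_out_buffer.insert(b.m_out_buffer.end(),
                        b.m_buffer.begin(), b.m_buffer.end());
  b.m_buffer.clear();
  return true;
}

// The sender releases ownership and applies any deferred reset.
void end_send_sketch(SendBufferSketch& b) {
  std::lock_guard<std::mutex> guard(b.m_mutex);
  assert(b.m_sending);             // only the owning sender may call this
  if (b.m_reset_pending) {
    b.m_out_buffer.clear();
    b.m_reset_pending = false;
  }
  b.m_sending = false;
}
```

With this shape, a reset arriving between begin_send_sketch() and
end_send_sketch() never modifies m_out_buffer under the sender; deferring
the clear is one possible design, chosen here only to keep the sketch short.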

This is likely a regression introduced by WL#3860, the 'ATC patches' (7.3 and later).

It is hard to tell what problems this could cause in everyday use of
MySQL Cluster. It leaves the contents of the send buffers undefined
if a reset happens during ::performSend(), so garbage can be sent or
signals can simply go missing. This might explain some of the
instability seen in AutoTests doing restarts.


How to repeat:
Seen when running ./testNodeRestart -l 100 -n MixedPkReadPkUpdate
for a long time. A huge 'loop' argument ('-l') is needed.

This also needed instrumented code which added an
assert(!m_send_buffers[node].m_sending) in

[12 Jan 2015 17:56] Jon Stephens
Documented fix as follows in the NDB 7.3.8 and 7.4.3 changelogs:

    In the NDB kernel, it was possible for a TransporterFacade
    object to reset a send buffer while the data contained by the buffer
    was being sent, which could lead to a race condition.