MySQL Bugs: #75041: TransporterFacade::reset_send_buffer might reset a send

Bug #75041	TransporterFacade::reset_send_buffer might reset a send_buffer in use by 'send'
Submitted:	28 Nov 2014 10:15	Modified:	12 Jan 2015 17:56
Reporter:	Ole John Aske
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	7.3.8	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
TransporterFacade::reset_send_buffer() resets the two
m_send_buffers[node] buffers: 'm_buffer' and 'm_out_buffer'.

However, these are designed to be protected as follows:

1)'m_buffer' should only be updated when holding the
   m_send_buffers[node].m_mutex lock.

2)'m_out_buffer' is protected by 'm_send_buffers[node].m_sending'.
   When this flag is set, the buffer is 'owned' by a thread
   actively sending, and consuming the m_out_buffer contents.
   Thus this buffer should not be reset while this flag
   is set.

Currently ::reset_send_buffer breaks both of these rules.
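
A minimal sketch of a reset that would respect both rules is shown below.
It is not the actual NDB code or the committed fix. The struct, the
function names and the 'm_reset_pending' flag are hypothetical, and the
sketch assumes that 'm_sending' is only read and written while holding
m_mutex.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one m_send_buffers[node] entry.
struct SendBufferSketch {
  std::mutex m_mutex;              // guards m_buffer and (assumed) both flags
  bool m_sending = false;          // true while a sender owns m_out_buffer
  bool m_reset_pending = false;    // hypothetical: reset deferred to the sender
  std::vector<char> m_buffer;      // filled by threads queuing signals
  std::vector<char> m_out_buffer;  // consumed by the thread actively sending
};

// Rule 1: m_buffer is only touched under m_mutex.
// Rule 2: m_out_buffer is left alone while m_sending is set.
// Returns true if m_out_buffer was cleared now, false if it was deferred.
bool reset_send_buffer_sketch(SendBufferSketch& b) {
  std::lock_guard<std::mutex> guard(b.m_mutex);
  b.m_buffer.clear();
  if (b.m_sending) {
    b.m_reset_pending = true;  // the sender discards the data when it is done
    return false;
  }
  b.m_out_buffer.clear();
  return true;
}

// Called by the sending thread when it gives up ownership of m_out_buffer.
void finish_send_sketch(SendBufferSketch& b) {
  std::lock_guard<std::mutex> guard(b.m_mutex);
  assert(b.m_sending);  // only the current owner may call this
  b.m_sending = false;
  if (b.m_reset_pending) {
    b.m_out_buffer.clear();
    b.m_reset_pending = false;
  }
}
```

Deferring the reset to the sender is only one possible design. Waiting
for 'm_sending' to clear before resetting would satisfy the same rules.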

This is likely a regression introduced by WL#3860, the 'ATC patches' (7.3 and later).

It is hard to tell what problems this could cause in everyday use of
MySQL Cluster. If the reset happens during ::performSend(), the contents
of the send buffers are undefined: garbage can be sent, or signals can
simply go missing. This might explain some of the instability in
AutoTests doing restarts.

 

How to repeat:
Seen when running ./testNodeRestart -l 100 -n MixedPkReadPkUpdate
for a long time. A very large 'loop' argument ('-l') is needed.

It also required instrumented code that added an
assert(!m_send_buffers[node].m_sending) to
reset_send_buffer.
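
The instrumentation could look roughly like the sketch below. The types
and function names are hypothetical stand-ins, not the NDB sources. The
flag is a std::atomic here only so that the check itself is free of data
races in this standalone example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the per-node send state.
struct NodeSendState {
  std::atomic<bool> m_sending{false};  // set while a thread is sending
};

// True if a reset right now would hit a buffer owned by a sender.
bool reset_would_race(const NodeSendState& s) {
  return s.m_sending.load();
}

void instrumented_reset(NodeSendState& s) {
  // The added check: aborts in a debug build if the rule is broken.
  assert(!reset_would_race(s));
  // ... the existing reset of m_buffer and m_out_buffer would follow here
}
```

With such an assert in place, a long testNodeRestart run aborts at the
first reset that overlaps a send instead of silently corrupting the
buffer.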

Fix documented in the NDB 7.3.8 and 7.4.3 changelogs as follows:

    In the NDB kernel, it was possible for a TransporterFacade
    object to reset a send buffer while the data contained by the buffer
    was being sent, which could lead to a race condition.
      
Closed.