Bug #77225 Signals delivered to incorrect or garbled 'trp_client' -> Client crash or hangs
Submitted: 2 Jun 2015 13:08 Modified: 8 Jun 2015 16:33
Reporter: Ole John Aske
Status: Closed
Category:MySQL Cluster: NDB API Severity:S2 (Serious)
Version:7.3.9 OS:Any
Assigned to: CPU Architecture:Any

[2 Jun 2015 13:08] Ole John Aske
Description:
API Signals are delivered to the correct client by TransporterFacade::deliver_signal().
Based on the 'BlockNr' of the destination API client, the correct 'trp_client'
is looked up from the TransporterFacade::m_thread structure, which has a Vector of
trp_client* as one of its members.

It turns out that there is no mutex protection when 'get'ing a trp_client*
from this Vector. This causes problems when other clients connect to
the TransporterFacade, which can cause the trp_client* Vector to
be ::expand'ed. During the expand operation the item array is reallocated,
and the old item array is deleted *before* the reallocated array is assigned.
Thus the deleted array can be grabbed by another malloc and garbage
written into it, which the unprotected getter then reads as a trp_client*.
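
The hazard described above, and one possible way to avoid it, can be
sketched as follows. This is a minimal illustration only, not the actual
NDB source or the patch that was applied: the names TrpClient and
ClientVector are hypothetical stand-ins, and the sketch assumes that
taking one mutex in both the getter and the growing path is acceptable.
It also publishes the new array before freeing the old one, which is the
reverse of the ordering reported here.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the NDB API client type; not the real class.
struct TrpClient { int block_no; };

// Hypothetical, simplified pointer vector (not the NDB Vector template).
class ClientVector {
public:
  ClientVector() : m_items(nullptr), m_size(0), m_capacity(0) {}
  ~ClientVector() { delete[] m_items; }
  ClientVector(const ClientVector&) = delete;
  ClientVector& operator=(const ClientVector&) = delete;

  // Writer: append under the lock; any growth happens while it is held.
  void push_back(TrpClient* client) {
    std::lock_guard<std::mutex> guard(m_mutex);
    if (m_size == m_capacity) expand(m_capacity == 0 ? 2 : m_capacity * 2);
    m_items[m_size++] = client;
  }

  // Reader: takes the same lock, so it can never observe a freed array.
  TrpClient* get(std::size_t index) {
    std::lock_guard<std::mutex> guard(m_mutex);
    return index < m_size ? m_items[index] : nullptr;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> guard(m_mutex);
    return m_size;
  }

private:
  // Caller must hold m_mutex.
  void expand(std::size_t new_capacity) {
    TrpClient** grown = new TrpClient*[new_capacity]();
    for (std::size_t i = 0; i < m_size; ++i) grown[i] = m_items[i];
    TrpClient** old = m_items;
    m_items = grown;            // assign the new array first...
    m_capacity = new_capacity;
    delete[] old;               // ...and only then free the old one
  }

  std::mutex m_mutex;
  TrpClient** m_items;
  std::size_t m_size;
  std::size_t m_capacity;
};
```

With the lock held in both get() and push_back(), a lookup done while
another client connects sees either the old or the new array, never a
freed one. Reordering the assignment and the delete alone would not be
sufficient without the lock, since a reader could still hold the old
pointer when it is freed.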

We see crashes related to this in the AutoTest: testScan -n ScanRead488T
and there are also strange timeouts and crashes in other testScan tests, which
we believe may be related to this.

How to repeat:
testScan -n ScanRead488T.

Force Vector::expand() to do a task switch by inserting
a microsleep just after the old item array has been deleted.
(and before the new one is assigned).
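
The instrumentation suggested above might look roughly like the sketch
below. It is an assumption-laden illustration, not the real
Vector::expand(): RacyClientArray, expand_with_widened_window() and the
pause parameter are hypothetical names, and std::this_thread::sleep_for
stands in for whatever microsleep helper is used in the test build. The
sketch keeps the reported ordering (delete first, assign afterwards) and
only widens the window between the two steps.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical stand-in for the NDB API client type; not the real class.
struct TrpClient { int block_no; };

// Hypothetical item array that mimics the ordering described in the report.
struct RacyClientArray {
  TrpClient** items = nullptr;
  std::size_t size = 0;
  std::size_t capacity = 0;

  RacyClientArray() = default;
  RacyClientArray(const RacyClientArray&) = delete;
  RacyClientArray& operator=(const RacyClientArray&) = delete;
  ~RacyClientArray() { delete[] items; }

  void expand_with_widened_window(std::size_t new_capacity,
                                  std::chrono::microseconds pause) {
    if (new_capacity <= capacity) return;
    TrpClient** grown = new TrpClient*[new_capacity]();
    for (std::size_t i = 0; i < size; ++i) grown[i] = items[i];
    delete[] items;                      // old array freed first (reported ordering)
    // 'items' is dangling here. The pause invites a task switch, so an
    // unlocked reader running now would dereference freed memory.
    std::this_thread::sleep_for(pause);
    items = grown;                       // new array assigned only afterwards
    capacity = new_capacity;
  }

  void push_back(TrpClient* client, std::chrono::microseconds pause) {
    if (size == capacity) {
      expand_with_widened_window(capacity == 0 ? 2 : capacity * 2, pause);
    }
    items[size++] = client;
  }
};
```

Run from a single thread this is still well defined, because nothing
reads 'items' during the pause. The crash only becomes likely once a
second thread performs unlocked lookups while push_back() is growing the
array, which is the situation testScan -n ScanRead488T is reported to hit.
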
[8 Jun 2015 16:33] Jon Stephens
Documented fix in the NDB 7.3.10 and 7.4.7 changelogs, as follows:

    Client lookup for delivery of API signals to the correct client
    by the internal TransporterFacade::deliver_signal() function had
    no mutex protection, which could cause issues such as timeouts
    encountered during testing, when other clients connected to the
    same TransporterFacade.
      
Closed.