Bug #90761 api node crashes under heavy connect disconnect
Submitted: 5 May 2018 8:10 Modified: 14 May 2018 15:50
Reporter: Gilad Odinak Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: NDB API Severity:S3 (Non-critical)
Version:7.5.8 OS:CentOS
Assigned to: MySQL Verification Team CPU Architecture:Any

[5 May 2018 8:10] Gilad Odinak
Description:
This is related to Bug #77225 which added mutex to serialize access to 
NdbScanOperation::receiver_delivered
NdbScanOperation::receiver_completed

However, it does not protect the members m_sent_receivers and m_sent_receivers_count from being reset. 

Sometimes, these functions are called when m_sent_receivers_count == 0 and then

Uint32 last = m_sent_receivers_count - 1;
...
NdbReceiver * move = m_sent_receivers[last];

causes a crash.

How to repeat:
Happens randomly when used in a large production system. I don't have a simple text case/code

Suggested fix:
My simple workaround is to add

if (m_sent_receivers_count == 0) {
   setErrorCodeAbort(Err_scanAlreadyComplete);
   return;
}
[5 May 2018 15:42] Gilad Odinak
Additional logging and work around bug

Attachment: NdbScanOperation.cpp (text/x-c++src), 121.32 KiB.

[5 May 2018 15:45] Gilad Odinak
Sample output from additional logging

In receiver_delivered: idx (3) > last (1)
In receiver_completed: idx (3) > last (0)
In receiver_completed: last (-1) < 0, m_sent_receivers_count 0
In receiver_completed: setting error Err_scanAlreadyComplete
[10 May 2018 19:58] MySQL Verification Team
Hi,
I'm failing to understand what is actually going on. Your client is crashing? You are doing a connect/disconnect for every db query (your synopss say "heavy connect disconnect") or you only reconnect if you lose connection and you are running a lot of queries? Can you please explain what you actually do when you have a crash and how this new mutex affected you?

You say that you work around the problem by checking if "m_sent_receivers_count" is 0 but if it is 0 and you run NdbReceiver * move = m_sent_receivers[last]; and last is m_sent_receivers_count -1 they you obviously out of bounds?

thanks
Bogdan
[11 May 2018 17:34] Gilad Odinak
Hi Bogdan,

I'm sorry I wasn't too clear.

The crash is in mysqld running on an API node.  It happens when you have connectivity problems between the application (using mysql client library) connecting to the API node. What I think happens is the tcp connection dropping mid transmission. I cannot reproduce it when connectivity is reliable.

The attached modified source file is a bit confusing because I added both logging and some checks to prevent the crash

Here I only perform the assignment if m_sent_receivers_count is greater than zero (so last is greater or equal to zero which is fine)

    if (m_sent_receivers_count > 0) {
        NdbReceiver * move = m_sent_receivers[last];
        ....

And here I only set the error code if last is less than zero (but since it is an unsigned value it will contain a positive value, for example -1 will manfiest itself as 0x7fffffff)

    if (last >= 0x7ffffffe) {
      my_safe_printf_stderr("In %s: setting error Err_scanAlreadyComplete\n",__FUNCTION__);
      setErrorCodeAbort(Err_scanAlreadyComplete);
    }

Thanks
[12 May 2018 3:03] MySQL Verification Team
Hi Gilad,

> The crash is in mysqld running on an API node. 

?!
so mysqld is crashing? (the "api node" does not mean anything, there is ndb api application connecting to cluster, there is data node, management node and sql node, you have API slots for applications using ndbapi to connect but "api node" is unclear definition)

> application (using mysql client library) connecting to the API node

This is unclear.

Your application is connecting to mysqld using mysqlclient library?

>  connectivity problems between the application (using mysql client library) connecting to the API node.

Not sure where is the connectivity problem? Between mysqld and cluster or between your client app and mysqld server?

I need more data about the crash, is there a log on the crashed mysqld, a core file?

all best
Bogdan
[12 May 2018 5:08] Gilad Odinak
Here's the tail of the stack trace

/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf73a1b]
/usr/sbin/mysqld(handle_fatal_signal+0x461)[0x82ab01]
/lib64/libpthread.so.0(+0xf5e0)[0x7f9105ce15e0]
/usr/sbin/mysqld(_ZN16NdbScanOperation18receiver_deliveredEP11NdbReceiver+0x2a)[0x1370e1a]
/usr/sbin/mysqld(_ZN14NdbTransaction19receiveSCAN_TABCONFEPK12NdbApiSignalPKjj+0x192)[0x1368442]

NdbTransaction::receiveSCAN_TABCONF  calls NdbScanOperation::receiver_delivered

That is at line 149 below

mysql-cluster-gpl-7.5.8/storage/ndb/src/ndbapi/NdbTransactionScan.cpp

147:      else if (tOp->execSCANOPCONF(tcPtrI, totalLen, opCount))
148:      {
149:        theScanningOp->receiver_delivered(tOp);
150:        retVal = 0;
151:      }

It looks like receiver_delivered accesses invalid memory address at line 1619 below

1610: void
1611: NdbScanOperation::receiver_delivered(NdbReceiver* tRec){
1612:   if(theError.code == 0){
...
1616:     Uint32 idx = tRec->m_list_index;
1617:     Uint32 last = m_sent_receivers_count - 1;
1618:     if(idx != last){
1619:       NdbReceiver * move = m_sent_receivers[last];
1620:       m_sent_receivers[idx] = move;
1621:       move->m_list_index = idx;
1622:     }
1623:     m_sent_receivers_count = last;
[12 May 2018 5:14] Gilad Odinak
The API node (whose crashed back trace is above) is machine with 64GB RAM, running mysqld with this my.cnf

[root@ndbapi1 mysqld]# cat /etc/my.cnf
[mysqld]
ndbcluster=on
default-storage-engine=NDBCLUSTER
ndb-nodeid=49
ndb-connectstring=10.131.10.200,10.131.10.201
bind-address=0.0.0.0

pid-file=/var/run/mysqld/mysqld.pid
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

log-bin=mysql-bin
log-error=/var/log/mysqld/mysqld.log
slow_query_log = 1
slow_query_log_file=/var/log/mysqld/mysqld_slow_queries.log
long_query_time=0.1fast
expire-logs-days=4

skip-name-resolve  
sql_mode=""
default-time-zone='UTC'
server-id=1        
replicate-same-server-id=0
auto_increment_increment=1000
auto_increment_offset=1      
symbolic-links=0

max_connections=10000
open_files_limit=50000
key_buffer_size=8M
query_cache_type=1       
query_cache_size=1000M   
sort_buffer_size=500K    
read_buffer_size=2M      
read_rnd_buffer_size=500K
max_allowed_packet=2M
[14 May 2018 14:57] MySQL Verification Team
Hi,

Thanks for the report. I can't reproduce this at all, and tried with client going around the world to connect to mysqld, connected and disconnected many times. Any chance you can share source/binary that reproduces this issue?

You say this only happens if you have flaky network connection between your client and mysqld and a lot of connections and disconnections?

all best
Bogdan
[14 May 2018 15:50] Gilad Odinak
Hi Bogdan,

I upgraded to mysql 5.7.10. I does not happen anymore. I reviewed 5.7.8 source code of these files vs 5.7.10 and I don't see any material difference.

Thanks,

Gilad