Bug #90761 | api node crashes under heavy connect disconnect | |
---|---|---|---|
Submitted: | 5 May 2018 8:10 | Modified: | 14 May 2018 15:50 |
Reporter: | Gilad Odinak | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: NDB API | Severity: | S3 (Non-critical) |
Version: | 7.5.8 | OS: | CentOS |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[5 May 2018 8:10]
Gilad Odinak
[5 May 2018 15:42]
Gilad Odinak
Additional logging and work around bug
Attachment: NdbScanOperation.cpp (text/x-c++src), 121.32 KiB.
[5 May 2018 15:45]
Gilad Odinak
Sample output from additional logging:

In receiver_delivered: idx (3) > last (1)
In receiver_completed: idx (3) > last (0)
In receiver_completed: last (-1) < 0, m_sent_receivers_count 0
In receiver_completed: setting error Err_scanAlreadyComplete
[10 May 2018 19:58]
MySQL Verification Team
Hi,

I'm failing to understand what is actually going on. Is your client crashing? Are you doing a connect/disconnect for every db query (your synopsis says "heavy connect disconnect"), or do you only reconnect if you lose the connection while running a lot of queries? Can you please explain what you actually do when you get a crash, and how this new mutex affected you?

You say that you work around the problem by checking if "m_sent_receivers_count" is 0, but if it is 0 and you still run NdbReceiver * move = m_sent_receivers[last]; with last being m_sent_receivers_count - 1, then you are obviously out of bounds?

thanks
Bogdan
[11 May 2018 17:34]
Gilad Odinak
Hi Bogdan,

I'm sorry I wasn't too clear. The crash is in mysqld running on an API node. It happens when there are connectivity problems between the application (using the mysql client library) and the API node. What I think happens is that the tcp connection drops mid transmission. I cannot reproduce it when connectivity is reliable.

The attached modified source file is a bit confusing because I added both logging and some checks to prevent the crash.

Here I only perform the assignment if m_sent_receivers_count is greater than zero (so last is greater than or equal to zero, which is fine):

if (m_sent_receivers_count > 0) {
  NdbReceiver * move = m_sent_receivers[last];
  ....

And here I only set the error code if last is less than zero (but since last is an unsigned value it always holds a non-negative value; for example -1 manifests itself as 0xffffffff):

if (last >= 0x7ffffffe) {
  my_safe_printf_stderr("In %s: setting error Err_scanAlreadyComplete\n", __FUNCTION__);
  setErrorCodeAbort(Err_scanAlreadyComplete);
}

Thanks
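For reference, here is a minimal standalone sketch of the two guards described above. Every name in it (the function, the receivers array, sent_count, and the error constant) is a stand-in invented for this sketch; it is not the real NDB API code or the actual patch in the attached file.

```cpp
// Standalone model of the workaround described above.
// NOTE: receiver_completed_model(), receivers[], sent_count and
// ERR_SCAN_ALREADY_COMPLETE are stand-ins for this sketch only.
#include <cstdint>
#include <cstdio>

struct Receiver { uint32_t list_index; };

static const int ERR_SCAN_ALREADY_COMPLETE = 4120;   // placeholder value
static Receiver* receivers[8];                        // models m_sent_receivers
static uint32_t  sent_count = 0;                      // models m_sent_receivers_count
static int       error_code = 0;

void receiver_completed_model(Receiver* rec)
{
  uint32_t idx  = rec->list_index;
  uint32_t last = sent_count - 1;          // wraps to 0xffffffff when sent_count == 0

  // Guard 1: only move the tail receiver when the count is positive,
  // so 'last' is a valid index into the array.
  if (idx != last && sent_count > 0) {
    Receiver* move = receivers[last];
    receivers[idx] = move;
    move->list_index = idx;
  }

  // Guard 2: 'last' is unsigned, so "last < 0" can never be true; a wrapped
  // value shows up as a huge positive number instead. Treat that as
  // "scan already complete" rather than updating the count.
  if (last >= 0x7ffffffeU) {
    std::fprintf(stderr, "In %s: setting error (scan already complete)\n", __func__);
    error_code = ERR_SCAN_ALREADY_COMPLETE;
    return;
  }

  sent_count = last;
}

int main()
{
  Receiver r{0};
  // Simulate a stray completion arriving after the scan is already drained.
  sent_count = 0;
  receiver_completed_model(&r);
  std::printf("error_code = %d, sent_count = %u\n", error_code, sent_count);
  return 0;
}
```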
[12 May 2018 3:03]
MySQL Verification Team
Hi Gilad,

> The crash is in mysqld running on an API node.

?! So mysqld is crashing? ("API node" on its own is an unclear term: there are NDB API applications connecting to the cluster, there are data nodes, management nodes and SQL nodes, and there are API slots for applications using ndbapi to connect, but "API node" by itself does not pin down which one you mean.)

> application (using mysql client library) connecting to the API node

This is unclear. Your application is connecting to mysqld using the mysqlclient library?

> connectivity problems between the application (using mysql client library) connecting to the API node.

I'm not sure where the connectivity problem is. Between mysqld and the cluster, or between your client app and the mysqld server?

I need more data about the crash. Is there a log on the crashed mysqld, a core file?

all best
Bogdan
[12 May 2018 5:08]
Gilad Odinak
Here's the tail of the stack trace:

/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xf73a1b]
/usr/sbin/mysqld(handle_fatal_signal+0x461)[0x82ab01]
/lib64/libpthread.so.0(+0xf5e0)[0x7f9105ce15e0]
/usr/sbin/mysqld(_ZN16NdbScanOperation18receiver_deliveredEP11NdbReceiver+0x2a)[0x1370e1a]
/usr/sbin/mysqld(_ZN14NdbTransaction19receiveSCAN_TABCONFEPK12NdbApiSignalPKjj+0x192)[0x1368442]

NdbTransaction::receiveSCAN_TABCONF calls NdbScanOperation::receiver_delivered. That call is at line 149 below, in mysql-cluster-gpl-7.5.8/storage/ndb/src/ndbapi/NdbTransactionScan.cpp:

147:   else if (tOp->execSCANOPCONF(tcPtrI, totalLen, opCount))
148:   {
149:     theScanningOp->receiver_delivered(tOp);
150:     retVal = 0;
151:   }

It looks like receiver_delivered accesses an invalid memory address at line 1619 below:

1610: void
1611: NdbScanOperation::receiver_delivered(NdbReceiver* tRec){
1612:   if(theError.code == 0){
...
1616:     Uint32 idx = tRec->m_list_index;
1617:     Uint32 last = m_sent_receivers_count - 1;
1618:     if(idx != last){
1619:       NdbReceiver * move = m_sent_receivers[last];
1620:       m_sent_receivers[idx] = move;
1621:       move->m_list_index = idx;
1622:     }
1623:     m_sent_receivers_count = last;
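To make the suspected mechanism concrete, here is a tiny standalone snippet showing why line 1619 can read out of bounds once m_sent_receivers_count has already reached 0. The local names are stand-ins for illustration, not the actual member variables.

```cpp
// Demonstrates the unsigned wrap-around behind the suspected crash:
// with a count of 0, "count - 1" does not become -1 but 0xffffffff,
// so using it as an array index reads far outside the array.
#include <cstdint>
#include <cstdio>

int main()
{
  uint32_t sent_count = 0;                 // models m_sent_receivers_count
  uint32_t last = sent_count - 1;          // wraps to 4294967295 (0xffffffff)

  std::printf("last = %u (0x%x)\n", last, last);

  void* receivers[4] = {};                 // models m_sent_receivers
  if (last < sizeof(receivers) / sizeof(receivers[0])) {
    std::printf("in bounds\n");
  } else {
    std::printf("index %u is out of bounds -> invalid memory access\n", last);
  }
  return 0;
}
```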
[12 May 2018 5:14]
Gilad Odinak
The API node (whose crash back trace is above) is a machine with 64 GB RAM, running mysqld with this my.cnf:

[root@ndbapi1 mysqld]# cat /etc/my.cnf
[mysqld]
ndbcluster=on
default-storage-engine=NDBCLUSTER
ndb-nodeid=49
ndb-connectstring=10.131.10.200,10.131.10.201
bind-address=0.0.0.0
pid-file=/var/run/mysqld/mysqld.pid
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-bin=mysql-bin
log-error=/var/log/mysqld/mysqld.log
slow_query_log = 1
slow_query_log_file=/var/log/mysqld/mysqld_slow_queries.log
long_query_time=0.1
expire-logs-days=4
skip-name-resolve
sql_mode=""
default-time-zone='UTC'
server-id=1
replicate-same-server-id=0
auto_increment_increment=1000
auto_increment_offset=1
symbolic-links=0
max_connections=10000
open_files_limit=50000
key_buffer_size=8M
query_cache_type=1
query_cache_size=1000M
sort_buffer_size=500K
read_buffer_size=2M
read_rnd_buffer_size=500K
max_allowed_packet=2M
[14 May 2018 14:57]
MySQL Verification Team
Hi,

Thanks for the report. I can't reproduce this at all; I tried with a client going around the world to connect to mysqld, connecting and disconnecting many times. Any chance you can share source/binary that reproduces this issue? You say this only happens if you have a flaky network connection between your client and mysqld and a lot of connections and disconnections?

all best
Bogdan
[14 May 2018 15:50]
Gilad Odinak
Hi Bogdan,

I upgraded to MySQL Cluster 7.5.10. It does not happen anymore. I reviewed the 7.5.8 source code of these files vs 7.5.10 and I don't see any material difference.

Thanks,
Gilad