Bug #78130 ndb.ndb_suma_handover fail due to segmentation fault for gcov run in pb2 for 7.4
Submitted: 18 Aug 2015 14:15
Modified: 19 Aug 2015 13:11
Reporter: Mauritz Sundell
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 7.4
OS: Any
Assigned to:
CPU Architecture: Any

[18 Aug 2015 14:15] Mauritz Sundell
Description:
Ever since the gcov runs in PB2 for MySQL Cluster 7.4 trees on Feb 24 2015, the test ndb.ndb_suma_handover has failed regularly due to a segmentation fault (signal 11); see the logs below:

2015-08-18 00:26:28 [ndbd] INFO     -- DBTC instance 3: Removed node 2 from takeover queue, 0 failed nodes remaining
completing gcp 10/10 in execTAKE_OVERTCCONF
2015-08-18 00:26:28 [ndbd] INFO     -- DBTC instance 2: Removed node 2 from takeover queue, 0 failed nodes remaining
completing gcp 10/10 in execTAKE_OVERTCCONF
2015-08-18 00:26:28 [ndbd] INFO     -- NR Status: node=2,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2015-08-18 00:26:28 [ndbd] INFO     -- Node 2 has completed node fail handling
2015-08-18 00:26:29 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart ongoing
2015-08-18 00:26:40 [ndbd] INFO     -- Suma: handover to node 3 gci: 17 buckets: 00000002 (2)
17/0 (16/4294967295) switchover complete bucket 1 state: 100
shutdown handover
2015-08-18 00:26:49 [ndbd] INFO     -- Restarting system
2015-08-18 00:26:49 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Initiated by signal 11.
-----------FAILED DATA NODE OUTPUT LOG END----------

Running the test locally, one sometimes gets a crash in the call to ndb_mgm_get_latest_error_line() with a NULL handle.

(gdb) bt
#0  0x0000000001195da2 in ndb_mgm_get_latest_error_line (h=0x0)
    at /home/msundell/dev/mysql-7.4/src/storage/ndb/src/mgmapi/mgmapi.cpp:436
#1  0x000000000114ea5b in TransporterRegistry::start_clients_thread (this=0x1ea2040 <globalTransporterRegistry>)
    at /home/msundell/dev/mysql-7.4/src/storage/ndb/src/common/transporter/TransporterRegistry.cpp:2169
#2  0x000000000114c867 in run_start_clients_C (me=0x1ea2040 <globalTransporterRegistry>)
    at /home/msundell/dev/mysql-7.4/src/storage/ndb/src/common/transporter/TransporterRegistry.cpp:1836
#3  0x00000000011be120 in ndb_thread_wrapper (_ss=0x1f1fdc0)
    at /home/msundell/dev/mysql-7.4/src/storage/ndb/src/common/portlib/NdbThread.c:205
#4  0x00007fb33b863204 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fb33ab8671d in clone () from /lib64/libc.so.6

storage/ndb/src/common/transporter/TransporterRegistry.cpp:
2158                  else
2159                  {
2160                    DBUG_PRINT("info", ("mgmd close connection early"));
2161                    g_eventLogger->info
2162                      ("Management server closed connection early. "
2163                       "It is probably being shut down (or has problems). "
2164                       "We will retry the connection. %d %s %s line: %d",
2165                       ndb_mgm_get_latest_error(m_mgm_handle),
2166                       ndb_mgm_get_latest_error_desc(m_mgm_handle),
2167                       ndb_mgm_get_latest_error_msg(m_mgm_handle),
2168                       ndb_mgm_get_latest_error_line(m_mgm_handle)
2169                       );

How to repeat:
Look in PB2 for mysql-5.6-cluster-7.4.
Or run something like ./mtr --mem --gcov --repeat=10 ndb.ndb_suma_handover.

Suggested fix:
Backport the changes to the ndb_mgm_get_latest_error-functions from Bug#11760802 (SEVERAL MGMAPI FUNCTIONS RETURN 0(SUCCESS) WHEN NO HANDLE OR NOT CONNECTED), which allow these functions to be called with NULL handles without crashing.

Or test for a NULL handle in TransporterRegistry::start_clients_thread() before calling the ndb_mgm_get_latest_error-functions in the printout.
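
The two alternatives can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeMgmHandle, fake_get_latest_error_line(), fake_get_latest_error_msg() and describe_early_close() are hypothetical stand-ins, not the real MGM API or TransporterRegistry code, and the values returned for a NULL handle are placeholders.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the management handle; NOT the real NDB type.
struct FakeMgmHandle {
  int last_error;
  int last_error_line;
  const char* last_error_msg;
};

// Alternative 1: make the getters null-safe.
// The value returned for a NULL handle is a placeholder and carries no meaning.
int fake_get_latest_error_line(const FakeMgmHandle* h) {
  return h ? h->last_error_line : 0;
}

const char* fake_get_latest_error_msg(const FakeMgmHandle* h) {
  return h ? h->last_error_msg : "";
}

// Alternative 2: test for a NULL handle at the call site before
// dereferencing it to build the log message.
std::string describe_early_close(const FakeMgmHandle* h) {
  std::string text = "Management server closed connection early.";
  if (h == nullptr) {
    // No handle: skip the error details instead of reading through NULL.
    return text + " (no management handle available)";
  }
  char buf[128];
  std::snprintf(buf, sizeof(buf), " %d %s line: %d",
                h->last_error, h->last_error_msg, h->last_error_line);
  return text + buf;
}
```

Either approach avoids reading through a NULL pointer while the log arguments are evaluated; the second also keeps meaningless values out of the log.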

[19 Aug 2015 13:11] Jon Stephens
Documented fix in the NDB 7.4.8 and 7.5.0 changelogs as follows:

    The MGM API error-handling functions ndb_mgm_get_latest_error(),
    ndb_mgm_get_latest_error_msg(), and
    ndb_mgm_get_latest_error_desc() failed when used with a NULL
    handle. You should note that, although these functions are now
    null-safe, values returned in this case are arbitrary and not
    meaningful.

Also updated the descriptions of these functions in the Cluster API Guide.

Closed.