Bug #43974 | ndb_mgmd: SIGSEGV in TransporterRegistry::updateWritePtr() when restarting nodes | |
---|---|---|---
Submitted: | 31 Mar 2009 7:55 | Modified: | 26 May 2009 8:11 |
Reporter: | Jan Wedvik | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | mysql-5.1-telco-7.0 | OS: | Solaris (Sparc/Solaris 5.10) |
Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
[31 Mar 2009 7:55]
Jan Wedvik
[31 Mar 2009 11:30]
Jan Wedvik
The dbx transcript below indicates where the race condition can be found. Thread t@12 calls TransporterRegistry::reset_send_buffer() while t@9 is executing in TransporterRegistry::prepareSend(). reset_send_buffer() sets m_send_buffers[x].m_last_page to NULL, which would cause a segmentation violation in updateWritePtr().

t@12 (l@12) terminated by signal ABRT (Abort)
0x7f5caa58: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill+0x18 ! 0x7f5caa68
Current function is JWLock::lock
  100     MY_ASSERT(mTId==voidTid || mTId==self);
(dbx) where
current thread: t@12
  [1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffbffeff, 0x0), at 0x7f5caa58
  [2] raise(0x6, 0x0, 0x7f642fd8, 0x7f5aa378, 0xffffffff, 0x6), at 0x7f565a5c
  [3] abort(0xc, 0x1, 0x2e82c0, 0xfcb78, 0x7f6413d8, 0x0), at 0x7f54194c
=>[4] JWLock::lock(this = 0x3dcdc0), line 100 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/include/transporter/TransporterRegistry.hpp"
  [5] Locker::Locker(this = 0x7f4678e0, reg = 0x3dcc98), line 74 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [6] TransporterRegistry::reset_send_buffer(this = 0x3dcc98, node = 3U), line 2098 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [7] TransporterFacade::reset_send_buffer(this = 0x46c048, node = 3U), line 190 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/TransporterFacade.hpp"
  [8] Transporter::connect_server(this = 0x4030a8, sockfd = STRUCT), line 138 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/Transporter.cpp"
  [9] TransporterRegistry::connect_server(this = 0x3dcc98, sockfd = STRUCT), line 402 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [10] MgmtSrvr::transporter_connect(this = 0x436048, sockfd = STRUCT), line 3705 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/MgmtSrvr.cpp"
  [11] MgmApiSession::transporter_connect(this = 0xd2f0f8, ctx = CLASS, args = CLASS), line 1712 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/Services.cpp"
  [12] Parser<MgmApiSession>::run(this = 0xd52350, ctx = CLASS, t = CLASS, stop = (nil)), line 205 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/include/util/Parser.hpp"
  [13] MgmApiSession::runSession(this = 0xd2f0f8), line 387 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/Services.cpp"
  [14] sessionThread_C(_sc = 0xd2f0f8), line 387 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/util/SocketServer.cpp"
  [15] ndb_thread_wrapper(_ss = 0xd882b0), line 145 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/portlib/NdbThread.c"
(dbx) frame 6
Current function is TransporterRegistry::reset_send_buffer
 2098     Locker scopeLock(this);
(dbx) print this->lock
this->lock = {
    mTId   = 9U
    mDepth = 1
}
(dbx) threads
      t@1  a  l@1   ?()                   LWP suspended in __pollsys()
      t@2  a  l@2   ndb_thread_wrapper()  sleep on 0x351f88 in __lwp_park()
      t@3  a  l@3   ndb_thread_wrapper()  sleep on 0x351f88 in __lwp_park()
      t@4  a  l@4   ndb_thread_wrapper()  sleep on 0x351f88 in __lwp_park()
      t@5  a  l@5   ndb_thread_wrapper()  LWP suspended in __pollsys()
      t@6  a  l@6   ndb_thread_wrapper()  LWP suspended in __pollsys()
      t@7  a  l@7   ndb_thread_wrapper()  LWP suspended in __pollsys()
      t@8  a  l@8   ndb_thread_wrapper()  sleep on 0xd9a248 in __lwp_park()
      t@9  a  l@9   ndb_thread_wrapper()  LWP suspended in ___nanosleep()
      t@10 a  l@10  ndb_thread_wrapper()  LWP suspended in __pollsys()
o>    t@12 a  l@12  ndb_thread_wrapper()  signal SIGABRT in __lwp_kill()
(dbx) thread t@9
Current function is TransporterRegistry::prepareSend
  658           usleep(20000);
t@9 (l@9) stopped in ___nanosleep at 0x7f5c9c44
0x7f5c9c44: ___nanosleep+0x0008: bcc,a,pt %icc,___nanosleep+0x18 ! 0x7f5c9c54
(dbx) where
current thread: t@9
  [1] ___nanosleep(0x4, 0x0, 0x4e20, 0x25d780, 0x975e0, 0x9c400), at 0x7f5c9c44
  [2] _usleep(0x4e20, 0x7f4c3a00, 0x0, 0x1, 0x1, 0x1), at 0x7f5b971c
=>[3] TransporterRegistry::prepareSend(this = 0x3dcc98, sendHandle = 0x3dcc98, signalHeader = 0x7f487c18, prio = '\001', signalData = 0x7f487c34, nodeId = 2U, ptr = 0x7f487c98), line 658 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [4] TransporterRegistry::prepareSend(this = 0x3dcc98, signalHeader = 0x7f487c18, prio = '\001', signalData = 0x7f487c34, nodeId = 2U, ptr = 0x7f487c98), line 279 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/include/transporter/TransporterRegistry.hpp"
  [5] SignalSender::sendSignal(this = 0x7f487cc8, nodeId = 2U, s = 0x7f487c18), line 2033 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/TransporterFacade.cpp"
  [6] MgmtSrvr::setEventReportingLevelImpl(this = 0x436048, nodeId_arg = 2, ll = STRUCT), line 1895 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/MgmtSrvr.cpp"
  [7] MgmtSrvr::logLevelThreadRun(this = 0x436048), line 145 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/MgmtSrvr.cpp"
  [8] MgmtSrvr::logLevelThread_C(m = 0x436048), line 96 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/mgmsrv/MgmtSrvr.cpp"
  [9] ndb_thread_wrapper(_ss = 0x33c900), line 145 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/portlib/NdbThread.c"
(dbx)
[31 Mar 2009 13:56]
Jan Wedvik
The stack traces in the previous comments were misleading, since the two threads work with different data nodes. Below is another example, where both threads access node 2, meaning that the race condition is real. So far, I have not been able to reproduce this error on Linux.

(dbx) where
current thread: t@4
  [1] ___nanosleep(0x4, 0x0, 0x4e20, 0x25d780, 0x975e0, 0x9c400), at 0x7f5c9c44
  [2] _usleep(0x4e20, 0x7f4c1200, 0x0, 0x1, 0x0, 0x1), at 0x7f5b971c
=>[3] TransporterRegistry::prepareSend(this = 0x3dcc98, sendHandle = 0x3dcc98, signalHeader = 0x7f657e34, prio = '\0', signalData = 0x7f657e50, nodeId = 2U, ptr = (nil)), line 679 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [4] TransporterRegistry::prepareSend(this = 0x3dcc98, signalHeader = 0x7f657e34, prio = '\0', signalData = 0x7f657e50, nodeId = 2U, ptr = (nil)), line 279 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/include/transporter/TransporterRegistry.hpp"
  [5] TransporterFacade::sendSignalUnCond(this = 0x46c048, aSignal = 0x7f657e34, aNode = 2U, prio = 0), line 1175 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/TransporterFacade.cpp"
  [6] ClusterMgr::threadMain(this = 0x47e048), line 311 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/ClusterMgr.cpp"
  [7] runClusterMgr_C(me = 0x47e048), line 47 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/ClusterMgr.cpp"
  [8] ndb_thread_wrapper(_ss = 0x33c750), line 145 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/portlib/NdbThread.c"
(dbx) thread t@5
t@5 (l@5) stopped in __lwp_kill at 0x7f5caa58
0x7f5caa58: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill+0x18 ! 0x7f5caa68
(dbx) where
current thread: t@5
  [1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffbffeff, 0x0), at 0x7f5caa58
  [2] raise(0x6, 0x0, 0x7f642fd8, 0x7f5aa378, 0xffffffff, 0x6), at 0x7f565a5c
  [3] abort(0x5, 0x1, 0x2e8484, 0xfcb78, 0x7f6413d8, 0x0), at 0x7f54194c
  [4] JWLock::lock(this = 0x426078), line 100 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/include/transporter/TransporterRegistry.hpp"
  [5] Locker::Locker(this = 0x7f637c54, reg = 0x3dcc98, node = 2U), line 92 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
=>[6] TransporterRegistry::reset_send_buffer(this = 0x3dcc98, node = 2U), line 2119 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [7] TransporterFacade::reset_send_buffer(this = 0x46c048, node = 2U), line 190 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/ndbapi/TransporterFacade.hpp"
  [8] Transporter::doDisconnect(this = 0x402ca0), line 288 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/Transporter.cpp"
  [9] TransporterRegistry::start_clients_thread(this = 0x3dcc98), line 1640 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [10] run_start_clients_C(me = 0x3dcc98), line 1379 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/transporter/TransporterRegistry.cpp"
  [11] ndb_thread_wrapper(_ss = 0xcd4148), line 145 in "/log/huge3/jw159207/mysql/trees/telco-7.0/storage/ndb/src/common/portlib/NdbThread.c"
(dbx)
[7 Apr 2009 5:46]
Jonas Oreland
Motivation for triage: it can happen to any ndbapi/ndb_mgmd at any given time (since it's a race condition), although the risk is low. (I3.5)
[7 Apr 2009 13:25]
Jan Wedvik
I tried implementing TransporterFacade::lock_transporter() and TransporterFacade::unlock_transporter() as:

void lock_transporter(NodeId node) { lock_mutex(); }
void unlock_transporter(NodeId node) { unlock_mutex(); }

In addition, I added lock/unlock calls to TransporterFacade::reset_send_buffer() as suggested by Magnus Blåudd:

=== modified file 'storage/ndb/src/ndbapi/TransporterFacade.hpp'
--- storage/ndb/src/ndbapi/TransporterFacade.hpp	2009-02-23 09:59:06 +0000
+++ storage/ndb/src/ndbapi/TransporterFacade.hpp	2009-04-02 12:05:46 +0000
@@ -187,7 +187,9 @@ public:
   }
   void reset_send_buffer(NodeId node) {
+    lock_mutex();
     theTransporterRegistry->reset_send_buffer(node);
+    unlock_mutex();
   }

These two changes prevented the race condition described in my comment of [31 Mar 13:56] from happening. But when I ran my test again, I still got a SIGSEGV in TransporterRegistry::updateWritePtr() after some time. So some thread must have modified *m_send_buffers without calling TransporterFacade::reset_send_buffer() or locking the mutex by other means.

I then tried to add lock/unlock in MgmtSrvr::start_transporter() and Ndb_cluster_connection_impl::connect(). These methods call TransporterFacade::start_instance(), which calls TransporterFacade::configure(), which calls TransporterRegistry::allocate_send_buffers(), which modifies *m_send_buffers. The diff for these changes is shown below. When running with these changes, I have so far not seen the SIGSEGV error, but I have not tested this enough.

=== modified file 'storage/ndb/src/mgmsrv/MgmtSrvr.cpp'
--- storage/ndb/src/mgmsrv/MgmtSrvr.cpp	2009-03-13 14:24:21 +0000
+++ storage/ndb/src/mgmsrv/MgmtSrvr.cpp	2009-04-06 10:47:44 +0000
@@ -455,15 +455,18 @@
     g_eventLogger->error("Could not create TransporterFacade.");
     DBUG_RETURN(false);
   }
-
+
+  theFacade->lock_mutex();
   if (theFacade->start_instance(_ownNodeId, config->m_configValues) < 0)
   {
     g_eventLogger->error("Failed to start transporter");
+    theFacade->unlock_mutex();
     delete theFacade;
     theFacade = 0;
     DBUG_RETURN(false);
   }
+  theFacade->unlock_mutex();

=== modified file 'storage/ndb/src/ndbapi/ndb_cluster_connection.cpp'
--- storage/ndb/src/ndbapi/ndb_cluster_connection.cpp	2009-03-14 20:42:04 +0000
+++ storage/ndb/src/ndbapi/ndb_cluster_connection.cpp	2009-04-07 11:48:39 +0000
@@ -741,11 +741,14 @@
     if(props == 0)
       break;
 
+    m_transporter_facade->lock_transporter(nodeId);
     if (m_transporter_facade->start_instance(nodeId, props) < 0)
     {
+      m_transporter_facade->unlock_transporter(nodeId);
       ndb_mgm_destroy_configuration(props);
       DBUG_RETURN(-1);
     }
+    m_transporter_facade->unlock_transporter(nodeId);
 
     if (init_nodes_vector(nodeId, *props))
     {
[9 Apr 2009 12:23]
Magnus Blåudd
Can't really understand why we would need to protect all of 'TransporterFacade::start_instance'; that feels deadlock-prone in combination with the other patches. I also find it "impossible" for anyone to be using the send buffer at that point, although it might be the arbitrator thread. Need to look if there is anything else. If it should be protected, I think it can be done inside the function.
[8 May 2009 10:19]
Magnus Blåudd
We will reset the send buffers when the state of the transporter goes to CONNECTING; that way it does not need to be locked at all. Below, I have removed two calls to reset_send_buffer() since the send buffer should be empty at that time. But to catch any problems with someone writing into the buffer before being connected, they should be added back with a "should_be_empty" flag so that we can catch any problem early. Just need to check if those calls are being made from the correct place; I think they should be made later in those two functions.

Index: 43974/storage/ndb/src/common/transporter/Transporter.cpp
===================================================================
--- 43974.orig/storage/ndb/src/common/transporter/Transporter.cpp	2009-04-14 18:57:50.000000000 +0200
+++ 43974/storage/ndb/src/common/transporter/Transporter.cpp	2009-04-14 19:35:37.000000000 +0200
@@ -134,8 +134,6 @@ Transporter::connect_server(NDB_SOCKET_T
   {
     DBUG_RETURN(false); // TODO assert(0);
   }
-
-  get_callback_obj()->reset_send_buffer(remoteNodeId);
 
   my_socket_connect_address(sockfd, &m_connect_address);
@@ -202,8 +200,6 @@ Transporter::connect_client(NDB_SOCKET_T
   DBUG_PRINT("info",("server port: %d, isMgmConnection: %d",
                      m_s_port, isMgmConnection));
 
-  get_callback_obj()->reset_send_buffer(remoteNodeId);
-
   // Send "hello"
   DBUG_PRINT("info", ("Sending own nodeid: %d and transporter type: %d",
                       localNodeId, m_type));
@@ -285,6 +281,5 @@ Transporter::doDisconnect() {
 
   m_connected= false;
-  get_callback_obj()->reset_send_buffer(remoteNodeId);
 
   disconnectImpl();
 }
Index: 43974/storage/ndb/src/common/transporter/TransporterRegistry.cpp
===================================================================
--- 43974.orig/storage/ndb/src/common/transporter/TransporterRegistry.cpp	2009-04-14 19:35:26.000000000 +0200
+++ 43974/storage/ndb/src/common/transporter/TransporterRegistry.cpp	2009-04-16 12:50:23.000000000 +0200
@@ -1338,6 +1338,9 @@ TransporterRegistry::do_connect(NodeId n
   }
   DBUG_ENTER("TransporterRegistry::do_connect");
   DBUG_PRINT("info",("performStates[%d]=CONNECTING",node_id));
+
+  callbackObj->reset_send_buffer(node_id);
+
   curr_state= CONNECTING;
   DBUG_VOID_RETURN
[12 May 2009 10:23]
Magnus Blåudd
Patch sent for review
[15 May 2009 10:01]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74176
[15 May 2009 13:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74213
[15 May 2009 13:48]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74218
[15 May 2009 13:49]
Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:magnus.blaudd@sun.com-20090515134810-n4cjydwoyh15yoft) (version source revid:magnus.blaudd@sun.com-20090515134810-n4cjydwoyh15yoft) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[20 May 2009 11:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74584
[20 May 2009 11:28]
Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:magnus.blaudd@sun.com-20090520111817-1jru7arp6h4wk0gp) (version source revid:magnus.blaudd@sun.com-20090520111817-1jru7arp6h4wk0gp) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[26 May 2009 8:11]
Jon Stephens
Documented bugfix in the NDB-7.0.6 changelog as follows: Repeated starting and stopping of data nodes could cause ndb_mgmd to fail. This issue was observed on Solaris/SPARC.