| Bug #73015 | Node restarts while another node is already starting may crash 'master' | | |
|---|---|---|---|
| Submitted: | 16 Jun 2014 11:32 | Modified: | 20 Jun 2014 15:19 |
| Reporter: | Ole John Aske | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
| Version: | 7.3.6 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[16 Jun 2014 13:55]
Ole John Aske
Posted by developer:
Adding some text from the commit comments that may be helpful
when later documenting this fix:
...............
A regression was introduced in the fix for bug#16007980
DATA NODE STUCK IN PHASE 1 WHEN OTHER NODE LOSES NETWORK
That fix added the following code to Qmgr::failReportLab():
+ /**
+ * If any node is starting now (c_start.startNode != 0)
+ * sendPrepFailReq to that too
+ */
+ if (c_start.m_startNode != 0)
+ {
+ jam();
+ cfailedNodes[cnoFailedNodes++] = c_start.m_startNode;
+ c_start.reset();
+ }
However, we could already have been notified about the
failure of the same node through any of the other 'channels'
which handle failures or disconnects. Thus, we could end
up with duplicates of the same NodeId in cfailedNodes[].
Later, this 'list of nodes' is converted into a BitMask
used in the PREP_FAILREQ-signal, and converted back into
a 'list of nodes' by Qmgr::execPREP_FAILREQ(). During this
BitMask conversion, the duplicated NodeId is eliminated.
However, **the 'noOfNodes' count is kept unchanged**.
Thus we end up with a materialized 'list of nodes' where
the size is off-by-one, and the last item contains garbage.
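The round trip described above can be illustrated with a minimal sketch. It uses std::bitset as a stand-in for the NDB bitmask; kMaxNodes, NodeMask, packNodes, unpackNodes and staleSlots are hypothetical names chosen for illustration and are not taken from the NDB source.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; not the NDB types.
constexpr std::size_t kMaxNodes = 49;
using NodeMask = std::bitset<kMaxNodes>;

// Pack a list of node ids into a bitmask.
// A node id that appears twice in the list still sets only one bit.
NodeMask packNodes(const std::vector<unsigned>& nodes) {
  NodeMask mask;
  for (unsigned id : nodes) {
    assert(id < kMaxNodes);
    mask.set(id);
  }
  return mask;
}

// Unpack the mask back into a list of node ids, in ascending order.
std::vector<unsigned> unpackNodes(const NodeMask& mask) {
  std::vector<unsigned> nodes;
  for (unsigned id = 0; id < kMaxNodes; ++id) {
    if (mask.test(id)) nodes.push_back(id);
  }
  return nodes;
}

// Number of trailing slots a receiver would read past the real data
// if it trusted the sender's list length instead of recounting the
// bits that survived the conversion.
std::size_t staleSlots(const std::vector<unsigned>& sent) {
  return sent.size() - packNodes(sent).count();
}
```

With a list holding the same node twice, for example {3, 3}, the sender's count is 2 but only one bit survives, so staleSlots reports 1: that is the slot which ends up holding garbage. Recounting from the mask on the receiving side, or rejecting duplicates before appending to the list, would both keep the count and the list in agreement; which of these the actual fix uses is not stated in this report.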
[20 Jun 2014 15:19]
Jon Stephens
Documented fix as follows in the NDB 7.1.32, 7.2.17, and 7.3.6 changelogs:
Processing a NODE_FAILREP signal that contained an invalid node
ID could cause a data node to fail. Regression of BUG#16007980.
Closed.
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.
If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at
http://dev.mysql.com/doc/en/installing-source.html

Description:
This is a regression introduced by the fix for bug#16007980
DATA NODE STUCK IN PHASE 1 WHEN OTHER NODE LOSES NETWORK

This regression introduced a situation where a node might crash
while handling the Ndbcntr::execNODE_FAILREP signal:

Thread 1 (Thread 0x7f614cfc8700 (LWP 6950)):
#0 clear (this=0x1e3a790, signal=0x7f614cfbd240) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/include/util/Bitmask.hpp:332
#1 clear (this=0x1e3a790, signal=0x7f614cfbd240) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/include/util/Bitmask.hpp:1178
#2 clear (this=0x1e3a790, signal=0x7f614cfbd240) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/include/util/Bitmask.hpp:1185
#3 Ndbcntr::execNODE_FAILREP (this=0x1e3a790, signal=0x7f614cfbd240) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/kernel/blocks/ndbcntr/NdbcntrMain.cpp:2071
#4 0x00000000006fd78a in executeFunction (selfptr=0x7f614e23b360, q=<value optimized out>, h=<value optimized out>, r=<value optimized out>, sig=0x7f614cfbd240, max_signals=100, signalIdCounter=0x7f614cfc7dbc) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1069
#5 execute_signals (selfptr=0x7f614e23b360, q=<value optimized out>, h=<value optimized out>, r=<value optimized out>, sig=0x7f614cfbd240, max_signals=100, signalIdCounter=0x7f614cfc7dbc) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/kernel/vm/mt.cpp:3689
#6 0x00000000006fdb57 in run_job_buffers (selfptr=0x7f614e23b360, sig=0x7f614cfbd240, signalIdCounter=0x7f614cfc7dbc) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/kernel/vm/mt.cpp:3774
#7 0x0000000000700138 in mt_job_thread_main (thr_arg=0x7f614e23b360) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/kernel/vm/mt.cpp:4500
#8 0x00000000006aad9e in ndb_thread_wrapper (_ss=0x1ca0960) at /space/autotest/build/clone-mysql-5.5-cluster-7.2-2014-06-15.18159/storage/ndb/src/common/portlib/NdbThread.c:201
#9 0x00007f61503af851 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f614f33b11d in clone () from /lib64/libc.so.6

---> signal NdbcntrMain.cpp 02941
--------------- Signal ----------------
r.bn: 251 "NDBCNTR", r.proc: 2, r.sigId: 227112 gsn: 26 "NODE_FAILREP" prio: 1
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 227108 length: 5 trace: 8 #sec: 0 fragInf: 0
H'00000003 H'00000002 H'00000002 H'00000009 H'00000000

The crash happens in the following code:
...........
Uint32 nodeId = 0;
while(!allFailed.isclear()){
  nodeId = allFailed.find(nodeId+1);
  allFailed.clear(nodeId); << Crash
  signal->theData[1] = nodeId;
  sendSignal(CMVMI_REF, GSN_EVENT_REP, signal, 3, JBB);
}//for
..........

Further debugging shows that the bit corresponding to nodeId=0
has been set in the allFailed BitMask which is sent as part of
the NODE_FAILREP signal. This is an illegal nodeId, and should
never have been set (garbage?).

How to repeat:
./testNodeRestart -n Bug42422 -l 1 T1
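
To see why a stray bit 0 is a problem for the loop quoted in the description, here is a minimal sketch of that scan built on std::bitset. It rests on two assumptions that are not confirmed by this report: that the search returns a not-found sentinel when no bit is set at or after the start position, and that the word H'00000009 in the signal dump is the node bitmask (bits 0 and 3). The names kNotFound, findFrom and drainFailed are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only; not the NDB types.
constexpr std::size_t kMaxNodes = 49;
constexpr unsigned kNotFound = ~0u;  // assumed "no bit found" sentinel
using NodeMask = std::bitset<kMaxNodes>;

// Return the first set bit at position >= from, or kNotFound.
unsigned findFrom(const NodeMask& mask, unsigned from) {
  for (unsigned id = from; id < kMaxNodes; ++id) {
    if (mask.test(id)) return id;
  }
  return kNotFound;
}

// Walk the mask like the loop in the description: start at nodeId = 0
// and always search from nodeId + 1, so bit 0 is never visited.
// Unlike that loop, stop when the search comes back empty instead of
// clearing an out-of-range position.
// Returns the number of nodes reported; strayBits is set to true when
// bits remain that the scan could not reach.
unsigned drainFailed(NodeMask allFailed, bool& strayBits) {
  unsigned reported = 0;
  unsigned nodeId = 0;
  strayBits = false;
  while (allFailed.any()) {
    nodeId = findFrom(allFailed, nodeId + 1);
    if (nodeId == kNotFound) {
      strayBits = true;  // e.g. bit 0 is set, so the mask never empties
      break;
    }
    assert(nodeId < kMaxNodes);
    allFailed.reset(nodeId);
    ++reported;
  }
  return reported;
}
```

With bits 0 and 3 set, the scan reports node 3, then finds nothing further while the mask is still non-empty. Under the assumptions above, the unguarded loop would pass the sentinel to clear() at that point, which would match the crash location shown in the stack trace.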