MySQL Bugs: #61025: Race condition with CONNECT

Bug #61025	Race condition with CONNECT_REP in ndbmtd
Submitted:	2 May 2011 13:36	Modified:	3 May 2011 8:00
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
In ndbmtd, the "node connected" event is detected by the CMVMI thread,
which then sends a CONNECT_REP signal (priority A) to QMGR.

However, there is a very low risk that a signal might be transferred
to QMGR directly by the transporter before the CONNECT_REP arrives in QMGR.

This then results in an error log entry like:
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: qmgr/QmgrMain.cpp
Error object: QMGR (Line: 565) 0x00000002
Program: /path/to/ndbmtd
Pid: 21509 thr: 0
Version: mysql-5.1.56 ndb-7.0.25
Trace: ./ndb_5_trace.log.109 ./ndb_5_trace.log.109_t1 ./ndb_5_trace.log.109_t2 ./ndb_5_trace.log.109_t3

How to repeat:
Seen twice in autotest.

Suggested fix:
Ignore CM_REG_REQ until CONNECT_REP has arrived in QMGR.
This is safe as CM_REG_REQ is resent by starting node until it gets a reply.
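
The suggested fix can be sketched roughly as below. This is a simplified illustration only, not the actual QMGR code: the class RegistrationGuard, the enum Action, and the handler names onConnectRep and onCmRegReq are all hypothetical, and the set of connected node IDs is an assumed stand-in for whatever per-node state QMGR really keeps.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical, simplified model of the guard. Not the real QMGR block.
enum class Action { Ignored, Processed };

class RegistrationGuard {
public:
    // Assumed hook: called when CONNECT_REP for nodeId arrives from CMVMI.
    void onConnectRep(std::uint32_t nodeId) { connected_.insert(nodeId); }

    // Assumed hook: called when CM_REG_REQ arrives from nodeId.
    Action onCmRegReq(std::uint32_t nodeId) const {
        if (connected_.count(nodeId) == 0) {
            // CONNECT_REP has not been seen yet: drop the request instead of
            // failing an internal check. The starting node resends
            // CM_REG_REQ until it gets a reply, so nothing is lost.
            return Action::Ignored;
        }
        return Action::Processed;
    }

private:
    // Node IDs for which CONNECT_REP has already arrived (assumed state).
    std::unordered_set<std::uint32_t> connected_;
};
```

In this sketch, a CM_REG_REQ that overtakes its CONNECT_REP is silently dropped, and the resent request is handled normally once CONNECT_REP has been recorded.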

Pushed to 7.0.25 and 7.1.14.

Documented in the NDB-7.0.25 and 7.1.14 changelogs as follows:

        In ndbmtd, a node connect event is detected by a CMVMI thread
        which sends a CONNECT_REP signal to the QMGR kernel block. In a
        few isolated circumstances, a signal might be transferred to QMGR
        directly by the NDB transporter before the CONNECT_REP signal
        actually arrived. This resulted in reports in the error log with
        status Temporary error, restart node and the message Internal
        program error.

Closed.