Bug #61025 Race condition with CONNECT_REP in ndbmtd
Submitted: 2 May 2011 13:36
Modified: 3 May 2011 8:00
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version:
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[2 May 2011 13:36] Jonas Oreland
In ndbmtd, the "a node connected" event is detected by the CMVMI thread,
which then sends a CONNECT_REP (prio A) to QMGR.

However, there is a (very, very) low risk that a signal is transferred
to QMGR directly by the transporter before the CONNECT_REP arrives in QMGR.

This then results in an error log entry like:
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: qmgr/QmgrMain.cpp
Error object: QMGR (Line: 565) 0x00000002
Program: /path/to/ndbmtd
Pid: 21509 thr: 0
Version: mysql-5.1.56 ndb-7.0.25
Trace: ./ndb_5_trace.log.109 ./ndb_5_trace.log.109_t1 ./ndb_5_trace.log.109_t2 ./ndb_5_trace.log.109_t3

How to repeat:
Seen twice in autotest.

Suggested fix:
Ignore CM_REG_REQ until CONNECT_REP has arrived in QMGR.
This is safe as CM_REG_REQ is resent by starting node until it gets a reply.
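
The suggested fix can be illustrated with a minimal, self-contained C++ sketch. This is not the actual QMGR code: the class QmgrSketch and its member names are hypothetical and model only the ordering rule described above, namely that a CM_REG_REQ from a node is dropped until the CONNECT_REP for that node has been seen, relying on the starting node to resend the request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for the QMGR block (not the real NDB class).
class QmgrSketch {
public:
    // Models CONNECT_REP for nodeId arriving in QMGR.
    void onConnectRep(std::uint32_t nodeId) { connected_.insert(nodeId); }

    // Models CM_REG_REQ from nodeId arriving in QMGR.
    // Returns true if the request was processed, false if it was ignored.
    bool onCmRegReq(std::uint32_t nodeId) {
        if (connected_.count(nodeId) == 0) {
            // CONNECT_REP not yet seen: ignore instead of failing a
            // requirement check. The sender is expected to retry.
            ++dropped_;
            return false;
        }
        registered_.push_back(nodeId);
        return true;
    }

    std::size_t droppedCount() const { return dropped_; }
    std::size_t registeredCount() const { return registered_.size(); }

private:
    std::set<std::uint32_t> connected_;     // nodes with CONNECT_REP seen
    std::vector<std::uint32_t> registered_; // processed registration requests
    std::size_t dropped_ = 0;               // requests ignored so far
};

// Replays the race: the request overtakes CONNECT_REP once, then is resent.
// Returns true if the resent request is eventually processed.
inline bool replayRace(QmgrSketch& qmgr, std::uint32_t nodeId) {
    bool first = qmgr.onCmRegReq(nodeId);  // arrives too early, ignored
    qmgr.onConnectRep(nodeId);             // CONNECT_REP catches up
    bool second = qmgr.onCmRegReq(nodeId); // resend is processed
    return !first && second;
}
```

In this sketch the early request costs one retry rather than a node failure, which is the property the suggested fix depends on.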
[2 May 2011 13:41] Jonas Oreland
pushed to 7.0.25 and 7.1.14
[3 May 2011 8:00] Jon Stephens
Documented in the NDB-7.0.25 and 7.1.14 changelogs as follows:

        In ndbmtd, a node connect event is detected by a CMVMI thread
        which sends a CONNECT_REP signal to the QMGR kernel block. In a
        few isolated circumstances, a signal might be transferred to QMGR
        directly by the NDB transporter before the CONNECT_REP signal
        actually arrived. This resulted in reports in the error log with
        status Temporary error, restart node and the message Internal
        program error.