MySQL Bugs: #46412: NDBRequire hit in Dbdih::invalidateLcpInfoAfterSr

Bug #46412	NDBRequire hit in Dbdih::invalidateLcpInfoAfterSr
Submitted:	27 Jul 2009 18:02	Modified:	18 Aug 2009 15:03
Reporter:	Andrew Hutchings
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	6.3.24	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
During a cluster restart, a node hit the ndbrequire in Dbdih::invalidateLcpInfoAfterSr during start phase 4.  I believe this is because a node is marked as being in an LCP when it shouldn't be.

Time: Monday 27 July 2009 - 13:45:59
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdih/DbdihMain.cpp
Error object: DBDIH (Line: 9186) 0x0000000a
Program: /usr/mysql/libexec/ndbd
Pid: 18674
Trace: /user/database/log/ndb_4_trace.log.2
Version: mysql-5.1.32 ndb-6.3.24-GA

DBDIH   000658 000694 000769 014798 014798 014778 014774 014778
        014774 014778 014774 014778 014774 014774 014774 014798
        014798 014798 014798 014798 014798 014798 014798 014798
        014798 014798 014798 014798 014798 014798 014798 014798
        014798 014798 014798 014798 014798 014798 014798 014798
        014798 014798 014798 014798 014798 014798 014798 014798
        014798 014798 014798 014843 014843 014833 014833 014833
        014833 014833 014833 014833 014833 014833 014833 014843
        014843 014843 014843 014843 014843 014843 014843 014843
        014843 014843 014843 014843 014843 014843 014843 014843
        014843 014843 014843 014843 014843 014843 014843 014843
        014843 014843 014843 014843 014843 014843 014843 014843
        014843 014843 014843 014854 014858 014854 014858 014854
        014858 014854 014858 014854 014858 014854 014854 014854
        014854 014854 014854 014854 014854 014854 014854 014854
        014854 014854 014854 014854 014854 014854 014854 014854
        014854 014854 014854 014854 014854 014854 014854 014854
        014854 014854 014854 014854 014854 014854 014854 014854
        014854 014854 014854 014854 014854 014854 014854 014854
        014854 014896 014896 014896 014896 014896 014896 014896
        014896 014896 014896 014896 014896 014896 014896 014896
        014896 014896 014896 014896 014896 014896 014896 014896
        014896 014896 014896 014896 014896 014896 014896 014896
        014896 014896 014896 014896 014896 014896 014896 014896
        014896 014896 014896 014896 014896 014896 014896 014896
        014896 000778 009172 009175 009196 009172 009175 009196
        009172 009175 009188 009172 009172 009175 009188 009172
        009172 009175 009188 009172 009172 009175 009188 009172
        009172 009175 009186
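
The jump trace above is hard to read because of its long runs of repeated entries. As a reading aid only, the following Python sketch collapses consecutive duplicates into (entry, count) pairs. It assumes each entry is a source line number in dbdih/DbdihMain.cpp, which the final entry 009186 matching "Line: 9186" in the error report suggests. The function names `parse_jump_trace` and `collapse_runs` are hypothetical helpers, not part of any NDB tool, and the sample holds only the first line and the last two lines of the dump.

```python
from itertools import groupby


def parse_jump_trace(text):
    """Return the numeric entries of a trace dump as a list of ints."""
    entries = []
    for token in text.split():
        # Skip non-numeric tokens such as the block name "DBDIH".
        if token.isdigit():
            entries.append(int(token))
    return entries


def collapse_runs(entries):
    """Run-length encode consecutive identical entries as (value, count)."""
    return [(value, len(list(group))) for value, group in groupby(entries)]


# Shortened excerpt: first line and last two lines of the dump above.
sample = """
DBDIH   000658 000694 000769 014798 014798 014778 014774 014778
        009172 009175 009188 009172 009172 009175 009188 009172
        009172 009175 009186
"""

entries = parse_jump_trace(sample)
runs = collapse_runs(entries)

for value, count in runs:
    print(f"{value:06d} x{count}")
```

Run against the full dump, the same helpers would reduce the long 014798, 014843, 014854 and 014896 stretches to one line each, leaving the 009172/009175/009188 cycle that precedes the failing entry easier to spot.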

How to repeat:
Unknown

Does restarting the node manually (possibly with --initial) solve the problem?

The cluster was restored from backup before --initial was tried and the problem could not be reproduced since.

Reproduced using 2 new error inserts.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/80966

3010 Jonas Oreland	2009-08-18
      ndb - bug#46412
        Fix/handle incorrectly set lcp-bits during system restart

pushed to 6.3.26 and 7.0.7

docs:
1) lcp starts
2) master dies almost directly afterwards
3) rest of cluster dies within 1-2s
4) crash when restarting

Documented bugfix in the NDB-6.3.26 and 7.0.7 changelogs as follows:

      Killing MySQL Cluster nodes immediately following a local checkpoint could
      lead to a crash of the cluster when later attempting to perform a system
      restart.

      The exact sequence of events causing this issue was as follows:

          1. Local checkpoint occurs.

          2. Immediately following the LCP, kill the master data node.

          3. Kill the remaining data nodes within a few seconds of killing the
          master.

          4. Attempt to restart the cluster.