Bug #46873 ndbmtd: race-condition with drop table / start lcp
Submitted: 23 Aug 2009 22:54
Modified: 20 Sep 2009 19:43
Reporter: Jonathon Coombes
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: mysql-5.1-telco-7.0
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: 7.0.6, cluster, dbdih

[23 Aug 2009 22:54] Jonathon Coombes
Description:
The cluster has 2 data nodes. The node log shows heavy lock contention, followed by "Neglecting to save Table" messages and then a crash.
..........
jbalock waiting for lock, contentions: 3690000 spins: 1882864984
jbalock waiting for lock, contentions: 3695000 spins: 1885811301
2009-08-20 22:51:36 [ndbd] INFO     -- This is the last table
2009-08-20 22:51:36 [ndbd] INFO     -- TS_DROPPING - Neglecting to save Table: 37 Frag: 1 -
2009-08-20 23:03:45 [ndbd] INFO     -- TS_DROPPING - Neglecting to save Table: 25 Frag: 3 -
2009-08-20 23:03:57 [ndbd] INFO     -- dbdih/DbdihMain.cpp
2009-08-20 23:03:57 [ndbd] INFO     -- DBDIH (Line: 12454) 0x0000000a
2009-08-20 23:03:57 [ndbd] INFO     -- Error handler shutting down system
2009-08-20 23:03:57 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-08-20 23:04:09 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-08-20 23:04:10 [ndbd] INFO     -- Angel pid: 23822 ndb pid: 23823
NDBMT: MaxNoOfExecutionThreads=4
NDBMT: workers=2 threads=2
2009-08-20 23:04:10 [ndbd] INFO     -- NDB Cluster -- DB node 4
2009-08-20 23:04:10 [ndbd] INFO     -- mysql-5.1.34 ndb-7.0.6 --
.........

Examining the code, the error relates to:

bool
Dbdih::reportLcpCompletion(const LcpFragRep* lcpReport)
....
  ndbrequire(replicaPtr.p->lcpOngoingFlag == true);

How to repeat:
-
[24 Aug 2009 12:13] Jonas Oreland
With ndbmtd, it can happen that 2 nodes get different views of which tables
should be in the LCP (if there is a "parallel" drop table), causing a node to
crash later during the LCP.
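
The divergence described above can be illustrated with a small, hypothetical C++ model. This is a sketch, not the NDB source: only lcpOngoingFlag and reportLcpCompletion are names taken from this report, while Replica, Node, startLcp and simulate are invented for illustration, and table ids 25 and 37 are borrowed from the log. It assumes each node marks the replicas of the tables it believes are in the LCP, and that a completion report for a replica that was never marked is what trips the failed check.

```cpp
#include <map>
#include <set>

// Hypothetical per-table replica state; only the flag name comes from the report.
struct Replica {
    bool lcpOngoingFlag = false;
};

// Hypothetical, heavily simplified stand-in for one data node's DIH state.
struct Node {
    std::map<int, Replica> replicas;  // keyed by table id

    // Mark every table this node believes is part of the LCP.
    void startLcp(const std::set<int>& tables) {
        for (int t : tables) {
            replicas[t].lcpOngoingFlag = true;
        }
    }

    // Returns false where the real code would fail
    // ndbrequire(replicaPtr.p->lcpOngoingFlag == true) and shut the node down.
    bool reportLcpCompletion(int table) {
        auto it = replicas.find(table);
        if (it == replicas.end() || !it->second.lcpOngoingFlag) {
            return false;
        }
        it->second.lcpOngoingFlag = false;
        return true;
    }
};

// Runs the scenario; returns true if both nodes accept the completion report.
// dropSeenByB models node B having already processed the DROP TABLE for
// table 37 when the LCP starts, while node A has not.
bool simulate(bool dropSeenByB) {
    std::set<int> tablesA = {25, 37};
    std::set<int> tablesB = {25, 37};
    if (dropSeenByB) {
        tablesB.erase(37);
    }

    Node a;
    Node b;
    a.startLcp(tablesA);
    b.startLcp(tablesB);

    // A fragment of table 37 finishes its checkpoint and the report
    // reaches both nodes.
    bool okA = a.reportLcpCompletion(37);
    bool okB = b.reportLcpCompletion(37);
    return okA && okB;
}
```

In this model, simulate(false) returns true because both nodes agree on the table list, while simulate(true) returns false because node B receives a report for a table it never marked as being in the LCP.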
[15 Sep 2009 12:22] Jonas Oreland
Bug fixed; running tests on it overnight.
[19 Sep 2009 6:44] Jonas Oreland
pushed to 7.0.8 and 7.1
[20 Sep 2009 19:43] Jon Stephens
Documented bugfix in the NDB-7.0.8 changelog as follows:

        When using ndbmtd, a parallel DROP TABLE operation could cause
        data nodes to have different views of which tables should be
        included in local checkpoints; this could lead to a node failure
        during the LCP.

Closed.
[27 Sep 2009 15:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84749

3046 Jonas Oreland	2009-09-27
      ndb - bug#46873 - small bug in previous bug-fix
[27 Sep 2009 15:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84750

3047 Jonas Oreland	2009-09-27
      ndb - bug#46873 - small bug in previous bug-fix
[30 Sep 2009 8:14] Bugs System
Pushed into 5.1.37-ndb-7.0.9 (revid:jonas@mysql.com-20090930075942-1q6asjcp0gaeynmj) (version source revid:jonas@mysql.com-20090927152829-gbavcppruago3tbl) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)
[30 Sep 2009 8:15] Bugs System
Pushed into 5.1.35-ndb-7.1.0 (revid:jonas@mysql.com-20090930080049-1c8a8cio9qgvhq35) (version source revid:jonas@mysql.com-20090927153301-86ynjzcg4vl33x1m) (merge vers: 5.1.35-ndb-7.1.0) (pib:11)