Bug #55715 multiple node failures caused by DBTC/DIH timeout of COMMIT
Submitted: 3 Aug 2010 14:53 Modified: 12 Aug 2010 7:25
Reporter: Matthew Montgomery
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: mysql-5.1-telco-6.3 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[3 Aug 2010 14:53] Matthew Montgomery
Description:
While another node is failing, DBTC can time out waiting for DIH to signal COMMIT of pending transactions. The current behavior is to crash the node when this timeout is reached. This crash is likely unnecessary; see "Suggested fix".

How to repeat:
Set the heartbeat timeout to a high value (e.g. 5 seconds) and deadlock detection to its minimum value. Then crash a node silently while transactions are running; this should hit the crash code.

Suggested fix:
[12:51:28] <mronstrom> magnus, this crash is really unnecessary as the comment implies:
[12:51:29] <mronstrom> // To ensure against strange bugs we crash the system if we have passed
[12:51:30] <mronstrom> // time-out period by a factor of 10 and it is also at least 5 seconds.
[12:52:13] <mronstrom> the reason why I put the crash there was that at the time it was necessary to also safeguard against the GCP protocol hanging in the GCP_PREPARE phase
[12:53:43] <mronstrom> so obviously this crash could occur in some cases when it shouldn't occur since we're still handling a node failure
[12:54:33] <mronstrom> simplest fix is to simply remove the crash code and replace it by a printout when it happens and there is no blockage due to node failures
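
To make the quoted comment concrete, here is a minimal sketch of the safeguard it describes, next to the "log it once" behavior proposed as the replacement. This is not the DBTC source: every name below (CommitWaitState, old_policy, new_policy, the constants) is hypothetical, and the exact comparison operators are an assumption based only on the wording of the comment.

```cpp
#include <cstdint>

// Hypothetical illustration only; none of these names come from the NDB code.
enum class TimeoutAction { KeepWaiting, LogOnce, Crash };

struct CommitWaitState {
    std::uint64_t elapsed_ms;   // time spent waiting for the commit signal
    std::uint64_t timeout_ms;   // configured deadlock-detection timeout
    bool already_logged;        // set once the timeout has been reported
};

// Values taken from the quoted comment: a factor of 10, and at least 5 seconds.
constexpr std::uint64_t kTimeoutFactor = 10;
constexpr std::uint64_t kMinElapsedMs = 5000;

// True when the wait has exceeded the safeguard described in the comment.
// Using >= for both comparisons is an assumption.
bool past_safeguard(const CommitWaitState& s) {
    return s.elapsed_ms >= kTimeoutFactor * s.timeout_ms &&
           s.elapsed_ms >= kMinElapsedMs;
}

// Behavior reported in this bug: crash once the safeguard is exceeded.
TimeoutAction old_policy(const CommitWaitState& s) {
    return past_safeguard(s) ? TimeoutAction::Crash
                             : TimeoutAction::KeepWaiting;
}

// Proposed behavior: report the timeout a single time and keep running.
TimeoutAction new_policy(CommitWaitState& s) {
    if (!past_safeguard(s) || s.already_logged) {
        return TimeoutAction::KeepWaiting;
    }
    s.already_logged = true;
    return TimeoutAction::LogOnce;
}
```

With a very small deadlock-detection timeout, the factor-of-10 condition is met almost immediately, so the 5-second floor becomes the effective crash threshold. If detecting the failed node takes longer than that, old_policy returns Crash while node-failure handling is still in progress, which is the situation described under "How to repeat".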
[3 Aug 2010 14:55] MySQL Verification Team
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: 
Error object: DBTC (Line: 6597) 0x0000000a
Program: /usr/mysql/libexec/ndbd
Pid: 19102
Trace: /user/database/log/ndb_10_trace.log.9
Version: mysql-5.1.44 ndb-6.3.33-GA
***EOM***
[11 Aug 2010 9:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115463

3245 Jonas Oreland	2010-08-11
      ndb - bug#55715 - don't crash on timeout in CS_PREPARE_TO_COMMIT, but simply log it once...
[11 Aug 2010 9:38] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.47-ndb-6.3.36 (revid:jonas@mysql.com-20100811093036-c1h4p730hkrfy1kg) (version source revid:jonas@mysql.com-20100811093036-c1h4p730hkrfy1kg) (merge vers: 5.1.47-ndb-6.3.36) (pib:20)
[11 Aug 2010 9:38] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.47-ndb-7.0.17 (revid:jonas@mysql.com-20100811093545-3huy9nqun7g2yit7) (version source revid:jonas@mysql.com-20100811093430-i42i0auiuox2lwv3) (merge vers: 5.1.47-ndb-7.0.17) (pib:20)
[11 Aug 2010 9:45] Jonas Oreland
pushed to 6.3.36, 7.0.17 and 7.1.6
[12 Aug 2010 7:25] Jon Stephens
Documented bugfix in the NDB-6.3.36, 7.0.17, and 7.1.6 changelogs, as follows:

        When another data node failed, a given data node DBTC kernel
        block could time out while waiting for DBDIH to signal commits
        of pending transactions, leading to a crash. Now in such cases
        the timeout generates a printout, and the data node continues to
        operate.

Closed.