Bug #57650 LCP can crash data-node if getting transient errors
Submitted: 22 Oct 2010 5:38 Modified: 4 Nov 2010 14:25
Reporter: Jonathon Coombes
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: mysql-5.1-telco-7.0 OS: Linux
Assigned to: Jonas Oreland CPU Architecture: Any
Tags: 7.0.13, cluster, ndbfs

[22 Oct 2010 5:38] Jonathon Coombes
Description:
2010-10-21 15:44:22 [ndbd] INFO     -- Unable to store fragment during LCP. NDBFS Error: 1217
2010-10-21 15:44:22 [ndbd] INFO     -- DBLQH (Line: 13001) 0x0000000a
2010-10-21 15:44:22 [ndbd] INFO     -- Error handler shutting down system
2010-10-21 15:44:22 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-10-21 15:44:27 [ndbd] ALERT    -- Node 14: Forced node shutdown completed. Caused by error 1217: 'No message slogan found (please report a bug if you get this error code)(Unknown). Unknown'.

How to repeat:
Not enough disk space?

Suggested fix:
Supply an appropriate message
[2 Nov 2010 14:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122552

3916 Jonas Oreland	2010-11-02
      ndb - bug#57650 - add retries on transient errors of backup/lcp
[2 Nov 2010 14:57] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:jonas@mysql.com-20101102145326-mqsgv1srv7ns52db) (version source revid:jonas@mysql.com-20101102145326-mqsgv1srv7ns52db) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[2 Nov 2010 15:05] Jonas Oreland
pushed to 7.0.20 and 7.1.9
[2 Nov 2010 15:06] Jonas Oreland
DOCS: If an LCP hit a transient error (in this case 1217), it would crash
  the data node. This patch solves this by retrying the operation up to
  10 times with a 100 ms delay.
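
For readers unfamiliar with the pattern, the retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the names TransientFsError and retry_transient are hypothetical, and it assumes that "10 times" means ten attempts in total, with the 100 ms delay applied between attempts.

```python
import time


class TransientFsError(Exception):
    """Hypothetical stand-in for a transient file-system error such as 1217."""


def retry_transient(operation, max_attempts=10, delay_s=0.1, sleep=time.sleep):
    """Call operation(); on TransientFsError, wait delay_s and try again.

    Assumption: max_attempts counts every call, including the first one.
    The last error is re-raised once all attempts have failed, which
    corresponds to the point where the node would still have to give up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFsError:
            if attempt == max_attempts:
                raise  # out of attempts: propagate the error to the caller
            sleep(delay_s)  # back off briefly before the next attempt
```

A caller would wrap the failing write in a function and pass it in, for example retry_transient(store_fragment), where store_fragment is again a hypothetical name. The sleep parameter exists only so that the delay can be replaced in tests.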
[4 Nov 2010 14:25] Jon Stephens
Documented bugfix in the NDB-7.0.20 and 7.1.9 changelogs, as follows:

        Transient errors during a local checkpoint were not retried,
        leading to a crash of the data node. Now when such errors occur,
        they are retried up to 10 times if necessary.

Closed.
[15 Mar 2012 7:19] Jonas Oreland
Note: a follow-up fix was made for this bug.
It was made in 7.0.30 and 7.1.19.

/Jonas