Bug #19020 Failure of one cluster storage node can cascade into failure of another storage
Submitted: 11 Apr 2006 18:40
Modified: 12 Apr 2006 11:41
Reporter: David Dawe
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.0.19 (max)
OS: Sparc Solaris 10
Assigned to: Assigned Account
CPU Architecture: Any

[11 Apr 2006 18:40] David Dawe
Description:
Using mysql-max-5.0.19-solaris10-sparc-64bit.tar.gz.

In testing recovery from a storage node failure, I found that sometimes (about once every 4 to 6 times that I killed off one node) the second storage node would also fail.  This problem appears to be closely related to http://bugs.mysql.com/bug.php?id=18349.  However, bug 18349 refers to the error message being incorrect, whereas this bug refers to the problem of the second node shutting down (and causing possible loss of data).

The test configuration used was:
  - 2 machines with a storage node and SQL node each, and
  - a third machine running a management node.

If records are being inserted into the database at the time when the second node fails, there can be data loss.

How to repeat:
- Induce the first storage node to fail by sending it either a SIGKILL or SIGSEGV.
- Repeat this several times (waiting for the node to restart and re-join the cluster each time), until the second storage node also fails.

This problem also occurs when the kills are spread over a long period (as opposed to being done in quick succession).
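The steps above can be sketched as a small driver loop. This is only a minimal sketch, not the procedure actually used for this report: the function names (kill_node, run_kill_cycles) and the three callables are hypothetical, and how you detect that a node has re-joined or died (for example by inspecting the management client's status output) is left to the caller.

```python
import os
import signal
import time
from typing import Callable, Optional


def kill_node(pid: int, sig: int) -> None:
    """Send a signal to a storage node process.

    The report used SIGKILL or SIGSEGV, e.g. kill_node(pid, signal.SIGSEGV).
    How the pid is obtained is up to the caller (assumption).
    """
    os.kill(pid, sig)


def run_kill_cycles(
    kill_first: Callable[[], None],
    first_rejoined: Callable[[], bool],
    second_alive: Callable[[], bool],
    max_cycles: int = 20,
    poll_seconds: float = 1.0,
    rejoin_timeout: float = 600.0,
) -> Optional[int]:
    """Kill the first node repeatedly, waiting for it to re-join each time.

    Returns the 1-based cycle in which the second node was found dead,
    or None if it survived all max_cycles kills.
    """
    for cycle in range(1, max_cycles + 1):
        kill_first()
        deadline = time.monotonic() + rejoin_timeout
        while True:
            # Check the surviving node first: its failure is what we are hunting.
            if not second_alive():
                return cycle
            if first_rejoined():
                break
            if time.monotonic() >= deadline:
                raise TimeoutError(f"first node did not re-join in cycle {cycle}")
            time.sleep(poll_seconds)
    return None
```

Passing the checks in as callables keeps the loop independent of how node state is obtained, so it can be dry-run with stand-ins before being pointed at a real cluster.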

Suggested fix:
If the cluster does not require that the second node shut itself down upon detecting several failures of the first node, then the second node should be allowed to keep running.  However, if the second node must shut itself down in this situation, then it should first write all its data to disk (I guess that would be a local checkpoint).
[11 Apr 2006 18:46] Miguel Solorzano
Changing Category to Server: Cluster.
[11 Apr 2006 18:59] Jonas Oreland
Hi

Could this be http://bugs.mysql.com/bug.php?id=18298 ?

Please upload all trace/error logs + cluster log so that I can verify.

/Jonas
[11 Apr 2006 19:05] David Dawe
Extracted from output log file

Attachment: ndb_4_out.log (application/octet-stream, text), 1.59 KiB.

[11 Apr 2006 19:07] David Dawe
Extracted from error log file

Attachment: ndb_4_error.log (application/octet-stream, text), 393 bytes.

[11 Apr 2006 19:14] Jonas Oreland
bzip2 ?
[11 Apr 2006 19:17] Miguel Solorzano
Could you please upload the file at:

ftp://ftp.mysql.com:/pub/mysql/upload

Zip it into a file with a name that identifies this bug report,
e.g. bug19020.zip

Thanks in advance.
[11 Apr 2006 19:19] David Dawe
Compressed trace log

Attachment: ndb_4_trace.log.11.bz2 (application/octet-stream, text), 25.92 KiB.

[11 Apr 2006 19:30] David Dawe
It does seem related to http://bugs.mysql.com/bug.php?id=18298 ... there are some ordered unique indexes in the database.  While I usually experience the problem after only 4 to 6 kills, my most recent test did not experience it until after 8 kills.  During that test, there was no database activity.
[11 Apr 2006 20:00] Jonas Oreland
Hi,

I checked the log and am currently 99% sure that this is it.
If you want to be 100% sure you can run
>ndb_show_tables 
locate table 3 in output and verify that it is an index.
(using the exact same cluster as the one that crashed...)

I'm closing this as duplicate,
please reopen if you disagree.

/Jonas
[12 Apr 2006 11:41] David Dawe
Table 3 is indeed an index:

3     OrderedIndex         Online   No      c4_ddawe_ndb def      PRIMARY

Thanks.
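
The check described above (find table 3 in the ndb_show_tables output and confirm it is an index) can also be scripted. The following is a minimal sketch that assumes the whitespace-separated layout seen in the row quoted above, with the object id in the first column and the object type in the second. The helper names and the set of type names treated as indexes are assumptions, not part of the tool.

```python
from typing import Optional

# Type names treated as indexes (assumption; "OrderedIndex" is the one
# that appears in this report).
INDEX_TYPES = {"OrderedIndex", "UniqueHashIndex"}


def find_object_type(listing: str, object_id: int) -> Optional[str]:
    """Return the type column for the row whose first column is object_id."""
    wanted = str(object_id)
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and anything without an id and a type.
        if len(fields) >= 2 and fields[0] == wanted:
            return fields[1]
    return None


def is_index(listing: str, object_id: int) -> bool:
    """True if the listed object exists and its type is an index type."""
    return find_object_type(listing, object_id) in INDEX_TYPES


if __name__ == "__main__":
    row = "3     OrderedIndex         Online   No      c4_ddawe_ndb def      PRIMARY"
    print(is_index(row, 3))
```

The listing would be the captured output of ndb_show_tables from the same cluster that crashed.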