MySQL Bugs: #19020: Failure of one cluster storage node can cascade into failure of another storage

Bug #19020	Failure of one cluster storage node can cascade into failure of another storage
Submitted:	11 Apr 2006 18:40	Modified:	12 Apr 2006 11:41
Reporter:	David Dawe
Status:	Duplicate
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.19 (max)	OS:	Sparc Solaris 10
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Using mysql-max-5.0.19-solaris10-sparc-64bit.tar.gz.

In testing recovery from a storage node failure, I found that sometimes (about once in every 4 to 6 kills of one node) the second storage node would also fail.  This problem appears to be closely related to http://bugs.mysql.com/bug.php?id=18349.  However, bug 18349 concerns an incorrect error message, whereas this report concerns the second node shutting down (and the possible loss of data).

The test configuration used was:
  - 2 machines with a storage node and SQL node each, and
  - a third machine running a management node.

If records are being inserted into the database at the time when the second node fails, there can be data loss.

How to repeat:
- Induce the first storage node to fail by sending it either a SIGKILL or SIGSEGV.
- Repeat this several times (waiting for the node to restart and re-join the cluster each time), until the second storage node also fails.

The problem also occurs when the kills are spread over a long period (rather than performed in quick succession).
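
The repeat steps above could be automated along the following lines. This is only a sketch, not the script used for the report: `kill_until_cascade` and the three callables it takes (`get_pid`, `first_node_started`, `second_node_alive`) are hypothetical names, and how they obtain the process id or the node status (for example by asking the management client) is left to the tester.

```python
import os
import signal
import time
from typing import Callable


def kill_until_cascade(
    get_pid: Callable[[], int],
    first_node_started: Callable[[], bool],
    second_node_alive: Callable[[], bool],
    max_kills: int = 20,
    sig: int = signal.SIGKILL,
    poll_seconds: float = 5.0,
    send_signal: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Kill the first storage node repeatedly.

    Returns the number of kills after which the second node was found
    down, or 0 if it survived max_kills kills. All callables are
    hypothetical hooks supplied by the tester.
    """
    for kills in range(1, max_kills + 1):
        # Induce the failure of the first storage node.
        send_signal(get_pid(), sig)
        while True:
            # Give the cluster time to notice the failure and react.
            sleep(poll_seconds)
            if not second_node_alive():
                return kills  # the failure cascaded to the second node
            if first_node_started():
                break  # first node has rejoined; kill it again
    return 0
```

Passing `sig=signal.SIGSEGV` covers the other signal mentioned in the steps, and a larger `poll_seconds` spreads the kills over a longer period.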

Suggested fix:
If the cluster does not require that the second node shut itself down upon detecting several failures of the first node, then the second node should be allowed to keep running.  However, if the second node must shut itself down in this situation, then it should first write all its data to disk (I guess that would be a local checkpoint).

Changing Category to Server: Cluster.

Hi

Could this be http://bugs.mysql.com/bug.php?id=18298 ?

Please upload all trace/error logs + cluster log so that I can verify.

/Jonas

Extracted from output log file

Attachment: ndb_4_out.log (application/octet-stream, text), 1.59 KiB.

Extracted from error log file

Attachment: ndb_4_error.log (application/octet-stream, text), 393 bytes.

bzip2 ?

Could you please upload the file at:

ftp://ftp.mysql.com:/pub/mysql/upload

Please zip it into a file with a name that identifies this bug report,
e.g. bug19020.zip

Thanks in advance.

Compressed trace log

Attachment: ndb_4_trace.log.11.bz2 (application/octet-stream, text), 25.92 KiB.

It does seem related to http://bugs.mysql.com/bug.php?id=18298 ... there are some ordered unique indexes in the database.  While I usually experience the problem after only 4 to 6 kills, my most recent test did not experience it until after 8 kills.  During that test, there was no database activity.

Hi,

I checked the log and am currently 99% sure that this is it.
If you want to be 100% sure, you can run
>ndb_show_tables
locate table 3 in the output, and verify that it is an index
(using the exact same cluster as the one that crashed...).

I'm closing this as duplicate,
please reopen if you disagree.

/Jonas

Table 3 is indeed an index:

3     OrderedIndex         Online   No      c4_ddawe_ndb def      PRIMARY

Thanks.
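
The check described above (find table 3 in the ndb_show_tables output and confirm it is an index) could be scripted roughly as follows. This is a sketch under assumptions: the column order (id, type, state, logging, database, schema, name) is inferred from the single row quoted in this report, and `parse_row` and `is_index` are hypothetical helper names.

```python
from typing import NamedTuple, Optional


class TableRow(NamedTuple):
    table_id: int
    type: str
    state: str
    logging: str
    database: str
    schema: str
    name: str


def parse_row(line: str) -> Optional[TableRow]:
    """Parse one whitespace-separated row; return None for headers or other lines."""
    fields = line.split()
    if len(fields) < 7 or not fields[0].isdigit():
        return None
    # Assumed column order, inferred from the row quoted in the report.
    return TableRow(int(fields[0]), fields[1], fields[2], fields[3],
                    fields[4], fields[5], " ".join(fields[6:]))


def is_index(output: str, table_id: int) -> Optional[bool]:
    """Return True/False if the table id is found, None if it is absent."""
    for line in output.splitlines():
        row = parse_row(line)
        if row is not None and row.table_id == table_id:
            # Treats any type name ending in "Index" (e.g. OrderedIndex) as an index.
            return row.type.endswith("Index")
    return None


sample = "3     OrderedIndex         Online   No      c4_ddawe_ndb def      PRIMARY"
print(is_index(sample, 3))
```

Rows that do not follow the assumed seven-column layout are skipped rather than guessed at, so a None result means the id was not found in a recognizable row.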