Bug #45096 Forced node shutdown upon restart with full data area
Submitted: 26 May 2009 13:03 Modified: 20 Jan 2016 10:27
Reporter: Guido Ostkamp Email Updates:
Status: Closed    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0    OS: Solaris
Assigned to: MySQL Verification Team CPU Architecture:Any

[26 May 2009 13:03] Guido Ostkamp
Description:
Hello,

during stability tests we are facing the following critical startup failure with a full data area:

*****************************************************************************
...
memmanagerlock waiting for lock, contentions: 2400 spins: 49473860
jbalock waiting for lock, contentions: 200 spins: 176660
2009-05-26 14:45:53 [ndbd] INFO     -- Killed by node 3 as copyfrag failed, error: 827
2009-05-26 14:45:53 [ndbd] INFO     -- NDBCNTR (Line: 260) 0x0000000a
2009-05-26 14:45:53 [ndbd] INFO     -- Error handler startup shutting down system
2009-05-26 14:45:53 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-05-26 14:45:53 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-05-26 14:45:57 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error
 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, rest
*****************************************************************************

Interestingly, the error message at the end is truncated in the log file.

This situation recurs on each startup (we tried twice).

We are using bzr revid jonas@mysql.com-20090524191743-7gc0xl8kmnpmluux dated Sun 2009-05-24 21:17:43 +0200 compiled on Solaris Sparc with
CC=cc CXX=CC CFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" CXXFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" LDFLAGS="-xbinopt=prepare" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=max --without-docs --without-man

Full logs will be uploaded shortly.

How to repeat:
1. Create records in DB tables until DB is full on first node (nodeid=2)
2. Put some background 'insert' load on the nodes
3. kill -9 repeatedly on second node (nodeid=3)
4. Problem occurs
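Assuming a standard two-data-node NDB test setup, the steps above can be sketched roughly as follows. The database, table, and column names are placeholders (they do not appear in the original report), and the exact process-matching pattern for the kill depends on how ndbd was started:

```shell
# 1. Fill the cluster until DataMemory on the first data node (nodeid=2)
#    is exhausted; inserts start failing with NDB error 827.
#    test.t1 and its 'payload' column are illustrative names.
while mysql -e "INSERT INTO test.t1 (payload) VALUES (REPEAT('x', 1000))"; do :; done

# 2. Keep some background insert load running (expected to fail while full).
while :; do
    mysql -e "INSERT INTO test.t1 (payload) VALUES (REPEAT('x', 1000))"
    sleep 1
done &

# 3. Repeatedly kill -9 the second data node (nodeid=3); the angel process
#    restarts it, and the node restart then dies in start phase 5 with
#    "copyfrag failed, error: 827".
pkill -9 -f 'ndbd.*ndb-nodeid=3'
```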
[26 May 2009 13:05] Guido Ostkamp
Full logs uploaded to FTP server, file 'bug-data-45096.tar.bz2'.
[26 May 2009 13:06] Guido Ostkamp
Just for your info:

We tried an additional startup with '--initial', but it failed as well.
[1 Jun 2009 12:58] Henrik Ingo
Hi Guido

Thank you for this report. Just to confirm that I've understood correctly:

Question 1:

"2. Put some background 'insert' load on the nodes"
Just wanted to know what happens here? Do the inserts fail because the DB is full?

Question 2:

In the end the node doesn't start even when there is no insert load anymore? And this is both with and without --initial?

Question 3:

Since 1 node is still alive and should contain all data, does it help if you:
 - delete some data first
  - (alternatively if this was unacceptable, you could also restart the node with more DataMemory allocated)
 - then let the other node join the cluster
[3 Jun 2009 14:17] Guido Ostkamp
Hello Henrik,

here are the answers to your questions:

Question 1:

The inserts fail (as expected).

Question 2:

I retried the tests. After stopping the load, it is still not possible to restart the second node (forced node shutdown occurs).

I then stopped the first node as well and restarted it (which was possible). After it came up, restarting the second node was still unsuccessful. While the restart of the second node was running, we entered a strange state in which it was no longer possible to use the mysql shell at all, even on the first node (after a "use <dbname>" command, the mysql shell hung).

Question 3:

Deleting might be unacceptable (in the case of customer data, we are not allowed to do that), but it was also technically impossible: a trigger fired on the delete attempt, and the update caused by the trigger could not be executed due to the lack of space.

I expect restart with changed configs to work, but this is not a feasible solution.

Regards

Guido Ostkamp
[20 Jan 2016 10:27] MySQL Verification Team
Hi,

Testing with the latest 7.4.9, I can reach the described situation where the second node won't start after the database is full, but contrary to what you encountered with 7.0, you can:
 - shut down the mgm node(s)
 - change your config, increasing DataMemory
 - start the mgm node(s)
 - start the second node with --initial
 - when the second node is fully started, restart the first data node
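The recovery steps above can be sketched as a command sequence. The node IDs (1 for the management node, 2 and 3 for the data nodes), the config path, and the DataMemory value are illustrative assumptions matching the two-data-node setup in the report, not values taken from it:

```shell
# Stop the management node (assumed nodeid=1); data nodes keep running.
ndb_mgm -e "1 STOP"

# In config.ini, raise DataMemory for the data nodes, e.g.:
#   [ndbd default]
#   DataMemory = 2G     # illustrative value, larger than before

# Restart the management node so it rereads the changed config.
ndb_mgmd -f /path/to/config.ini --reload

# Start the second data node from scratch; with --initial it wipes its
# local data area and rebuilds from the surviving node.
ndbd --ndb-nodeid=3 --initial

# Once node 3 shows "Started" in the ndb_mgm SHOW output, restart the
# first data node so it also picks up the larger DataMemory.
ndb_mgm -e "2 RESTART"
```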

I tried this twice on 7.4.9 and it worked both times without a hitch.

kind regards
Bogdan Kecman